Overview of Speech and Language Models

This is a cheat sheet for speech and language model architectures.

Speech & Language Model Architectures

A visual reference guide · 10 architectures

Foundational · 2017

Transformer

Vaswani et al. · 2017
Diagram. Encoder: input embedding + positional encoding → multi-head self-attention → add & LayerNorm → 2-layer feed-forward MLP → add & LayerNorm, repeated ×N. Decoder: output embedding + PE → masked multi-head self-attention → add & LayerNorm → cross-attention (K, V from encoder) → add & LayerNorm → FFN + LayerNorm → linear + softmax.
"Attention is All You Need." Scaled dot-product attention: Q·Kᵀ/√d softmaxed and applied to V. Multi-head projects into h subspaces. Positional encoding adds sequence order. Foundation for all modern LLMs.
Attn Heads: 8 · d_model: 512 · Complexity: O(L²) · Year: 2017
Tags: Self-Attention · Cross-Attention · Positional Enc
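In NumPy, the scaled dot-product step can be sketched directly (a toy illustration with arbitrary sizes, not the paper's implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core Transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (L_q, L_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d_k = 4, 8                     # toy sizes; the paper uses d_model=512, h=8 heads
Q, K, V = (rng.standard_normal((L, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (L, d_k) and np.allclose(w.sum(axis=-1), 1.0)
```

Multi-head attention simply runs this h times on learned per-head projections and concatenates the results.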
NLU · Encoder-Only

BERT

Google · 2018
Diagram. Input: [CLS] tok1 [MASK] tok3 [SEP] → token + segment + position embeddings → Transformer encoder ×12/24 (bidirectional self-attention, feed-forward at 4× hidden size, LayerNorm + residual) → MLM head and NSP head; [CLS] feeds the classifier. MLM masks 15% of tokens; NSP operates on sentence pairs.
Bidirectional encoder trained via Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Sees full context in both directions. [CLS] token used for classification. Fine-tuned for NLU tasks.
Params: 110M/340M · Direction: Bidirectional · Layers: 12/24 · Pre-train: MLM+NSP
Tags: Encoder-Only · Bidirectional · MLM · Fine-tune
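The MLM corruption rule (15% of tokens selected; of those, 80% masked, 10% randomized, 10% kept) can be sketched as follows; the vocabulary and sentence are made up for illustration:

```python
import random

def mlm_corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption (illustrative sketch): select ~15% of tokens as
    prediction targets; of those, 80% become [MASK], 10% a random token, and
    10% stay unchanged. Special tokens are never selected."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_rate:
            continue
        targets[i] = tok                       # the model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # random replacement
        # else: token left intact but still a prediction target
    return corrupted, targets

sent = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corr, tgt = mlm_corrupt(sent, vocab=["dog", "ran", "hat"], seed=3)
```

The 10% "keep unchanged" case forces the model to build useful representations even for unmasked positions.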
NLG · Decoder-Only

GPT

OpenAI · 2018–present
Diagram. Causal (left-to-right) prediction: "The cat sat → ?" under a causal mask. Decoder-only stack ×N layers: masked self-attention (causal) → feed-forward network → LayerNorm (pre-norm from GPT-2 onward) → linear → softmax → next token. Versions: GPT-1 117M, GPT-2 1.5B, GPT-3 175B, GPT-4 ~1.8T (unconfirmed); RLHF (InstructGPT) led to ChatGPT and GPT-4o.
Decoder-only autoregressive model with causal masking—each token attends only to previous tokens. Trained by next-token prediction. Scales predictably. RLHF alignment (InstructGPT) enables instruction following.
Paradigm: Autoregressive · Attention: Causal · Training: Next Token · Align: RLHF
Tags: Decoder-Only · Causal LM · Scaling Laws · RLHF
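The causal mask is just a strictly upper-triangular block of −∞ added to the attention scores before the softmax; a minimal NumPy sketch:

```python
import numpy as np

def causal_softmax(scores):
    """Apply GPT's causal mask: token i attends only to positions <= i."""
    L = scores.shape[0]
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)   # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)           # future positions get -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)           # rows still sum to 1

w = causal_softmax(np.zeros((4, 4)))
# row 0 attends only to itself; row 3 spreads uniformly over all four positions
assert np.allclose(w[0], [1, 0, 0, 0]) and np.allclose(w[3], 0.25)
```

Because exp(−∞) = 0, each token's attention distribution is confined to the past, which is what makes next-token training and autoregressive sampling consistent.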
ASR · Streaming

Conformer-RNNT

Google · 2020
Diagram. Audio → Conformer encoder blocks (FF → norm → multi-head attention → convolution module with depthwise conv → FF → norm) → encoder output → RNN-T prediction network + joint network → tokens.
Combines convolution and Transformer-style self-attention modules in each encoder block: the convolution captures local acoustic features while attention captures global context. Paired with an RNN-T decoder for end-to-end streaming ASR.
Type: Encoder · Task: ASR/STT · Streaming: Yes · Loss: RNN-T
Tags: Conformer · Attention · Conv · RNN-T
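The convolution module's key ingredient, a depthwise (per-channel) 1-D convolution over frames, can be sketched as follows; the kernel and input are illustrative:

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Per-channel ('depthwise') 1-D convolution with same-length output: the
    core of the Conformer convolution module. x: (frames, channels)."""
    L, C = x.shape
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))              # zero-pad along time
    out = np.stack([np.convolve(xp[:, c], kernel, mode="valid") for c in range(C)],
                   axis=1)
    return out[:L]                                    # trim in case of even-sized kernels

frames = np.tile(np.sin(np.linspace(0, 6, 40))[:, None], (1, 3))  # 40 frames, 3 channels
smooth = depthwise_conv1d(frames, np.array([0.25, 0.5, 0.25]))    # local smoothing kernel
assert smooth.shape == frames.shape
```

Because each channel gets its own small kernel, the cost is linear in channels, which is why the conv module is cheap relative to attention.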
SSL · ASR

Wav2Vec 2.0

Meta FAIR · 2020
Diagram. Raw audio → CNN encoder (7 conv layers) → Transformer context network (12–24 layers) trained by masked prediction, with a product-quantization module supplying discrete targets → CTC / fine-tuned output.
Self-supervised learning on raw audio. A CNN extracts latent features; a Transformer learns contextual representations via masked prediction. A quantization module creates discrete speech units for contrastive training.
Pre-train: Self-Supervised · Input: Raw Audio · Layers: 12–24 · Loss: Contrastive
Tags: Self-Supervised · CNN · Transformer · CTC
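The contrastive objective at one masked timestep can be sketched as an InfoNCE loss over cosine similarities (a simplified illustration; wav2vec 2.0 additionally uses a diversity loss and sampled distractors):

```python
import numpy as np

def contrastive_loss(c_t, candidates, true_idx, temperature=0.1):
    """wav2vec 2.0-style InfoNCE at one masked timestep (illustrative sketch):
    the context vector c_t must pick out the true quantized latent among
    distractors, scored by cosine similarity."""
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(c_t)
    sims = (candidates @ c_t) / norms / temperature
    log_z = np.log(np.exp(sims - sims.max()).sum()) + sims.max()  # stable log-sum-exp
    return -(sims[true_idx] - log_z)      # = -log softmax prob of the true latent

rng = np.random.default_rng(0)
cands = rng.standard_normal((10, 16))     # 1 true quantized latent + 9 distractors
loss = contrastive_loss(cands[3], cands, true_idx=3)
assert loss >= 0.0                        # log-sum-exp always dominates the true logit
```

Minimizing this loss pushes the contextual representation toward the true discrete speech unit and away from the distractors.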
ASR · Multitask

Whisper

OpenAI · 2022
Diagram. Log-Mel spectrogram → encoder (Conv1D ×2 + sinusoidal PE → Transformer self-attention blocks) → decoder (learned PE → masked self-attention → cross-attention → linear + softmax), conditioned on special tokens such as <|lang|>, <|task|>, <|notimestamps|>.
Encoder-decoder trained on 680K hrs of weakly-supervised web audio. Conv1D stem → Transformer encoder → cross-attention decoder. Multitask: ASR, translation, language detection, timestamp prediction via special tokens.
Training: 680K hrs · Languages: 99 · Variants: Tiny→Large · Input: Log-Mel
Tags: Enc-Dec · Multitask · Cross-Attn · Multilingual
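The multitask conditioning works through the decoder's special-token prefix; a small sketch of how that prompt is assembled (the token strings match Whisper's vocabulary, the helper function itself is illustrative):

```python
def whisper_prompt(language="en", task="transcribe", timestamps=False):
    """Build the special-token prefix that conditions Whisper's decoder.
    task is 'transcribe' or 'translate' (translate: source language -> English)."""
    toks = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        toks.append("<|notimestamps|>")       # suppress timestamp prediction
    return toks

prompt = whisper_prompt(language="fr", task="translate")
# ['<|startoftranscript|>', '<|fr|>', '<|translate|>', '<|notimestamps|>']
```

Swapping one token in the prefix switches the whole model between transcription, translation, and timestamped output, with no architecture change.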
SSM · Selective

Mamba

Gu & Dao · 2023
Diagram. Input x → Mamba SSM block: two linear projections (signal path and gate z) → Conv1d → input-dependent Δ, B, C → state-space update h′ = Āh + B̄x → SiLU gating (×) → linear output projection.
State Space Model with selective (input-dependent) state transitions. Linear-time complexity O(L) unlike Transformer's O(L²). Parameters Δ, B, C are computed from input, enabling the model to selectively remember or forget.
Complexity: O(L) · State Size: N=16 · Attention: None (SSM only) · Inference: Recurrent
Tags: SSM · Linear Time · Selective · Hardware-Aware
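A toy selective scan showing the input-dependent Δ, B, C and the O(L) recurrence (a single-channel sketch with made-up projections, not the hardware-aware CUDA kernel):

```python
import numpy as np

def selective_scan(x, A, w_delta=1.0):
    """Toy single-channel selective SSM (Mamba-flavoured sketch): step size
    Delta_t and projections B_t, C_t all depend on the input x_t, so the state
    can selectively absorb or ignore each token. Diagonal A, zero-order hold:
        h_t = exp(Delta_t * A) * h_{t-1} + Delta_t * B_t * x_t,  y_t = C_t . h_t
    Runtime is O(L) in sequence length, versus attention's O(L^2)."""
    N = len(A)
    h, ys = np.zeros(N), []
    for x_t in x:
        delta = np.log1p(np.exp(w_delta * x_t))   # softplus keeps Delta_t > 0
        B_t = np.full(N, x_t)                     # input-dependent B (toy projection)
        C_t = np.full(N, x_t)                     # input-dependent C (toy projection)
        h = np.exp(delta * A) * h + delta * B_t * x_t
        ys.append(float(C_t @ h))
    return np.array(ys)

A = -np.linspace(0.5, 2.0, 16)                    # stable (negative) diagonal state matrix
y = selective_scan(np.sin(np.linspace(0, 3, 50)), A)
assert y.shape == (50,) and np.all(np.isfinite(y))
```

A time-invariant SSM would fix Δ, B, C once; making them functions of x_t is exactly what "selective" means, at the cost of losing the convolutional training trick (hence Mamba's custom scan kernel).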
Multimodal · SOTA

Gemini

Google DeepMind · 2023–2024
Diagram. Modality encoders (image: ViT/SigLIP; audio: USM audio tokenizer; text; video: frame encoder) → Gemini Transformer (multi-query attention, RoPE embeddings, gated MLP with GeGLU, sliding-window attention) → outputs: text, code, images. Variants: Ultra / Pro / Flash / Nano.
Natively multimodal architecture trained from scratch on text, image, audio, and video. Uses Multi-Query Attention, RoPE, GeGLU activations, and Sliding Window Attention. Mixture-of-Experts in Ultra variant.
Modalities: 4 · Context: 1M+ tokens · MoE: Ultra · Attention: MQA+SWA
Tags: Multimodal · MoE · RoPE · GeGLU
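Multi-Query Attention can be sketched as h query heads all reading from one shared key/value head (shapes and weights here are illustrative, not Gemini's):

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv):
    """Multi-Query Attention sketch: H separate query heads share ONE key/value
    head, so the inference-time KV cache shrinks by a factor of H."""
    H, d, dh = Wq.shape
    K, V = x @ Wk, x @ Wv                           # single shared K/V: (L, dh)
    heads = []
    for h in range(H):
        Q = x @ Wq[h]                               # per-head queries: (L, dh)
        s = Q @ K.T / np.sqrt(dh)
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)
    return np.concatenate(heads, axis=1)            # (L, H*dh)

rng = np.random.default_rng(0)
L, d, H = 6, 32, 4
x = rng.standard_normal((L, d))
Wq = rng.standard_normal((H, d, d // H)) * 0.1
Wk = rng.standard_normal((d, d // H)) * 0.1
Wv = rng.standard_normal((d, d // H)) * 0.1
out = multi_query_attention(x, Wq, Wk, Wv)
assert out.shape == (L, d)
```

The cached K and V are H times smaller than in standard multi-head attention, which is what makes long-context decoding cheaper.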
Streaming · Real-time

SpeedStream

Conceptual Architecture
Diagram. Streaming audio chunks t₁, t₂, t₃ → chunk encoder with rolling state buffer (h_prev → h_cur) → local attention window (chunk-wise; sparse attention, causal mask) → linear projection → CTC decode → output pipeline: partial hypothesis → VAD detection → final transcript.
A conceptual streaming ASR architecture: fixed-size audio chunks are processed with a rolling state buffer for continuity, a local attention window avoids full-sequence recomputation, and VAD gating suppresses silence. Partial hypotheses are emitted in real time, targeting sub-100 ms latency.
Latency: <100 ms · Attention: Local/Sparse · Decode: CTC · VAD: Built-in
Tags: Streaming · Chunk-wise · CTC · VAD · Real-time
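The chunk-plus-rolling-buffer loop can be sketched as follows; since this is a conceptual design, the encoder and CTC decoder are stand-ins (a windowed mean):

```python
import numpy as np

def stream_transcribe(audio, chunk_size=160, left_context=80):
    """Chunk-wise streaming sketch: each new chunk is processed together with a
    rolling buffer of `left_context` trailing frames from earlier audio, so
    nothing outside the local window is ever recomputed. The encoder + CTC
    decoder are stand-ins here, emitting one 'partial hypothesis' per chunk."""
    state = np.empty(0)
    partials = []
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        window = np.concatenate([state, chunk])  # local attention window
        partials.append(float(window.mean()))    # stand-in for encode + CTC decode
        state = window[-left_context:]           # rolling state buffer
    return partials

hyps = stream_transcribe(np.sin(np.linspace(0, 20, 1000)))
assert len(hyps) == int(np.ceil(1000 / 160))
```

Latency is bounded by the chunk duration plus per-chunk compute, independent of how long the stream has been running.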
★ SOTA 2025
SALM · Speech-Augmented LLM · #1 Open ASR Leaderboard

NVIDIA Canary-Qwen 2.5B

NVIDIA · July 17, 2025 · 5.63% WER · 418× Real-Time
Diagram. Audio input (16 kHz mono WAV or FLAC) → FastConformer encoder (canary-1b-flash: 8× sub-sampling conv, 32 Conformer blocks, local + strided attention) → speech representation h_audio → linear projection + LoRA adaptation → audio tokens concatenated with text tokens ("Transcribe: ⟨audio⟩") → Qwen3-1.7B LLM decoder (causal self-attention with GQA, SwiGLU feed-forward, RMSNorm + RoPE) → linear → vocab softmax → ASR transcript or LLM summary/QA.
The current #1 open-source ASR model (HuggingFace Open ASR Leaderboard, July 2025). A Speech-Augmented Language Model (SALM) that fuses a FastConformer encoder (32 blocks, 8× subsampling, strided local attention) with a full Qwen3-1.7B LLM decoder via a linear projection layer and LoRA adaptation. The audio encoder maps speech to the same embedding space as text tokens — both are then concatenated and fed to the LLM, enabling two operating modes: pure ASR transcription, and LLM-powered tasks (summarization, Q&A) directly from audio. Trained on 234,000 hours of speech data across 26 datasets. Not Mamba-based — the trend is toward LLM-decoder hybrids rather than SSMs for ASR.
WER (avg): 5.63% · LibriSpeech Clean: 1.6% · RTFx Speed: 418× · Params: 2.5B · Training Data: 234K hrs · Encoder: FastConformer ×32
Tags: SALM · FastConformer · Qwen3 LLM · LoRA · Dual-Mode · SOTA 2025 · NeMo
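The SALM fusion step (projecting speech features into the LLM's embedding space and concatenating them with prompt tokens) can be sketched as follows; the projection and shapes are illustrative, not Canary-Qwen's actual dimensions:

```python
import numpy as np

def salm_decoder_input(audio_feats, prompt_embeds, W_proj):
    """SALM fusion sketch: project encoder speech features into the LLM's
    token-embedding space, then concatenate them with the text-prompt
    embeddings so the decoder attends over one mixed sequence. W_proj stands
    in for the trained linear projection (plus LoRA adaptation in the real
    model); all shapes here are made up for illustration."""
    audio_tokens = audio_feats @ W_proj                  # (T_audio, d_llm)
    return np.concatenate([prompt_embeds, audio_tokens], axis=0)

rng = np.random.default_rng(0)
T_audio, d_enc, d_llm, T_prompt = 20, 64, 128, 5
seq = salm_decoder_input(rng.standard_normal((T_audio, d_enc)),
                         rng.standard_normal((T_prompt, d_llm)),
                         rng.standard_normal((d_enc, d_llm)) * 0.05)
assert seq.shape == (T_prompt + T_audio, d_llm)
```

Because the audio tokens live in the same space as text tokens, the same decoder pass can either transcribe or answer questions about the audio, depending only on the text prompt.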
SPEECH & LANGUAGE MODEL REFERENCE · 10 ARCHITECTURES · 2017–2025 · Built with SVG + CSS