Overview of Speech and Language Models
This is a cheat sheet for speech and language model architectures.
Speech & Language Model Architectures
A visual reference guide · 9 foundational architectures
Foundational · 2017
Transformer
Vaswani et al. · 2017
"Attention is All You Need." Scaled dot-product attention: Q·Kᵀ/√d softmaxed and applied to V. Multi-head projects into h subspaces. Positional encoding adds sequence order. Foundation for all modern LLMs.
Attn Heads
8
d_model
512
Complexity
O(L²)
Year
2017
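The attention formula above can be sketched in a few lines of plain Python — a single head with no learned projections, just to make the mechanics concrete:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ/√d_k)·V.
    Q, K, V are lists of d_k-dimensional vectors, one per position."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)           # attention distribution over keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 2 query positions, 3 key/value positions, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ctx = attention(Q, K, V)
```

A real Transformer wraps this in h parallel heads, each with its own learned Q/K/V projections, then concatenates and projects the results.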
NLU · Encoder-Only
BERT
Google · 2018
Bidirectional encoder trained via Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Sees full context in both directions. [CLS] token used for classification. Fine-tuned for NLU tasks.
Params
110M/340M
Direction
Bidir.
Layers
12/24
Pre-train
MLM+NSP
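The MLM objective can be sketched as a corruption step over the input tokens. This follows BERT's published 80/10/10 rule; the tiny vocabulary and helper names are illustrative, not from any real tokenizer:

```python
import random

VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "cat", "sat", "on", "mat"]
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def mlm_mask(tokens, mask_prob=0.15, rng=random.Random(0)):
    """BERT-style masking: pick ~15% of non-special tokens as prediction
    targets; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    inputs, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            labels.append(tok)                    # model must recover the original
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB[4:]))  # a random real token
            else:
                inputs.append(tok)                # kept, but still a target
        else:
            inputs.append(tok)
            labels.append(None)                   # not a prediction target
    return inputs, labels

sent = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
inp, lab = mlm_mask(sent, mask_prob=0.5)  # high prob so the tiny demo masks something
```

The 10% random / 10% unchanged cases keep the encoder honest: it cannot rely on [MASK] always marking the positions to predict.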
NLG · Decoder-Only
GPT
OpenAI · 2018–present
Decoder-only autoregressive model with causal masking—each token attends only to previous tokens. Trained by next-token prediction. Scales predictably. RLHF alignment (InstructGPT) enables instruction following.
Paradigm
Autoregress.
Attention
Causal
Training
Next Token
Align
RLHF
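Causal masking is just a lower-triangular constraint on the attention scores. A minimal sketch, with uniform scores so the mask alone determines the weights:

```python
import math

def causal_mask(L):
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return [[0.0 if j <= i else float("-inf") for j in range(L)] for i in range(L)]

def masked_softmax(scores, mask):
    # Adding -inf before softmax zeroes out future positions exactly.
    out = []
    for row, mrow in zip(scores, mask):
        masked = [s + m for s, m in zip(row, mrow)]
        mx = max(masked)
        exps = [math.exp(x - mx) for x in masked]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

L = 4
scores = [[0.0] * L for _ in range(L)]  # uniform scores; the mask does the work
weights = masked_softmax(scores, causal_mask(L))
```

Row i spreads its attention only over positions 0..i, which is what makes next-token prediction well-posed: no token can peek at its own answer.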
ASR · Streaming
Conformer-RNNT
Google · 2020
Combines Convolutional and Transformer modules in each encoder block. Conv captures local acoustic features; attention captures global context. Paired with RNN-T for end-to-end streaming ASR.
Type
Encoder
Task
ASR/STT
Streaming
Yes
Loss
RNNT
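The "conv captures local acoustic features" half can be illustrated with the core op of the Conformer convolution module, a depthwise 1-D convolution. This is a single-channel sketch that omits the gating, pointwise convs, and normalization of the real module:

```python
def depthwise_conv1d(seq, kernel):
    """Per-channel 1-D convolution with 'same' padding: each frame is mixed
    only with its local neighbours, unlike attention's global mixing."""
    half = len(kernel) // 2
    L = len(seq)
    out = []
    for t in range(L):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t + i - half          # neighbour index under 'same' padding
            if 0 <= j < L:
                acc += w * seq[j]
        out.append(acc)
    return out

frames = [0.0, 0.0, 1.0, 0.0, 0.0]   # an impulse at frame 2
smoothed = depthwise_conv1d(frames, [0.25, 0.5, 0.25])
# The impulse spreads exactly one frame each way: strictly local context.
```

In a Conformer block this conv module is paired with multi-head self-attention, so each layer gets both local and global receptive fields.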
SSL · ASR
Wav2Vec 2.0
Meta FAIR · 2020
Self-supervised learning on raw audio. A CNN extracts latent features; a Transformer learns contextual representations via masked prediction. A quantization module creates discrete speech units for contrastive training.
Pre-train
Self-sup.
Input
Raw Audio
Layers
12–24
Loss
Contrastive
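The contrastive objective can be sketched as an InfoNCE-style loss: given a context vector at a masked position, identify the true quantized latent among distractors sampled from other masked positions. Vectors and the temperature value here are toy stand-ins:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(context, positive, distractors, temp=0.1):
    """-log softmax of the positive's similarity: low when the context
    vector is close to the true quantized target and far from distractors."""
    sims = [cosine(context, positive)] + [cosine(context, d) for d in distractors]
    logits = [s / temp for s in sims]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

c = [1.0, 0.0]                      # Transformer context at a masked position
good = contrastive_loss(c, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
bad = contrastive_loss(c, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
```

Minimizing this pulls contextual representations toward the discrete units produced by the quantizer, which is what lets Wav2Vec 2.0 learn from raw, unlabeled audio.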
ASR · Multitask
Whisper
OpenAI · 2022
Encoder-decoder trained on 680K hrs of weakly-supervised web audio. Conv1D stem → Transformer encoder → cross-attention decoder. Multitask: ASR, translation, language detection, timestamp prediction via special tokens.
Training
680K hrs
Languages
99
Variants
Tiny→Large
Input
Log-Mel
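The multitask control via special tokens amounts to building a short decoder prompt. A simplified sketch of the token layout (the real tokenizer resolves these strings to IDs and supports more modes):

```python
def whisper_prompt(language="en", task="transcribe", timestamps=False):
    """Whisper steers its decoder with special tokens: language, task
    (transcribe vs. translate), and timestamp mode are set by the prompt."""
    assert task in ("transcribe", "translate")
    toks = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        toks.append("<|notimestamps|>")
    return toks

# German audio, translated into English text, no timestamps:
p = whisper_prompt(language="de", task="translate")
```

One set of weights thus serves ASR in 99 languages, speech translation, and timestamped output, selected purely by conditioning tokens.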
SSM · Selective
Mamba
Gu & Dao · 2023
State Space Model with selective (input-dependent) state transitions. Linear-time complexity O(L) unlike Transformer's O(L²). Parameters Δ, B, C are computed from input, enabling the model to selectively remember or forget.
Complexity
O(L)
State Size
N=16
No Attn
SSM Only
Recurrent
Inference
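The selective recurrence can be sketched in one dimension. The parameter functions here are toy stand-ins (in the real model Δ, B, C are all learned projections of the input), but the O(L) loop and the input-dependent forgetting are the point:

```python
import math

def selective_scan(xs, A=-1.0):
    """1-D selective SSM sketch: discretize with input-dependent step Δ,
    then run h_t = exp(Δ·A)·h_{t-1} + Δ·B·x_t, y_t = C·h_t in O(L) time."""
    h = 0.0
    ys = []
    for x in xs:
        delta = math.log1p(math.exp(x))   # softplus: Δ depends on the input
        B, C = 1.0, 1.0                   # input-dependent in the real model
        a_bar = math.exp(delta * A)       # decay factor: large Δ forgets faster
        h = a_bar * h + delta * B * x     # constant-size state, no KV cache
        ys.append(C * h)
    return ys

ys = selective_scan([1.0, 0.0, 0.0, 0.0])  # impulse, then silence
```

After the impulse, the state decays geometrically; unlike attention, inference cost per step is constant regardless of how long the sequence gets.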
Multimodal · SOTA
Gemini
Google DeepMind · 2023–2024
Natively multimodal architecture trained from scratch on text, image, audio, and video. Uses Multi-Query Attention, RoPE, GeGLU activations, and Sliding Window Attention. Mixture-of-Experts in Ultra variant.
Modalities
4
Context
1M+ tokens
MoE
Ultra
Attn
MQA+SWA
Streaming · Real-time
SpeedStream
Conceptual Architecture
A conceptual streaming ASR architecture. Processes fixed-size audio chunks with a rolling state buffer for continuity. A local attention window avoids full-sequence recomputation. VAD gating suppresses silence. Emits partial hypotheses in real time with sub-100 ms latency.
Latency
<100ms
Attention
Local/Sparse
Decode
CTC
VAD
Built-in
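Since this architecture is explicitly conceptual, the control flow can be sketched directly: chunked input, a rolling context buffer instead of full-sequence recomputation, and a VAD gate. The decoder is replaced by a placeholder string; all names and thresholds are illustrative:

```python
def stream_asr(frames, chunk=4, context=2, vad_thresh=0.1):
    """Conceptual streaming loop: process fixed-size chunks with a rolling
    context buffer, skip chunks the VAD marks as silence, emit partials."""
    buffer, partials = [], []
    for start in range(0, len(frames), chunk):
        block = frames[start:start + chunk]
        if max(abs(f) for f in block) < vad_thresh:
            continue                            # VAD gate: suppress silence
        window = buffer[-context:] + block      # local context only, O(chunk) work
        partials.append(f"hyp@{start}:{len(window)}f")  # stand-in for CTC decoding
        buffer.extend(block)
    return partials

audio = [0.0] * 4 + [0.5, 0.4, 0.3, 0.2] + [0.0] * 4   # silence, speech, silence
hyps = stream_asr(audio)
```

Because each step touches only `context + chunk` frames, latency stays bounded no matter how long the stream runs, which is the property the sub-100 ms claim rests on.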
★ SOTA 2025
SALM · Speech-Augmented LLM · #1 Open ASR Leaderboard
NVIDIA Canary-Qwen 2.5B
NVIDIA · July 17, 2025 · 5.63% WER · 418× Real-Time
The current #1 open-source ASR model (HuggingFace Open ASR Leaderboard, July 2025). A Speech-Augmented Language Model (SALM) that fuses a FastConformer encoder (32 blocks, 8× subsampling, strided local attention) with a full Qwen3-1.7B LLM decoder via a linear projection layer and LoRA adaptation. The audio encoder maps speech to the same embedding space as text tokens — both are then concatenated and fed to the LLM, enabling two operating modes: pure ASR transcription, and LLM-powered tasks (summarization, Q&A) directly from audio. Trained on 234,000 hours of speech data across 26 datasets. Not Mamba-based — the trend is toward LLM-decoder hybrids rather than SSMs for ASR.
WER (avg)
5.63%
LibriSpeech Clean
1.6%
RTFx Speed
418×
Params
2.5B
Training Data
234K hrs
Encoder
FastConformer ×32
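The SALM fusion described above — a linear projection mapping encoder frames into the LLM's embedding space, then concatenation with text-token embeddings — can be sketched with toy dimensions. All values and the helper name are illustrative, not from the NeMo implementation:

```python
def build_salm_input(audio_feats, text_embs, proj):
    """SALM fusion sketch: project encoder frames into the LLM embedding
    space, then concatenate audio and text embeddings into one sequence."""
    projected = [[sum(w * f for w, f in zip(row, feat)) for row in proj]
                 for feat in audio_feats]       # linear projection per frame
    return projected + text_embs                # one stream: audio, then prompt

# Toy dims: encoder dim 3 -> LLM dim 2.
proj  = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # 2x3 projection matrix
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]     # 2 FastConformer-style frames
text  = [[1.0, 1.0]]                           # 1 prompt-token embedding
seq = build_salm_input(audio, text, proj)
```

Because the LLM sees audio and text as one token stream, the same weights handle both pure transcription and prompted tasks like summarization over the audio.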