Overview of Speech and Language Models

This is a cheat sheet for speech and language model architectures.

Speech & Language Model Architectures

A visual reference guide · 10 architectures

Foundational · 2017

Transformer

Vaswani et al. · 2017
Diagram. Encoder: input embedding + positional encoding → multi-head self-attention → add & LayerNorm → 2-layer feed-forward MLP → add & LayerNorm, repeated ×N. Decoder: output embedding + PE → masked multi-head self-attention → add & LayerNorm → cross-attention (K, V from encoder) → add & LayerNorm → FFN + LayerNorm → linear + softmax.
"Attention is All You Need." Scaled dot-product attention: Q·Kᵀ/√d softmaxed and applied to V. Multi-head projects into h subspaces. Positional encoding adds sequence order. Foundation for all modern LLMs.
Attn Heads: 8 · d_model: 512 · Complexity: O(L²) · Year: 2017
Tags: Self-Attention · Cross-Attention · Positional Enc
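In NumPy, the scaled dot-product step can be sketched directly (a toy illustration with arbitrary sizes, not the paper's implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core Transformer operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (L_q, L_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d_k = 4, 8                     # toy sizes; the paper uses d_model=512, h=8 heads
Q, K, V = (rng.standard_normal((L, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (L, d_k) and np.allclose(w.sum(axis=-1), 1.0)
```

Multi-head attention simply runs this h times on learned per-head projections and concatenates the results.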
NLU · Encoder-Only

BERT

Google · 2018
Diagram. Input: [CLS] tok1 [MASK] tok3 [SEP] → token + segment + position embeddings → Transformer encoder ×12/24 (bidirectional self-attention, feed-forward at 4× hidden size, LayerNorm + residual) → MLM head and NSP head; [CLS] feeds the classifier. MLM masks 15% of tokens; NSP operates on sentence pairs.
Bidirectional encoder trained via Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Sees full context in both directions. [CLS] token used for classification. Fine-tuned for NLU tasks.
Params: 110M/340M · Direction: Bidirectional · Layers: 12/24 · Pre-train: MLM+NSP
Tags: Encoder-Only · Bidirectional · MLM · Fine-tune
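The MLM corruption rule (15% of tokens selected; of those, 80% masked, 10% randomized, 10% kept) can be sketched as follows; the vocabulary and sentence are made up for illustration:

```python
import random

def mlm_corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption (illustrative sketch): select ~15% of tokens as
    prediction targets; of those, 80% become [MASK], 10% a random token, and
    10% stay unchanged. Special tokens are never selected."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_rate:
            continue
        targets[i] = tok                       # the model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # random replacement
        # else: token left intact but still a prediction target
    return corrupted, targets

sent = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corr, tgt = mlm_corrupt(sent, vocab=["dog", "ran", "hat"], seed=3)
```

The 10% "keep unchanged" case forces the model to build useful representations even for unmasked positions.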
NLG · Decoder-Only

GPT

OpenAI · 2018–present
Diagram. Causal (left-to-right) prediction: "The cat sat → ?" under a causal mask. Decoder-only stack ×N layers: masked self-attention (causal) → feed-forward network → LayerNorm (pre-norm from GPT-2 onward) → linear → softmax → next token. Versions: GPT-1 117M, GPT-2 1.5B, GPT-3 175B, GPT-4 ~1.8T (unconfirmed); RLHF (InstructGPT) led to ChatGPT and GPT-4o.
Decoder-only autoregressive model with causal masking—each token attends only to previous tokens. Trained by next-token prediction. Scales predictably. RLHF alignment (InstructGPT) enables instruction following.
Paradigm: Autoregressive · Attention: Causal · Training: Next Token · Align: RLHF
Tags: Decoder-Only · Causal LM · Scaling Laws · RLHF
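The causal mask is just a strictly upper-triangular block of −∞ added to the attention scores before the softmax; a minimal NumPy sketch:

```python
import numpy as np

def causal_softmax(scores):
    """Apply GPT's causal mask: token i attends only to positions <= i."""
    L = scores.shape[0]
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)   # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)           # future positions get -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)           # rows still sum to 1

w = causal_softmax(np.zeros((4, 4)))
# row 0 attends only to itself; row 3 spreads uniformly over all four positions
assert np.allclose(w[0], [1, 0, 0, 0]) and np.allclose(w[3], 0.25)
```

Because exp(−∞) = 0, each token's attention distribution is confined to the past, which is what makes next-token training and autoregressive sampling consistent.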
ASR · Streaming

Conformer-RNNT

Google · 2020
Diagram. Audio → Conformer encoder blocks (FF → norm → multi-head attention → convolution module with depthwise conv → FF → norm) → encoder output → RNN-T prediction network + joint network → tokens.
Combines convolution and Transformer-style self-attention modules in each encoder block: the convolution captures local acoustic features while attention captures global context. Paired with an RNN-T decoder for end-to-end streaming ASR.
Type: Encoder · Task: ASR/STT · Streaming: Yes · Loss: RNN-T
Tags: Conformer · Attention · Conv · RNN-T
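The convolution module's key ingredient, a depthwise (per-channel) 1-D convolution over frames, can be sketched as follows; the kernel and input are illustrative:

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Per-channel ('depthwise') 1-D convolution with same-length output: the
    core of the Conformer convolution module. x: (frames, channels)."""
    L, C = x.shape
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))              # zero-pad along time
    out = np.stack([np.convolve(xp[:, c], kernel, mode="valid") for c in range(C)],
                   axis=1)
    return out[:L]                                    # trim in case of even-sized kernels

frames = np.tile(np.sin(np.linspace(0, 6, 40))[:, None], (1, 3))  # 40 frames, 3 channels
smooth = depthwise_conv1d(frames, np.array([0.25, 0.5, 0.25]))    # local smoothing kernel
assert smooth.shape == frames.shape
```

Because each channel gets its own small kernel, the cost is linear in channels, which is why the conv module is cheap relative to attention.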
SSL · ASR

Wav2Vec 2.0

Meta FAIR · 2020
Diagram. Raw audio → CNN encoder (7 conv layers) → Transformer context network (12–24 layers) trained by masked prediction, with a product-quantization module supplying discrete targets → CTC / fine-tuned output.
Self-supervised learning on raw audio. A CNN extracts latent features; a Transformer learns contextual representations via masked prediction. A quantization module creates discrete speech units for contrastive training.
Pre-train: Self-Supervised · Input: Raw Audio · Layers: 12–24 · Loss: Contrastive
Tags: Self-Supervised · CNN · Transformer · CTC
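The contrastive objective at one masked timestep can be sketched as an InfoNCE loss over cosine similarities (a simplified illustration; wav2vec 2.0 additionally uses a diversity loss and sampled distractors):

```python
import numpy as np

def contrastive_loss(c_t, candidates, true_idx, temperature=0.1):
    """wav2vec 2.0-style InfoNCE at one masked timestep (illustrative sketch):
    the context vector c_t must pick out the true quantized latent among
    distractors, scored by cosine similarity."""
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(c_t)
    sims = (candidates @ c_t) / norms / temperature
    log_z = np.log(np.exp(sims - sims.max()).sum()) + sims.max()  # stable log-sum-exp
    return -(sims[true_idx] - log_z)      # = -log softmax prob of the true latent

rng = np.random.default_rng(0)
cands = rng.standard_normal((10, 16))     # 1 true quantized latent + 9 distractors
loss = contrastive_loss(cands[3], cands, true_idx=3)
assert loss >= 0.0                        # log-sum-exp always dominates the true logit
```

Minimizing this loss pushes the contextual representation toward the true discrete speech unit and away from the distractors.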
ASR · Multitask

Whisper

OpenAI · 2022
Diagram. Log-Mel spectrogram → encoder (Conv1D ×2 + sinusoidal PE → Transformer self-attention blocks) → decoder (learned PE → masked self-attention → cross-attention → linear + softmax), conditioned on special tokens such as <|lang|>, <|task|>, <|notimestamps|>.
Encoder-decoder trained on 680K hrs of weakly-supervised web audio. Conv1D stem → Transformer encoder → cross-attention decoder. Multitask: ASR, translation, language detection, timestamp prediction via special tokens.
Training: 680K hrs · Languages: 99 · Variants: Tiny→Large · Input: Log-Mel
Tags: Enc-Dec · Multitask · Cross-Attn · Multilingual
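The multitask conditioning works through the decoder's special-token prefix; a small sketch of how that prompt is assembled (the token strings match Whisper's vocabulary, the helper function itself is illustrative):

```python
def whisper_prompt(language="en", task="transcribe", timestamps=False):
    """Build the special-token prefix that conditions Whisper's decoder.
    task is 'transcribe' or 'translate' (translate: source language -> English)."""
    toks = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        toks.append("<|notimestamps|>")       # suppress timestamp prediction
    return toks

prompt = whisper_prompt(language="fr", task="translate")
# ['<|startoftranscript|>', '<|fr|>', '<|translate|>', '<|notimestamps|>']
```

Swapping one token in the prefix switches the whole model between transcription, translation, and timestamped output, with no architecture change.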
SSM · Selective

Mamba

Gu & Dao · 2023
Diagram. Input x → Mamba SSM block: two linear projections (signal path and gate z) → Conv1d → input-dependent Δ, B, C → state-space update h′ = Āh + B̄x → SiLU gating (×) → linear output projection.
State Space Model with selective (input-dependent) state transitions. Linear-time complexity O(L) unlike Transformer's O(L²). Parameters Δ, B, C are computed from input, enabling the model to selectively remember or forget.
Complexity: O(L) · State Size: N=16 · Attention: None (SSM only) · Inference: Recurrent
Tags: SSM · Linear Time · Selective · Hardware-Aware
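A toy selective scan showing the input-dependent Δ, B, C and the O(L) recurrence (a single-channel sketch with made-up projections, not the hardware-aware CUDA kernel):

```python
import numpy as np

def selective_scan(x, A, w_delta=1.0):
    """Toy single-channel selective SSM (Mamba-flavoured sketch): step size
    Delta_t and projections B_t, C_t all depend on the input x_t, so the state
    can selectively absorb or ignore each token. Diagonal A, zero-order hold:
        h_t = exp(Delta_t * A) * h_{t-1} + Delta_t * B_t * x_t,  y_t = C_t . h_t
    Runtime is O(L) in sequence length, versus attention's O(L^2)."""
    N = len(A)
    h, ys = np.zeros(N), []
    for x_t in x:
        delta = np.log1p(np.exp(w_delta * x_t))   # softplus keeps Delta_t > 0
        B_t = np.full(N, x_t)                     # input-dependent B (toy projection)
        C_t = np.full(N, x_t)                     # input-dependent C (toy projection)
        h = np.exp(delta * A) * h + delta * B_t * x_t
        ys.append(float(C_t @ h))
    return np.array(ys)

A = -np.linspace(0.5, 2.0, 16)                    # stable (negative) diagonal state matrix
y = selective_scan(np.sin(np.linspace(0, 3, 50)), A)
assert y.shape == (50,) and np.all(np.isfinite(y))
```

A time-invariant SSM would fix Δ, B, C once; making them functions of x_t is exactly what "selective" means, at the cost of losing the convolutional training trick (hence Mamba's custom scan kernel).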
Multimodal · SOTA

Gemini

Google DeepMind · 2023–2024
Diagram. Modality encoders (image: ViT/SigLIP; audio: USM audio tokenizer; text; video: frame encoder) → Gemini Transformer (multi-query attention, RoPE embeddings, gated MLP with GeGLU, sliding-window attention) → outputs: text, code, images. Variants: Ultra / Pro / Flash / Nano.
Natively multimodal architecture trained from scratch on text, image, audio, and video. Uses Multi-Query Attention, RoPE, GeGLU activations, and Sliding Window Attention. Mixture-of-Experts in Ultra variant.
Modalities: 4 · Context: 1M+ tokens · MoE: Ultra · Attention: MQA+SWA
Tags: Multimodal · MoE · RoPE · GeGLU
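Multi-Query Attention can be sketched as h query heads all reading from one shared key/value head (shapes and weights here are illustrative, not Gemini's):

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv):
    """Multi-Query Attention sketch: H separate query heads share ONE key/value
    head, so the inference-time KV cache shrinks by a factor of H."""
    H, d, dh = Wq.shape
    K, V = x @ Wk, x @ Wv                           # single shared K/V: (L, dh)
    heads = []
    for h in range(H):
        Q = x @ Wq[h]                               # per-head queries: (L, dh)
        s = Q @ K.T / np.sqrt(dh)
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)
    return np.concatenate(heads, axis=1)            # (L, H*dh)

rng = np.random.default_rng(0)
L, d, H = 6, 32, 4
x = rng.standard_normal((L, d))
Wq = rng.standard_normal((H, d, d // H)) * 0.1
Wk = rng.standard_normal((d, d // H)) * 0.1
Wv = rng.standard_normal((d, d // H)) * 0.1
out = multi_query_attention(x, Wq, Wk, Wv)
assert out.shape == (L, d)
```

The cached K and V are H times smaller than in standard multi-head attention, which is what makes long-context decoding cheaper.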
Streaming · Real-time

SpeedStream

Conceptual Architecture
Diagram. Streaming audio chunks t₁, t₂, t₃ → chunk encoder with rolling state buffer (h_prev → h_cur) → local attention window (chunk-wise; sparse attention, causal mask) → linear projection → CTC decode → output pipeline: partial hypothesis → VAD detection → final transcript.
A conceptual streaming ASR architecture: fixed-size audio chunks are processed with a rolling state buffer for continuity, a local attention window avoids full-sequence recomputation, and VAD gating suppresses silence. Partial hypotheses are emitted in real time, targeting sub-100 ms latency.
Latency: <100 ms · Attention: Local/Sparse · Decode: CTC · VAD: Built-in
Tags: Streaming · Chunk-wise · CTC · VAD · Real-time
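The chunk-plus-rolling-buffer loop can be sketched as follows; since this is a conceptual design, the encoder and CTC decoder are stand-ins (a windowed mean):

```python
import numpy as np

def stream_transcribe(audio, chunk_size=160, left_context=80):
    """Chunk-wise streaming sketch: each new chunk is processed together with a
    rolling buffer of `left_context` trailing frames from earlier audio, so
    nothing outside the local window is ever recomputed. The encoder + CTC
    decoder are stand-ins here, emitting one 'partial hypothesis' per chunk."""
    state = np.empty(0)
    partials = []
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        window = np.concatenate([state, chunk])  # local attention window
        partials.append(float(window.mean()))    # stand-in for encode + CTC decode
        state = window[-left_context:]           # rolling state buffer
    return partials

hyps = stream_transcribe(np.sin(np.linspace(0, 20, 1000)))
assert len(hyps) == int(np.ceil(1000 / 160))
```

Latency is bounded by the chunk duration plus per-chunk compute, independent of how long the stream has been running.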
★ SOTA 2025
SALM · Speech-Augmented LLM · #1 Open ASR Leaderboard

NVIDIA Canary-Qwen 2.5B

NVIDIA · July 17, 2025 · 5.63% WER · 418× Real-Time
Diagram. Audio input (16 kHz mono WAV or FLAC) → FastConformer encoder (canary-1b-flash: 8× sub-sampling conv, 32 Conformer blocks, local + strided attention) → speech representation h_audio → linear projection + LoRA adaptation → audio tokens concatenated with text tokens ("Transcribe: ⟨audio⟩") → Qwen3-1.7B LLM decoder (causal self-attention with GQA, SwiGLU feed-forward, RMSNorm + RoPE) → linear → vocab softmax → ASR transcript or LLM summary/QA.
The current #1 open-source ASR model (HuggingFace Open ASR Leaderboard, July 2025). A Speech-Augmented Language Model (SALM) that fuses a FastConformer encoder (32 blocks, 8× subsampling, strided local attention) with a full Qwen3-1.7B LLM decoder via a linear projection layer and LoRA adaptation. The audio encoder maps speech to the same embedding space as text tokens — both are then concatenated and fed to the LLM, enabling two operating modes: pure ASR transcription, and LLM-powered tasks (summarization, Q&A) directly from audio. Trained on 234,000 hours of speech data across 26 datasets. Not Mamba-based — the trend is toward LLM-decoder hybrids rather than SSMs for ASR.
WER (avg): 5.63% · LibriSpeech Clean: 1.6% · RTFx Speed: 418× · Params: 2.5B · Training Data: 234K hrs · Encoder: FastConformer ×32
Tags: SALM · FastConformer · Qwen3 LLM · LoRA · Dual-Mode · SOTA 2025 · NeMo
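The SALM fusion step (projecting speech features into the LLM's embedding space and concatenating them with prompt tokens) can be sketched as follows; the projection and shapes are illustrative, not Canary-Qwen's actual dimensions:

```python
import numpy as np

def salm_decoder_input(audio_feats, prompt_embeds, W_proj):
    """SALM fusion sketch: project encoder speech features into the LLM's
    token-embedding space, then concatenate them with the text-prompt
    embeddings so the decoder attends over one mixed sequence. W_proj stands
    in for the trained linear projection (plus LoRA adaptation in the real
    model); all shapes here are made up for illustration."""
    audio_tokens = audio_feats @ W_proj                  # (T_audio, d_llm)
    return np.concatenate([prompt_embeds, audio_tokens], axis=0)

rng = np.random.default_rng(0)
T_audio, d_enc, d_llm, T_prompt = 20, 64, 128, 5
seq = salm_decoder_input(rng.standard_normal((T_audio, d_enc)),
                         rng.standard_normal((T_prompt, d_llm)),
                         rng.standard_normal((d_enc, d_llm)) * 0.05)
assert seq.shape == (T_prompt + T_audio, d_llm)
```

Because the audio tokens live in the same space as text tokens, the same decoder pass can either transcribe or answer questions about the audio, depending only on the text prompt.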
SPEECH & LANGUAGE MODEL REFERENCE · 10 ARCHITECTURES · 2017–2025 · Built with SVG + CSS