System Architecture

How a Large Language Model Generates Text

Trace a single prompt through the full pipeline — from raw characters to predicted next tokens.

Everything begins with a text prompt typed by the user. The model receives raw characters — it doesn't "understand" them yet. It must first convert them into a mathematical representation before any reasoning can occur.

Prompt Input
What happens next
1. Text is split into tokens (sub-word units)
2. Tokens are mapped to high-dimensional vectors
3. Vectors flow through stacked transformer layers
4. A probability distribution over the vocabulary is produced

A tokenizer splits text into tokens — roughly word-sized chunks, but common words stay whole while rare words split into sub-pieces. Each token maps to a unique integer ID in the model's vocabulary (~50,000 entries for GPT-style models).

Tokenization
"The cat sat on" →
The
ID: 464
cat
ID: 3797
sat
ID: 3332
on
ID: 319
Real-world tokenization examples: "cat", "cats", "unbelievable", "ChatGPT"

Each token ID is looked up in an embedding table, a large matrix of learned floating-point weights. The lookup converts each token into a dense vector, typically 768–12,288 numbers, that encodes semantic meaning. Positional encodings are then added so the model knows word order.

Token → Vector (simplified to 12 dims)
Each bar = one dimension · Real embeddings: 768–12,288 dims
Why this matters

"king" − "man" + "woman" ≈ "queen". Words with similar meanings cluster together in this high-dimensional space, encoding enormous amounts of knowledge absorbed from training data.

Self-attention, the heart of the transformer, lets each token "look at" all other tokens and decide what's relevant. For each token the model computes Query, Key, and Value vectors, then scores how much attention to pay to every other position.

Attention heatmap (heads combined)
Brighter = more attention · "on" attends strongly to "sat"
QKV Mechanism
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What do I pass forward?
Multi-head attention

GPT-3 uses 96 attention heads in parallel in each layer. Each head learns different patterns: syntax, coreference, semantics, and more.
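The QKV scoring described above is a minimal sketch away: a single attention head in numpy, with random weight matrices standing in for learned ones and a toy dimension of 16 (real heads typically use 64–128). Decoder-style models additionally apply a causal mask so tokens cannot attend to later positions; that mask is omitted here for brevity.

```python
import numpy as np

def attention_head(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token vectors x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # each token scored against every position
    # softmax over positions → attention weights, each row summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # attention-weighted mix of Value vectors

rng = np.random.default_rng(0)
d = 16                        # toy head dimension
x = rng.normal(size=(4, d))   # 4 tokens: "The cat sat on"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = attention_head(x, Wq, Wk, Wv)
print(out.shape)  # (4, 16) — one updated vector per token
```

Multi-head attention simply runs many such heads in parallel on slices of the embedding and concatenates their outputs.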

After attention, each token's representation passes through a feed-forward network — two linear transformations with a nonlinearity. This is where most knowledge is stored. Stacked dozens of times, each layer builds increasingly abstract understanding.

Transformer stack (simplified)
Knowledge storage

Individual neurons in feed-forward layers activate for specific concepts — cities, dates, famous people. The FFN acts as a distributed key-value memory storing world knowledge learned from training.
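One full transformer sub-step, "feed-forward network plus residual connection", can be sketched as follows. Sizes are toy (d_model = 32, expanded 4× to 128, mirroring the usual ratio); ReLU stands in for the GELU most modern models use, and the weights are random rather than learned.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector before the FFN (pre-norm convention).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a nonlinearity in between:
    # expand to the wider hidden size, then project back down.
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU; real models often use GELU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128   # toy sizes; GPT-2 small uses 768 and 3072
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))                    # 4 token representations
x = x + feed_forward(layer_norm(x), W1, b1, W2, b2)  # residual connection
print(x.shape)  # (4, 32)
```

A full layer interleaves an attention sub-step with this FFN sub-step, and the model stacks dozens of such layers.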

The final token's representation is projected through a language-model head and then a softmax to produce a probability distribution over the full vocabulary. The model then samples from this distribution to choose the next token.

Next-token probabilities · "The cat sat on"
Selected token: "the" p = 34%
The autoregressive loop
Append"the" is appended → "The cat sat on the"
Re-runFull pipeline runs again on the extended sequence
SampleNext token produced — e.g. "mat" (p = 18%)
RepeatUntil end-of-sequence token or max length reached
The cat sat on → the
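The autoregressive loop above can be sketched end to end. Everything here is a hypothetical stand-in: `model_logits` returns toy scores instead of running the real pipeline, and the five-word vocabulary is invented; but the loop structure (logits → temperature-scaled softmax → sample → append → repeat) is the real one.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy vocabulary and stand-in model; a real model recomputes logits
# over ~50,000 tokens from the whole sequence on every step.
VOCAB = ["the", "mat", "floor", "couch", "<eos>"]

def model_logits(tokens):
    # Deterministic fake logits keyed on sequence length.
    return np.random.default_rng(len(tokens)).normal(size=len(VOCAB))

rng = np.random.default_rng(0)

def generate(prompt_tokens, max_new_tokens=5, temperature=1.0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(model_logits(tokens) / temperature)  # distribution over VOCAB
        next_tok = VOCAB[rng.choice(len(VOCAB), p=probs)]    # sample one token
        if next_tok == "<eos>":                              # stop condition
            break
        tokens.append(next_tok)
    return tokens

print(" ".join(generate(["The", "cat", "sat", "on"])))
```

Lowering the temperature sharpens the distribution toward the most likely token (greedy decoding in the limit); raising it flattens the distribution and makes output more varied.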