System Architecture

How a Large Language Model Generates Text

Trace a single prompt through the full pipeline — from raw characters to predicted next tokens.

Everything begins with a text prompt typed by the user. The model receives raw characters — it doesn't "understand" them yet. It must first convert them into a mathematical representation before any reasoning can occur.

Prompt Input
What happens next
1. Text is split into tokens (sub-word units)
2. Tokens are mapped to high-dimensional vectors
3. Vectors flow through stacked transformer layers
4. A probability distribution over the vocabulary is produced

A tokenizer splits text into tokens — roughly word-sized chunks, but common words stay whole while rare words split into sub-pieces. Each token maps to a unique integer ID in the model's vocabulary (~50,000 entries for GPT-style models).

Tokenization
"The cat sat on" →
The
ID: 464
cat
ID: 3797
sat
ID: 3332
on
ID: 319
Real-world tokenization examples: "cat", "cats", "unbelievable", "ChatGPT"

Each token ID is looked up in an embedding table, a large matrix of learned floating-point weights. The lookup converts each token into a dense vector, typically 768–12,288 numbers, that encodes semantic meaning. Positional encodings are then added so the model knows word order.

Token → Vector (simplified to 12 dims)
Each bar = one dimension · Real embeddings: 768–12,288 dims
Why this matters

"king" − "man" + "woman" ≈ "queen". Words with similar meanings cluster together in this high-dimensional space, encoding enormous amounts of knowledge absorbed from training data.

Self-attention, the heart of the transformer, lets each token "look at" all other tokens and decide what's relevant. For each token the model computes Query, Key, and Value vectors, then scores how much attention to pay to every other position.

Attention heatmap (heads combined)
Brighter = more attention · "on" attends strongly to "sat"
QKV Mechanism
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What do I pass forward?
Multi-head attention

GPT-3 uses 96 attention heads in parallel in each layer. Each head learns different patterns: syntax, coreference, semantics, and more.
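The QKV scoring described above is a minimal sketch away: a single attention head in numpy, with random weight matrices standing in for learned ones and a toy dimension of 16 (real heads typically use 64–128). Decoder-style models additionally apply a causal mask so tokens cannot attend to later positions; that mask is omitted here for brevity.

```python
import numpy as np

def attention_head(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token vectors x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # each token scored against every position
    # softmax over positions → attention weights, each row summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # attention-weighted mix of Value vectors

rng = np.random.default_rng(0)
d = 16                        # toy head dimension
x = rng.normal(size=(4, d))   # 4 tokens: "The cat sat on"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = attention_head(x, Wq, Wk, Wv)
print(out.shape)  # (4, 16) — one updated vector per token
```

Multi-head attention simply runs many such heads in parallel on slices of the embedding and concatenates their outputs.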

After attention, each token's representation passes through a feed-forward network — two linear transformations with a nonlinearity. This is where most knowledge is stored. Stacked dozens of times, each layer builds increasingly abstract understanding.

Transformer stack (simplified)
Knowledge storage

Individual neurons in feed-forward layers activate for specific concepts — cities, dates, famous people. The FFN acts as a distributed key-value memory storing world knowledge learned from training.
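One full transformer sub-step, "feed-forward network plus residual connection", can be sketched as follows. Sizes are toy (d_model = 32, expanded 4× to 128, mirroring the usual ratio); ReLU stands in for the GELU most modern models use, and the weights are random rather than learned.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector before the FFN (pre-norm convention).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a nonlinearity in between:
    # expand to the wider hidden size, then project back down.
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU; real models often use GELU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128   # toy sizes; GPT-2 small uses 768 and 3072
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))                    # 4 token representations
x = x + feed_forward(layer_norm(x), W1, b1, W2, b2)  # residual connection
print(x.shape)  # (4, 32)
```

A full layer interleaves an attention sub-step with this FFN sub-step, and the model stacks dozens of such layers.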

The final token's representation is projected through a language-model head and then a softmax to produce a probability distribution over the full vocabulary. The model then samples from this distribution to choose the next token.

Next-token probabilities · "The cat sat on"
Selected token: "the" p = 34%
The autoregressive loop
Append"the" is appended → "The cat sat on the"
Re-runFull pipeline runs again on the extended sequence
SampleNext token produced — e.g. "mat" (p = 18%)
RepeatUntil end-of-sequence token or max length reached
The cat sat on → the
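The autoregressive loop above can be sketched end to end. Everything here is a hypothetical stand-in: `model_logits` returns toy scores instead of running the real pipeline, and the five-word vocabulary is invented; but the loop structure (logits → temperature-scaled softmax → sample → append → repeat) is the real one.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy vocabulary and stand-in model; a real model recomputes logits
# over ~50,000 tokens from the whole sequence on every step.
VOCAB = ["the", "mat", "floor", "couch", "<eos>"]

def model_logits(tokens):
    # Deterministic fake logits keyed on sequence length.
    return np.random.default_rng(len(tokens)).normal(size=len(VOCAB))

rng = np.random.default_rng(0)

def generate(prompt_tokens, max_new_tokens=5, temperature=1.0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(model_logits(tokens) / temperature)  # distribution over VOCAB
        next_tok = VOCAB[rng.choice(len(VOCAB), p=probs)]    # sample one token
        if next_tok == "<eos>":                              # stop condition
            break
        tokens.append(next_tok)
    return tokens

print(" ".join(generate(["The", "cat", "sat", "on"])))
```

Lowering the temperature sharpens the distribution toward the most likely token (greedy decoding in the limit); raising it flattens the distribution and makes output more varied.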