Trace a single prompt through the full pipeline — from raw characters to predicted next tokens.
Everything begins with a text prompt typed by the user. The model receives raw characters — it doesn't "understand" them yet. It must first convert them into a mathematical representation before any reasoning can occur.
A tokenizer splits text into tokens — roughly word-sized chunks, but common words stay whole while rare words split into sub-pieces. Each token maps to a unique integer ID in the model's vocabulary (~50,000 entries for GPT-style models).
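To make the splitting concrete, here is a minimal sketch of greedy longest-match subword tokenization. The vocabulary below is invented for illustration (real tokenizers such as BPE learn tens of thousands of merges from data), but the lookup logic mirrors how a word like "unbreakable" decomposes into known sub-pieces:

```python
# Toy subword tokenizer: greedy longest-match against a tiny,
# hand-made vocabulary (hypothetical; real vocabs are learned).
VOCAB = {"un": 0, "break": 1, "able": 2, "the": 3, " ": 4,
         "b": 5, "r": 6, "e": 7, "a": 8, "k": 9}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry matching at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("unbreakable"))  # "un" + "break" + "able" → [0, 1, 2]
```

Note how the common word "the" would survive as a single token while the rarer "unbreakable" splits into three sub-pieces, each mapped to its integer ID.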
Each token ID is looked up in an embedding table — a giant matrix of learned floats. This converts each token into a dense vector of typically 768–12,288 numbers encoding semantic meaning. Positional encodings are added so the model knows word order.
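The lookup-plus-position step can be sketched in a few lines of NumPy. The sizes here are shrunk for illustration (GPT-2 small uses a 50,257-entry table with 768-dimensional vectors), the table is random rather than learned, and the positional encoding shown is the sinusoidal variant; many models instead learn their position vectors:

```python
import numpy as np

# Toy embedding lookup + sinusoidal positional encoding.
# Dimensions are illustrative stand-ins for the real, learned table.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 8
embedding_table = rng.normal(size=(vocab_size, d_model))

def embed(token_ids):
    x = embedding_table[token_ids]            # (seq_len, d_model) lookup
    pos = np.arange(len(token_ids))[:, None]  # each token's position
    dim = np.arange(d_model)[None, :]
    angle = pos / (10000 ** (2 * (dim // 2) / d_model))
    pe = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))
    return x + pe                             # meaning + word order

print(embed([5, 17, 42]).shape)  # (3, 8): one vector per token
```

Because the position vector is added in, the same token ID at two different positions yields two different inputs, which is how the model distinguishes "dog bites man" from "man bites dog".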
"king" − "man" + "woman" ≈ "queen". Words with similar meanings cluster together in this high-dimensional space, encoding enormous amounts of knowledge absorbed from training data.
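The analogy can be verified with hand-built toy vectors. These 3-dimensional embeddings and the `nearest` helper are invented so the arithmetic works out visibly; real models learn this structure across hundreds of dimensions:

```python
import numpy as np

# Hypothetical 3-D embeddings, hand-chosen so that the axes read
# roughly as (royalty, maleness, femaleness).
vecs = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude):
    # Cosine similarity picks the closest remaining word.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], v))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```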
The heart of the transformer. Self-attention lets each token "look at" all other tokens and decide what's relevant. For each token it computes Query, Key, and Value vectors; dot products between Queries and Keys produce attention scores, which a softmax turns into weights over every other position.
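A single attention head reduces to a few matrix operations. This sketch uses random stand-ins for the learned projection matrices and adds the causal mask used in GPT-style decoders, so each token attends only to itself and earlier positions:

```python
import numpy as np

# Minimal single-head causal self-attention (random weights stand in
# for the learned Wq, Wk, Wv projections).
rng = np.random.default_rng(1)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))            # token representations
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d_k)                    # Query–Key relevance
# Causal mask: future positions get -inf, so softmax zeroes them out.
scores += np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax → attention
out = weights @ V                                  # weighted mix of Values
print(out.shape)  # (4, 8)
```

Each output row is a blend of the Value vectors of the positions that token attended to, weighted by relevance.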
GPT-3 uses 96 attention heads per layer, all running in parallel (GPT-4's architecture has not been published). Each head learns different patterns — syntax, coreference, semantics, and more.
After attention, each token's representation passes through a feed-forward network — two linear transformations with a nonlinearity. This is where most knowledge is stored. Stacked dozens of times, each layer builds increasingly abstract understanding.
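The feed-forward block is just "expand, nonlinearity, contract," applied to each token independently. The sketch below follows the common 4x expansion convention and GPT-2's tanh approximation of GELU; the random matrices stand in for learned weights:

```python
import numpy as np

# Position-wise feed-forward network: two linear maps with a GELU
# in between. Weights are random stand-ins for learned parameters.
rng = np.random.default_rng(2)
d_model, d_ff = 8, 32                        # d_ff = 4 * d_model by convention
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x):
    return gelu(x @ W1 + b1) @ W2 + b2       # expand → activate → contract

h = rng.normal(size=(3, d_model))            # 3 token representations
print(ffn(h).shape)  # (3, 8): same shape in, same shape out
```

In a real model this block, together with attention, is wrapped in residual connections and layer normalization, then stacked dozens of times.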
Individual neurons in feed-forward layers activate for specific concepts — cities, dates, famous people. The FFN acts as a distributed key-value memory storing world knowledge learned from training.
The final token's representation is projected through a language model head and then a softmax to produce a probability distribution over the full vocabulary. The model samples from this distribution (or greedily picks the most probable entry) to choose the next token.
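The last hop from hidden state to next token can be sketched as a projection, a temperature-scaled softmax, and a draw from the resulting distribution. The projection matrix here is a random stand-in for the learned LM head, and the vocabulary is shrunk for illustration:

```python
import numpy as np

# From final hidden state to next-token ID: project through a
# (stand-in) LM head, apply softmax with temperature, then sample.
rng = np.random.default_rng(3)
d_model, vocab_size = 8, 50
W_lm = rng.normal(size=(d_model, vocab_size))    # "LM head" projection

def next_token_dist(h, temperature=1.0):
    logits = h @ W_lm / temperature              # one score per vocab entry
    z = np.exp(logits - logits.max())            # numerically stable softmax
    return z / z.sum()

h = rng.normal(size=d_model)                     # final token's representation
p = next_token_dist(h)
token = rng.choice(vocab_size, p=p)              # sample the next token ID
```

Lowering the temperature sharpens the distribution toward the most likely tokens; raising it flattens the distribution and makes sampling more adventurous.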