
How LLMs Actually Work

You use ChatGPT, Claude, or Copilot every day. But what actually happens between typing a prompt and getting a response?

This post traces the path of your text through the machine, step by step, in plain English, with the real math where it matters. No hand-waving. No "it's complicated." Just the actual pipeline, from the moment you hit Enter to the moment the first word appears.

Your text gets chopped into tokens

The first thing that happens: your text gets split into chunks called tokens. Not words. Tokens. A token is typically a piece of a word, 2–6 characters long.

Why not just use whole words? Because no fixed word list can handle everything. What about "unaffordable"? Or "GPT-4o"? Or "こんにちは"? A word-level system would choke on anything it hadn't seen before.

Instead, LLMs use an algorithm called Byte-Pair Encoding (BPE).1 The idea is simple: start with individual bytes (the 256 raw values a computer uses to store text) and scan through a massive pile of text looking for which two adjacent bytes appear together most often. Merge that pair into a single new token. Then scan again, merge again. Keep going until you've built up a vocabulary of ~32,000–100,000 tokens. Short common words like "the" end up as single tokens. Longer or rarer words get split into recognizable pieces.

Here's the step-by-step:

  1. Start with a vocabulary of 256 individual byte values, every possible byte a computer can store. (Unicode characters like "é" or "中" are multiple bytes in UTF-8, so they start as multiple base tokens.)
  2. Scan through all the text the model will be trained on (books, websites, code, Wikipedia, everything) and count every pair of adjacent symbols.
  3. Take the most frequent pair and merge it into a new single token. Add it to the vocabulary.
  4. Repeat until the vocabulary reaches the target size (e.g., 100,277 for GPT-4).

Common words like "the" stay as single tokens. Uncommon words get split into recognizable pieces:

"unaffordable"  →  "un" + "afford" + "able"
"lowering"      →  "low" + "er" + "ing"
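The merge loop above can be sketched in a few lines of Python. This is a toy version for illustration: it merges characters rather than raw bytes and skips the efficiency tricks real tokenizers use:

```python
from collections import Counter

def bpe_train(text, num_merges):
    # Toy BPE: start from single characters (real tokenizers start from bytes).
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Merge the most frequent pair into one new token.
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 2] == [a, b]:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```

On the classic teaching example "aaabdaaabac", the first merge produces "aa" and the second "aaa": exactly the greedy count-and-merge behavior described above.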

Once this vocabulary is built, tokenization at inference applies the learned merge rules in priority order, always merging the highest-ranked pair first, until no more merges apply. Each token gets assigned a number, its token ID. These integers are the model's actual input. Here's what that looks like for a real sentence:

"The cat sat on the mat"

"The"   → 464
" cat"  → 3797    ← the space is part of the token
" sat"  → 3332
" on"   → 319
" the"  → 262
" mat"  → 2603

Notice the spaces are absorbed into the tokens themselves. Tokenization is not splitting on whitespace. " cat" (with the space) is a single token, different from "cat" without one.

Why this matters in practice: English is heavily favored. "Hello" is 1 token. "Привет" (Russian "hello") is 3 tokens. "नमस्ते" (Hindi) is 6. And "မင်္ဂလာပါ" (Burmese "hello") is 18 tokens. The same greeting, 18× the cost. This means LLMs process English faster, cheaper, and can fit more English text into their context window. The tokenizer is the first source of language inequality.

Each token becomes a list of numbers

So now we have token IDs: integers like 464, 3797, 3332. But the LLM can't do anything useful with bare integers. It needs something richer. So the very first thing the network does is look up each token ID in a giant table and swap it for a vector, a long list of numbers.

How long? In GPT-3, each token becomes a list of 12,288 numbers.2 That table (called the embedding matrix) has one row per token in the vocabulary. Looking up token 464 ("The") means grabbing row 464, which gives you 12,288 numbers that collectively represent what "The" means to this model. GPT-3's vocabulary is about 50,000 tokens, so the table is 50,000 rows × 12,288 columns. Roughly 617 million numbers, and that's before the model has done any actual processing.

(The embedding matrix: each token ID looks up its row to get a unique vector of numbers.)

Why so many numbers? Because each number is a dimension, and all those dimensions together place the token at a specific point in high-dimensional space. The thing that makes this useful: words with similar meanings end up near each other.

(2D scatter plot of word embeddings: clusters for royalty, technology, people, emotions, animals, and food, with related words like "cat" and "kitten" sitting close together.)

"Cat" and "kitten" are close together. "Cat" and "legislature" are far apart. "King" and "queen" are close, and they're in the same direction from each other as "man" and "woman". That's the famous king - man + woman ≈ queen relationship. Meaning has geometry.3
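You can check this geometry with a toy example. The four-dimensional vectors below are hand-picked for illustration (real embeddings have thousands of dimensions and are learned, not chosen), but the arithmetic is the same:

```python
import math

# Invented 4-dimensional embeddings, for illustration only.
emb = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "queen": [0.9, 0.1, 0.8, 0.3],
    "man":   [0.1, 0.8, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.8, 0.2],
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means "pointing the same way."
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed dimension by dimension...
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
# ...and the nearest word in this tiny vocabulary is "queen".
closest = max(emb, key=lambda word: cosine(emb[word], target))
```

With these made-up values the analogy works exactly; in a real model it only works approximately, which is part of why the relationship is written with ≈ rather than =.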

From this point on, the LLM never sees words or text again. It only manipulates these lists of numbers, and every step from here is arithmetic on them.

One more thing happens before the next step: the model needs to know what order the words are in. Nothing we've done so far encodes position. Token 464 gets the same 12,288 numbers whether it's the first word or the hundredth. But word order obviously matters: "the cat sat on the mat" and "mat the on sat cat the" have the same tokens but completely different meanings. So a position signal is mixed into each token's numbers, essentially stamping each one with "I'm the 1st word," "I'm the 2nd word," and so on. Modern LLMs encode position using a technique called RoPE (Rotary Position Embedding),4 which, with additional scaling tricks, can be extended to handle sequences longer than the training window. But the basic idea is simple: each token knows where it sits in the sentence.
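The rotary idea can be sketched in a few lines (illustrative only: real implementations rotate the Query and Key vectors inside every attention head, across thousands of dimensions). Each pair of dimensions is rotated by an angle proportional to the token's position, so the dot product between two rotated vectors depends only on how far apart the tokens are:

```python
import math

def rope(vec, position, base=10000.0):
    # Rotary position embedding, toy version. Assumes len(vec) is even.
    # Each (even, odd) pair of dimensions is rotated by an angle that
    # grows with the token's position and shrinks for later dimensions.
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = position / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because rotations preserve angles, a Query at position 7 scores against a Key at position 5 exactly as a Query at position 3 scores against a Key at position 1: only the distance of 2 matters. That relative-position property is what makes RoPE extendable with scaling tricks.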

Attention: "which other words matter here?"

For every token in your input, the model asks: "Which other tokens should I pay attention to when understanding this one?"

Consider the word "bank" in two sentences:

  • "I deposited money at the bank" → financial institution
  • "The river bank was muddy" → edge of a river

Same word, completely different meaning. The attention mechanism resolves this by looking at context. In the first sentence, "bank" attends strongly to "deposited" and "money." In the second, it attends to "river" and "muddy." The word's internal representation changes based on what it's paying attention to.

Here's a full example: the attention pattern for the sentence "The cat sat on the mat because it was tired."

(Attention-weight visualization, one head from a single layer; the model runs 96 of these in parallel, each learning different patterns.)

Notice how "it" attends overwhelmingly to "cat". That's the model figuring out what the pronoun refers to. And "tired" attends to both "cat" and "it", connecting the adjective to its subject. This happens automatically, learned from data.5

So how does this actually work mechanically? Each token's vector (that list of numbers from the embedding step) gets multiplied by three separate tables of numbers that the model learned during training (called weights, you'll see this term a lot in ML) to produce three new vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I have to offer?"
  • Value (V): "What information do I carry?"

Every token gets its own Q, K, and V. Then, to figure out how much token i should attend to token j, the model checks: how well does i's Query match j's Key? It does this with a dot product, which measures how similar the two vectors are. Similar vectors produce a high score, dissimilar ones produce a low score.

\text{score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}

That \sqrt{d_k} at the bottom is just a scaling factor. d_k is the length of the Key vector (e.g., 128 numbers). Without dividing by it, the dot products would grow unreasonably large with longer vectors, and the next step would break down. It's like normalizing a test score by the number of questions so you can compare fairly.

Now the model has a score for every pair of tokens. Next, it converts those raw scores into percentages using softmax, which makes all the scores positive and forces them to add up to 100%. So for the word "it" in our example sentence, you might end up with: "cat" gets 52%, "was" gets 20%, "it" (itself) gets 23%, and everything else shares the remaining 5%.

Finally, the model uses those percentages to create a weighted blend of everyone's Values. If "it" pays 52% attention to "cat," then 52% of "cat"'s Value vector gets mixed into "it"'s new representation. The full equation:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
(Score-matrix visualization for "The cat sat on the mat": raw Q·K dot-product scores on one side, softmax-normalized attention weights on the other.)

That's the entire attention mechanism.6 Each token's output is a blend of all other tokens' information, weighted by relevance.

One important rule: no peeking forward. When generating text, the model produces tokens left to right. So each token is only allowed to attend to tokens that came before it (and itself), never to tokens that come after. Future positions are blocked out completely. This is what forces the model to generate one token at a time without seeing what's coming next.

(Step-through visualization of the causal mask for "The cat sat on the mat": raw scores → mask future positions → normalize with softmax.)
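All of the pieces so far (Q·K scores, scaling by √d_k, the causal mask, softmax, the weighted blend of Values) fit in one short function. A toy single-head version in plain Python, with made-up vector sizes:

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; exp(-inf) becomes 0,
    # so masked positions get exactly zero weight.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def causal_attention(Q, K, V):
    # Q, K, V: one vector per token. Token i may only attend to tokens 0..i.
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Raw score of token i against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Causal mask: future positions are blocked out completely.
        scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output is a weighted blend of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out
```

Real implementations do the same arithmetic as batched matrix multiplications on a GPU, but the loop above is the whole mechanism.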

And it doesn't do this just once. It does it 96 times in parallel. Each parallel copy is called an "attention head," and each head has its own Q, K, and V weights. Why? Because different relationships require different kinds of attention. One head might learn to track grammar (matching verbs to their subjects). Another might learn pronoun resolution (figuring out that "it" refers to "cat"). Another might track topic or sentiment. GPT-3 runs 96 of these heads simultaneously, then combines all their outputs back into a single vector per token.

(Four attention heads running in parallel, each learning a different pattern.)

The result: every token now carries information from every relevant token that came before it. The word "bank" no longer has a single generic meaning. It has a specific, context-aware meaning shaped by its surroundings.

Do it again. 96 times.

We saw how attention lets each token look at every other token and figure out which ones are relevant. But attention is only half the job. After attending, each token also passes through a second step: a feed-forward network, a small network that takes a token's numbers in, runs them through two sets of weights (learned number tables, same idea as attention), and spits updated numbers out. It processes each token on its own, one at a time.

These two steps together (attention, then feed-forward) make up one Transformer layer. You can think of a layer as one round of processing. And the model doesn't do it just once. GPT-3 stacks 96 of these layers on top of each other. Your tokens flow through all 96, getting refined a little more at each one.

Here's what happens inside each layer. x is a token's vector, that list of numbers we've been tracking. It goes through attention, gets added back to itself, then goes through the feed-forward network and gets added back again:

// Step A: Attention - figure out which other tokens matter
a = LayerNorm(x)
x = x + MultiHeadAttention(a)

// Step B: Feed-forward - apply stored knowledge
f = LayerNorm(x)
x = x + FFN(f)
(Diagram of one decoder layer, stacked 96 times end to end in GPT-3: input embeddings → self-attention ("which words matter?") → add & normalize (residual connection) → feed-forward network (knowledge lookup) → add & normalize → output probabilities via softmax.)

There's one important trick that makes 96 layers possible: a residual connection (also called a skip connection). At each layer, the output gets added to the input rather than replacing it. So if a particular layer has nothing useful to contribute for a given token, the original information just passes through untouched. Without this, by the time a token reached layer 96, the useful signal from the early layers would be washed out entirely. The residual connection is what keeps the signal intact across all 96 rounds.

What does the feed-forward part actually do? The attention step figures out relationships between tokens, but the feed-forward step is where the model applies its stored knowledge to each token individually. It takes each token's vector, expands it to 4× its size (from 12,288 numbers to 49,152 in GPT-3), runs every number through a mathematical function that can amplify, dampen, or reshape it, then compresses it back down to the original size. In equation form:

\text{FFN}(x) = \text{GELU}(x \cdot W_1 + b_1) \cdot W_2 + b_2

W_1 and W_2 are the learned weight tables. These two tables alone account for about two-thirds of each layer's total numbers. They're where most of the model's "knowledge" physically lives.

That activation function is what makes the feed-forward network nonlinear. Without it, the two weight tables would collapse into one bigger linear transform, and the model couldn't learn logic-like combinations of features.
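In toy form, with tiny weight tables standing in for the real ones (GPT-3's are 12,288 × 49,152) and a common tanh approximation of GELU:

```python
import math

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2-style models.
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    # Expand: x (length d) times W1 (d x 4d), plus bias, through the nonlinearity.
    hidden = [gelu(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    # Compress: hidden (length 4d) times W2 (4d x d), plus bias, back to length d.
    return [sum(h * W2[i][j] for i, h in enumerate(hidden)) + b2[j]
            for j in range(len(b2))]
```

Note what happens without `gelu`: the two multiplications would collapse into a single matrix multiply, and the expand-then-compress structure would buy nothing.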

(Feature-detector visualization: for one token, neurons fire strongly for "English text?" (0.95), "a proper noun?" (0.92), and "European country?" (0.97), and stay quiet for "part of source code?" (0.02) or "a four-legged animal?" (0.08). Each neuron in the feed-forward network acts as a feature detector.)

The parameter count adds up fast. Each token's vector is 12,288 numbers long. This is called the model dimension (d_\text{model}), and it's basically the "width" of the model. For GPT-3, d_\text{model} = 12{,}288, and nearly every weight table in the model is sized in terms of it.

GPT-3 has 175 billion parameters total. Each parameter is a single floating-point number (typically stored as 16-bit), so a 70B model requires ~140 GB just for the weights, more than most computers have in RAM.
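The back-of-envelope arithmetic is worth doing once. Using the GPT-3 numbers from this post (four d × d attention tables per layer for Q, K, V, and the output projection; two feed-forward tables of d × 4d and 4d × d):

```python
d_model = 12288   # GPT-3's model width
n_layers = 96
vocab = 50257

attn_per_layer = 4 * d_model ** 2   # Q, K, V, and output projection tables
ffn_per_layer = 8 * d_model ** 2    # d x 4d up-projection + 4d x d down-projection
embedding = vocab * d_model         # the lookup table from earlier

total = n_layers * (attn_per_layer + ffn_per_layer) + embedding
print(f"{total / 1e9:.0f}B")        # prints 175B (biases and LayerNorms omitted)
```

This also confirms the two-thirds claim above: the feed-forward tables (8d²) are twice the size of the attention tables (4d²) in every layer.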

Early layers tend to capture surface patterns: grammar, syntax, word boundaries. Middle layers build up compositional meaning: who did what to whom, the topic of the passage. Later layers assemble the high-level representation needed to predict the next token: tone, intent, the specific factual or stylistic pattern that should come next.

The model makes its guess

The tokenizer splits text into tokens from a fixed list, about 50,000 of them. That list is the model's vocabulary: every token the model knows how to produce. It can't output a word or piece of a word that isn't in this list.

After flowing through all 96 layers, the last token's vector (12,288 numbers, now carrying context from every token before it) needs to be turned into a prediction: which of those ~50,000 tokens should come next?

The model does this by producing a raw score for every single token in its vocabulary called logits: one number per token, where higher means "the model thinks this token is more likely to come next." The math is one multiplication against a learned weight table (called the language model head, or W_\text{lm}), followed by softmax:

\text{logits} = h_n \cdot W_\text{lm} \qquad P(\text{next token} = i) = \frac{e^{\text{logit}_i}}{\sum_j e^{\text{logit}_j}}

Here h_n is the last token's vector after all 96 layers (12,288 numbers), and W_\text{lm} is a table of 12,288 × 50,257 numbers, one column per token in the vocabulary. Multiplying the vector against this table produces one score per token. Then softmax turns those scores into probabilities that add up to 100%.

For the input "The cat sat on the," the output distribution assigns a probability to every token in the vocabulary — about 50,000 of them. Most get near-zero probability. The top predictions:

mat 31%, floor 12%, table 9%, bed 6%, couch 5%, ground 4%, roof 3%, chair 3%, sofa 2%, rug 2%, counter 1.5%, bench 1%, and ~49,990 more tokens with tiny probabilities.

(The model assigns a probability to every token in its vocabulary; these are just the top candidates.)

The model doesn't "know" the answer. It computed this distribution from patterns in its training data, where "the cat sat on the mat" appeared many times. It's a statistical reflection of the text it was trained on, not a fact lookup.

Sampling: how randomness makes it creative

The model has a probability distribution. Now it needs to pick one token. The simplest approach: always pick the most probable one (greedy decoding). But this produces boring, repetitive text. The model gets stuck in loops, repeating the same phrases.

Instead, LLMs use controlled randomness. The main parameter is temperature, a single number (called T in the equation below) that you set when using the model. You've probably seen it as an API setting or a slider in a model playground. Before applying softmax, each logit is divided by T:

P(\text{token}_i) = \frac{e^{\text{logit}_i / T}}{\sum_j e^{\text{logit}_j / T}}

This single parameter reshapes the entire distribution:

  • T → 0: The distribution collapses to a spike on the highest-logit token. Nearly deterministic. Greedy.
  • T = 1.0: The raw model distribution. What the model actually "thinks."
  • T > 1.0: The distribution flattens. Low-probability tokens get boosted. More surprising, more creative, more likely to go off the rails.

As a concrete example, at T = 1.0 the distribution might look like: mat 51.8%, floor 14.1%, table 9.5%, bed 6.3%, couch 5.2%, ground 3.8%, roof 2.9%, chair 2.7%, sofa 1.9%, rug 1.7%. Lower the temperature and "mat" absorbs nearly all of the probability; raise it and the tail tokens gain ground.

Two more techniques refine the selection:

  1. Top-k sampling restricts the choice to the k highest-probability tokens (e.g., k = 50). Everything ranked 51st and below gets zero probability, regardless of its logit. Simple but crude: it uses the same k whether the model is confident or uncertain.

  2. Top-p (nucleus) sampling is smarter. Sort the tokens by descending probability. Walk down the list, accumulating probability. Stop when the cumulative sum reaches p (e.g., p = 0.9). Only sample from those tokens. This is adaptive: when the model is confident (peaked distribution), it considers few tokens. When uncertain, it considers many. Top-p is generally preferred over top-k for this reason.7

Typical production settings: temperature=0.7 with top_p=0.9 is a common choice for chat. For code or factual tasks: temperature=0 (greedy). For creative writing: temperature=1.0 or higher.

This is also why the same prompt gives you different answers every time. Same logits, different dice rolls.
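Temperature, top-p, and the dice roll fit together in a few lines. A sketch (simplified: real implementations work on tensors over the full ~50,000-token vocabulary):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature: divide every logit by T before softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Top-p: keep the smallest set of tokens whose probabilities sum to p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Roll the dice within the kept set.
    r = random.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very low temperature the highest logit always wins; with a tight top_p, confident distributions reduce to a single candidate. Both behaviors fall out of the same few lines.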

Repeat: one token at a time

Here's the thing that surprises most people: the model generates one token at a time. Not a word. Not a sentence. One token, and then the entire process runs again from scratch.

The chosen token is appended to the input sequence. Embedding, positional encoding, 96 layers of attention, projection, softmax, sampling, all of it, runs again. The new token's embedding flows through every layer, attending to all tokens that came before it, including the ones the model itself just produced.

This is called autoregressive generation. The model feeds its own output back as input. Each token is conditioned on all previous tokens, including its own prior outputs.

This is why LLM responses stream word by word rather than appearing all at once. Each token requires a complete forward pass: 175 billion multiply-adds for GPT-3. A 100-token response means 100 forward passes.

There is no planning step within the generation mechanism itself. The model doesn't draft a response and then output it. It commits to each token the moment it's generated and can never go back. This is why LLMs can start a sentence confidently and paint themselves into a corner. They're improvising one word at a time, with no ability to revise earlier choices. (Reasoning models like OpenAI's o1 work around this by generating a long chain of intermediate "thinking" tokens before the final answer, effectively planning through sequential generation rather than above it.)
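Stripped of the model's internals, the whole loop is just a few lines. `model` and `sample` here are stand-ins (any function mapping a token sequence to next-token logits, and any of the sampling strategies above):

```python
def generate(prompt_ids, model, sample, n_tokens):
    # model: token sequence -> logits for the next token (stand-in here)
    # sample: logits -> chosen token id (e.g., temperature/top-p sampling)
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        logits = model(ids)       # a full forward pass over the whole sequence
        next_id = sample(logits)  # one dice roll
        ids.append(next_id)       # committed: the model never revises it
    return ids
```

Everything interesting hides inside `model`; the autoregression itself is nothing more than append-and-repeat.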

Each token is generated by a full forward pass through the model; there is no lookahead.

How did it learn all this?

Everything above describes inference, what happens when you use the model. But how did it learn to produce these probability distributions in the first place? Three phases.

1: Pre-training

The model reads trillions of words (books, Wikipedia, Reddit, code, news, scientific papers) and learns one thing: predict the next token. That's the entire objective. No labels, no specific tasks.

For every token in the training text, the model tries to predict it, checks how wrong it was, and adjusts. The way it measures "how wrong" is a formula called the loss function, specifically cross-entropy, which boils down to: "how surprised was the model by the actual next token?"

L = -\log P(t_\text{actual} \mid \text{all preceding tokens})

t_\text{actual} is just the token that actually came next in the training text, the right answer. So the formula reads: "take the probability the model assigned to the correct next token, and measure how bad that was." If the model gave it a high probability, L is small (barely wrong). If it gave the correct token a tiny probability, L is huge (very wrong). The model then works backward through all its weights and nudges each one slightly to make the correct answer more likely next time. This process is called backpropagation. Repeat this billions of times across trillions of tokens, and the weights gradually encode the patterns of language. This is self-supervised learning: the data labels itself, because the "right answer" is always just the next token in the text.
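The asymmetry of the loss is easy to see with numbers. A sketch, using a hypothetical three-token vocabulary:

```python
import math

def next_token_loss(probs, actual_id):
    # Cross-entropy: the negative log of the probability
    # the model assigned to the token that actually came next.
    return -math.log(probs[actual_id])

# Model was confident and right: small loss.
confident_right = next_token_loss([0.90, 0.05, 0.05], actual_id=0)  # ~0.105
# Model gave the correct token only 1%: large loss.
confident_wrong = next_token_loss([0.01, 0.90, 0.09], actual_id=0)  # ~4.6
```

Being confidently wrong is punished far harder than being mildly uncertain, which is exactly the pressure that shapes the weights during training.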

To give you a sense of scale:

| Model | Training data | Compute | Estimated cost |
|---|---|---|---|
| Llama 2 70B (2023) | 2 trillion tokens | 1.7M A100 GPU hours | ~$3–5M* |
| DeepSeek-V3 671B (2024) | 14.8 trillion tokens | 2.8M H800 GPU hours | $5.6M |
| Llama 3 405B (2024) | 15.6 trillion tokens | 16,000 H100 GPUs | ~$60–100M* |

*Community estimates based on public GPU pricing. DeepSeek-V3's cost is stated directly in its technical report.

The data goes through heavy preprocessing: deduplication, quality filtering, safety filtering, and deliberate oversampling of high-quality sources like books and academic text.8

After pre-training, the model can continue any text fluently, but it has no concept of being helpful, safe, or honest. It'll happily continue a harmful prompt. It's a raw text-completion engine.

2: Fine-tuning

Researchers curate thousands of (instruction, ideal response) pairs and train the model on them. Same next-token objective, but now applied to "given this question, produce this kind of answer."

After fine-tuning on as few as ~13,000 curated examples (InstructGPT used this many),9 the model follows instructions, stays on topic, and produces appropriately structured responses.

3: RLHF

Fine-tuning gets it following instructions, but it can still produce harmful, biased, or unhelpful outputs. Reinforcement Learning from Human Feedback (RLHF) adds a preference signal:

  1. The model generates two answers to the same question.
  2. Human raters pick which one is better.
  3. A reward model learns to predict human preferences.
  4. The LLM is trained to maximize the reward while staying close to its original behavior:
\text{Objective} = \mathbb{E}[\text{reward}(x, y)] - \beta \cdot \text{KL}(\pi_\theta \| \pi_\text{SFT})

The first part says "maximize the reward": generate answers that the reward model scores highly. The second part, \text{KL}(\pi_\theta \| \pi_\text{SFT}), is a KL penalty (short for Kullback–Leibler divergence, a way of measuring how far the model has drifted from its original behavior). \beta controls how strong this leash is. Without it, the model would quickly learn to game the reward model, producing bizarre, unnatural text that technically scores high but is useless to humans. This gaming behavior is called reward hacking, and the KL penalty is what prevents it.

One result worth sitting with: InstructGPT showed that a 1.3 billion parameter model trained with RLHF was preferred by humans over the raw 175 billion parameter GPT-3. The small aligned model beat the big unaligned one. Alignment compounds on top of scale.

Why LLMs hallucinate

After tracing the full path, hallucination isn't mysterious. It's a direct consequence of the mechanism.

The model is always producing the most statistically likely next token based on patterns in its training data. When it encounters a question it doesn't have clear patterns for, it doesn't stop and say "I don't know." It generates the most plausible-sounding continuation, because that's literally what it was optimized to do at every single step.

In 2023, a lawyer used ChatGPT to research legal precedents. The model generated six entirely fictional court cases, complete with plausible-sounding citations, judge names, and case numbers. When the lawyer asked ChatGPT if the cases were real, it confirmed they were. The judge imposed a $5,000 fine on the lawyers and their firm. (Mata v. Avianca, Inc.)

This isn't a bug that can be patched. It's structural. The model compresses statistical patterns, not facts. When you ask for specifics, you sometimes get the shape of an answer rather than a real one. Ted Chiang captured it: LLM output is like a "blurry JPEG of the web."10 The overall picture is there, but fine details are approximations that can be wrong.

Andrej Karpathy frames it differently: the model is dreaming.11 Its outputs are internally coherent and plausible, following their own logic, but they aren't anchored in reality. Fine-tuning doesn't fix the dreaming. It just directs the dreams into "helpful assistant" territory.

Everything in this post (tokenization, embeddings, attention, sampling) is a mechanism for producing statistically plausible text. Not true text. Not fact-checked text. Plausible text. That's what makes LLMs both incredibly useful and fundamentally unreliable as sources of truth.12


  1. Sennrich et al. (2016), Neural Machine Translation of Rare Words with Subword Units. The paper that introduced Byte-Pair Encoding for neural language models.

  2. Brown et al. (2020), Language Models are Few-Shot Learners. The GPT-3 paper. Most of the architecture specifics in this post (12,288-dimensional embeddings, 96 layers, 175 billion parameters) come from here.

  3. The 3Blue1Brown Deep Learning series, Chapters 5-7, has the best animated explanation of embeddings and the geometry of meaning.

  4. Su et al. (2021), RoFormer: Enhanced Transformer with Rotary Position Embedding. RoPE is now the standard position encoding in most modern LLMs.

  5. Jay Alammar's The Illustrated Transformer is the canonical visual explainer for attention. Worth reading alongside this post.

  6. Vaswani et al. (2017), Attention Is All You Need. The paper that introduced the Transformer architecture and the scaled dot-product attention mechanism.

  7. Holtzman et al. (2020), The Curious Case of Neural Text Degeneration. Introduced nucleus (top-p) sampling and showed why greedy and pure sampling both produce degenerate text.

  8. Meta AI (2024), The Llama 3 Herd of Models. Documents the training data composition, compute requirements, and preprocessing pipeline at scale.

  9. Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback. The InstructGPT paper. Showed that RLHF on a 1.3B model could outperform the raw 175B GPT-3.

  10. Ted Chiang, ChatGPT Is a Blurry JPEG of the Web (The New Yorker, 2023). The best metaphor for how compression-based generation produces plausible but unreliable output.

  11. Andrej Karpathy, Intro to Large Language Models (1hr). He frames LLM outputs as "dreaming": internally coherent, statistically plausible, but not anchored in reality.

  12. For deeper explorations: Stephen Wolfram's What Is ChatGPT Doing and Why Does It Work? traces a similar pipeline in detail. Simon Willison's Catching Up on the Weird World of LLMs is an excellent primer on capabilities and limitations. And bbycroft.net/llm lets you watch the entire forward pass in interactive 3D.