The Big Leap: "Attention Is All You Need" (2017)

Before 2017, the best models for understanding text were recurrent neural networks (RNNs). They read sentences one word at a time, left to right, like a person reading with the page covered except for the current word. By the time they reached the end of a long sentence, their memory of the first few words had faded.

📚 Analogy (the party trick): Imagine trying to understand a 300-page novel while holding only one sentence in your head at a time. You'd constantly forget who "he" referred to, or what event "it" pointed back to. That was the RNN problem.

Then, in 2017, a team of researchers (most of them at Google) published a paper called "Attention Is All You Need", introducing the Transformer architecture. The key insight: instead of reading word by word, look at the whole sentence at once and compute relationships between every pair of words simultaneously. This has two major consequences:

✓ Better understanding

Long-range dependencies (e.g. a pronoun referring to a noun 50 words earlier) are handled naturally — the model just "looks" directly at the relevant word.

✓ Parallelism

Because all words are processed simultaneously rather than one-by-one, Transformers can use modern GPUs and TPUs to their full potential — enabling training on enormous datasets.

Input Tokens (e.g. "The cat sat")
↓
Embeddings + Positional Encoding
↓
Multi-Head Self-Attention ✦ (the key innovation)
↓
Feed-Forward Layer (per position)
↓
Output: next-token probability distribution

Simplified single-layer Transformer block. Real models (GPT-4, Gemini) stack dozens of such layers.

👁️ Self-Attention in Plain English

Self-attention is the engine inside every Transformer. The idea: when processing a word, the model asks "which other words in this sentence should I look at to understand what this word means here?" — and computes a weighted blend of all of them.

🐘 The Pronoun Problem: Consider the sentence "The animal didn't cross the street because it was too tired."
What does "it" refer to: the animal or the street? Humans immediately know: animals get tired, streets don't. Self-attention lets the model resolve this by giving "it" a high attention weight toward "animal" and a low weight toward "street".

Mathematically, each word's embedding is projected into three vectors: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information do I actually provide?). Attention scores are dot products of queries against keys, scaled by the square root of the key dimension and run through a softmax (which turns raw scores into weights that sum to 1). Those weights are then used to take a weighted sum of the values.
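
The query/key/value computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any real model: the function names (`softmax`, `attention`) and the toy vectors are assumptions made up for the example, and the division by the square root of the key dimension follows the original Transformer paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: dot product of this query with each key, scaled.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the weighted blend of all value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy, hand-picked vectors for three tokens (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention pattern: how much weight it places on every token in the sentence, including itself. In a trained model the query, key, and value vectors come from learned projections rather than being written by hand.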

💡 Multi-Head Attention: Real Transformers run several attention calculations in parallel, called "heads". One head might learn grammatical relationships (subject → verb), another might focus on coreference (pronoun → noun), another on semantic similarity. Their outputs are concatenated and projected, giving the model richer representations.
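
As a rough sketch of the "several heads, then concatenate and project" idea, the plain-Python example below runs two heads over the same three tokens. Everything specific here is an assumption for illustration: the helper names (`attend`, `matmul`, `multi_head`), the hand-picked projection matrices, and the identity output projection. In a real Transformer all of these matrices are learned during training.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_rows, k_rows, v_rows):
    """Scaled dot-product attention for a single head."""
    d_k = len(k_rows[0])
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in k_rows]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, v_rows)) for i in range(len(v_rows[0]))])
    return out

def matmul(rows, matrix):
    """Multiply each row vector by a (d_in x d_out) matrix."""
    return [[sum(r[i] * matrix[i][j] for i in range(len(matrix)))
             for j in range(len(matrix[0]))] for r in rows]

def multi_head(x, heads, w_out):
    """Run every head on its own projections, concatenate, then project."""
    per_head = [attend(matmul(x, wq), matmul(x, wk), matmul(x, wv))
                for wq, wk, wv in heads]
    concat = [[value for head in per_head for value in head[t]] for t in range(len(x))]
    return matmul(concat, w_out)

# Hypothetical 4-dimensional embeddings for three tokens.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 1.0]]

# Hand-picked projections: head A looks at dimensions 0-1, head B at 2-3.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half, first_half, first_half), (second_half, second_half, second_half)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

output = multi_head(x, heads, identity)
for row in output:
    print([round(v, 3) for v in row])
```

Because each head sees a different projection of the input, the two heads produce different attention patterns over the same sentence. The final projection mixes their results back into one vector per token.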

Because self-attention by itself ignores word order, a Positional Encoding is also added to each word's embedding so the model knows where each word sits in the sentence. Otherwise "Dog bites man" and "Man bites dog" would look the same.
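
One way to build such position information is the sinusoidal scheme from the original Transformer paper, sketched below in plain Python. Many later models use other schemes, so treat this as one example rather than the method every LLM uses. The tiny vocabulary, its embedding values, and the function names are hypothetical.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: even dimensions use sin, odd dimensions use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# Hypothetical 4-dimensional embeddings for a three-word vocabulary.
vocab = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "bites": [0.5, 0.1, 0.0, 0.2],
    "man": [0.9, 0.3, 0.7, 0.1],
}

a = add_positions([vocab[w] for w in ["dog", "bites", "man"]])
b = add_positions([vocab[w] for w in ["man", "bites", "dog"]])

# "dog" in first position no longer has the same vector as "dog" in last position.
print(a[0] != b[2])
```

Without the added position vectors, both sentences would present the model with the same three embeddings, just in a different order that attention alone cannot detect.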

🏗️ What Makes an LLM "Large"?

A Large Language Model (LLM) is simply a Transformer (or similar architecture) trained at massive scale. Three numbers tell most of the story:

| Dimension | What it means | Rough scale (modern LLMs) |
| --- | --- | --- |
| Parameters | The adjustable numbers inside the neural network; its "memory", baked in during training | 7 billion → 1 trillion+ |
| Training tokens | The amount of text seen during pre-training (each token ≈ 4 characters) | 1–15 trillion tokens |
| Compute | Total work done by the GPUs/TPUs during training, measured in floating-point operations (FLOPs) | 10²³–10²⁵ FLOPs |
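
To get a feel for how the three numbers relate, a widely quoted back-of-the-envelope approximation puts training compute at roughly 6 FLOPs per parameter per training token. That rule of thumb is not part of the table above, so treat it as an assumption. The model sizes below are hypothetical values picked from the table's ranges.

```python
def training_flops(parameters, tokens, flops_per_param_token=6):
    """Rough training compute: about 6 FLOPs per parameter per token (rule of thumb)."""
    return flops_per_param_token * parameters * tokens

# Hypothetical (parameters, training tokens) pairs taken from the ranges above.
for params, tokens in [(7e9, 1e12), (70e9, 15e12)]:
    flops = training_flops(params, tokens)
    print(f"{params:.0e} params x {tokens:.0e} tokens ~ {flops:.1e} FLOPs")
```

The estimates come out in the same ballpark as the compute row in the table, which is a useful sanity check on the three scales.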

During pre-training, the model is given billions of sentences and learns one simple task over and over: predict the next token. Given "The sky is", what token comes next? "blue", "clear", "dark"? The model adjusts its billions of parameters to make better and better predictions. As a side effect, it learns grammar, facts, reasoning patterns, and even code — all from next-token prediction.
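
The "The sky is" example can be made concrete with a toy calculation. In the sketch below, the candidate tokens and their raw scores (logits) are invented for illustration. The loss shown is cross-entropy, the standard objective for this kind of training, though the text above does not name it.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores a model might assign after "The sky is".
candidates = ["blue", "clear", "dark", "banana"]
logits = [3.0, 1.5, 1.0, -2.0]

probs = softmax(logits)
for token, p in zip(candidates, probs):
    print(f"{token:>7}: {p:.3f}")

# Training loss if the actual next token in the corpus was "blue":
# the negative log of the probability the model gave to the correct token.
target = candidates.index("blue")
loss = -math.log(probs[target])
print(f"cross-entropy loss: {loss:.3f}")
```

Training nudges the parameters so that the probability of the token that actually came next goes up, which pushes this loss toward zero.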

💡 Fine-tuning: After pre-training, models are often fine-tuned on a smaller, curated dataset to specialise them, e.g. to follow instructions (instruction tuning), to be helpful and harmless (RLHF), or to write medical reports. Fine-tuning may update all of the parameters or, with parameter-efficient methods such as LoRA, only a small fraction of them. Either way, it adapts the model without retraining from scratch.

🗝️ Key Concepts Glossary

| Term | Plain-language definition |
| --- | --- |
| Token | The smallest unit of text the model processes. Not exactly a word: punctuation, suffixes, and common sub-words each become separate tokens. For example, a tokenizer might split "unhappiness" → ["un", "happiness"]. Roughly 1 token ≈ 4 characters in English. This is why LLMs can stumble on character-level tasks such as counting the letters in a word. |
| Context Window | The maximum number of tokens the model can "see" at once (input + output combined). GPT-4 Turbo: 128k tokens ≈ 100,000 words, about a full novel. Older models: 2k–4k tokens. Longer windows are more expensive to run but better at long documents. |
| Parameters | The billions of floating-point weights inside the model. These are fixed after training. They encode the model's "knowledge" implicitly: not as facts in a database, but as patterns in a massive web of numbers. |
| Pre-training | The initial large-scale training phase: expose the model to a huge text corpus (web pages, books, code) and train it to predict next tokens. Extremely expensive; done once by the lab that created the model. |
| Fine-tuning | Follow-up training on a smaller, task-specific dataset. Much cheaper. Used to make the model behave helpfully and safely, or to specialise it in a domain (legal, medical, customer service). |
| Hallucination | When the model generates confident-sounding text that is factually wrong. LLMs are probability machines: they predict plausible-sounding next tokens and don't consult a verified fact store. If the training data was wrong or thin, or the model is uncertain, it will still output something that sounds right. |

💡 Probability Machine, Not a Database: This is the most important mental model. When you ask an LLM "What is the capital of France?", it does not look up Paris in a verified table. It generates "Paris" because "Paris" is statistically the most likely continuation. That is almost always correct, but never guaranteed. For high-stakes facts, always verify against authoritative sources.

Tokenizer Peek 🔤 Interactive

Type any text below and watch it split into approximate tokens in real time. Colors cycle through token groups. Real models use learned vocabularies (BPE/WordPiece); this demo uses a whitespace + punctuation + subword heuristic.


⚙ Real tokenizers (GPT-4 uses tiktoken / BPE) map each token to a unique integer ID and learn subword splits from the training corpus. The rule-of-thumb is ~4 characters ≈ 1 token for English text.
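
For readers without the interactive demo, here is a rough stand-in for the kind of heuristic it describes: split on whitespace and punctuation, then cut long words into short pieces. This is explicitly not BPE or WordPiece. The function name `rough_tokenize` and the fixed 4-character piece size are assumptions chosen for illustration, and a real tokenizer would produce different splits.

```python
import re

def rough_tokenize(text, max_piece=4):
    """Heuristic split: words and punctuation, then long words into short pieces.

    Not BPE or WordPiece; real tokenizers learn their splits from data.
    """
    tokens = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        if chunk.isalnum() and len(chunk) > max_piece:
            # Cut long words into fixed-size pieces as a stand-in for subwords.
            tokens.extend(chunk[i:i + max_piece] for i in range(0, len(chunk), max_piece))
        else:
            tokens.append(chunk)
    return tokens

text = "Transformers process unhappiness, oddly!"
tokens = rough_tokenize(text)
print(tokens)
print("Tokens:", len(tokens))
print("Characters:", len(text))
print("Avg chars/token:", round(len(text) / len(tokens), 2))
print("Estimate from the ~4 chars/token rule:", round(len(text) / 4))
```

Comparing the heuristic count with the 4-characters-per-token estimate shows why the rule of thumb is only approximate: punctuation and short words each take a whole token.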

Attention Visualiser 👀 Interactive

Select an example sentence or type your own (up to ~12 words). The heatmap shows a heuristic "attention matrix": how much each word (row) attends to every other word (column). Click a word pill or a row to highlight its attention pattern.


⚠ This is a heuristic simulation — not a trained model. It illustrates the concept of an attention matrix.

🌀 Why LLMs Hallucinate, and What to Do About It

Hallucination is a direct consequence of how LLMs work. They are trained to produce the most statistically plausible text — not the most factually accurate text. Some causes:

✅ Mitigation strategies: Retrieval-Augmented Generation (RAG), tool use (web search, code execution), chain-of-thought prompting, asking the model to cite sources, and always verifying critical facts against authoritative primary sources.
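
To make the first of these strategies less abstract, here is a deliberately tiny sketch of the retrieval half of RAG: find the most relevant trusted snippet and put it in the prompt, so the model answers from supplied text rather than from memory alone. The word-overlap scoring, the function names, and the three snippets are all hypothetical stand-ins. Production systems typically use embedding-based search over a real document store, and the resulting prompt would then be sent to an LLM.

```python
import re

def words(text):
    """Lowercased set of the words and numbers in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Assemble a prompt that asks the model to answer from the retrieved text."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Answer using only the sources below. If they do not contain the answer, say so.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical trusted snippets standing in for a real document store.
docs = [
    "Paris is the capital of France.",
    "The Transformer architecture was introduced in 2017.",
    "Tokens are the units of text a model processes.",
]

print(build_prompt("What is the capital of France?", docs))
```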
🎲 Think of it like autocomplete on steroids: Your phone's autocomplete uses statistics to predict likely next words. LLMs do the same thing, just with billions more parameters and trillions of training tokens. Both can produce grammatically perfect nonsense.