⚡The Big Leap: "Attention Is All You Need" (2017)
Before 2017, the best models for understanding text were recurrent neural networks (RNNs). They read sentences one word at a time, left to right, like a person reading with the page covered except for the current word. By the time they reached the end of a long sentence, their memory of the first few words had faded.
Then a team of Google researchers published a paper called "Attention Is All You Need", introducing the Transformer architecture. The key insight: instead of reading word by word, look at the whole sentence at once and compute relationships between every pair of words in parallel. This has two huge consequences:
- Long-range dependencies (e.g. a pronoun referring to a noun 50 words earlier) are handled naturally: the model just "looks" directly at the relevant word.
- Because all words are processed simultaneously rather than one by one, Transformers can use modern GPUs and TPUs to their full potential, which enables training on enormous datasets.
Figure: a simplified single-layer Transformer block. Real models (GPT-4, Gemini) stack dozens of such layers.
👁️Self-Attention in Plain English
Self-attention is the engine inside every Transformer. The idea: when processing a word, the model asks "which other words in this sentence should I look at to understand what this word means here?" — and computes a weighted blend of all of them.
Consider a sentence such as "The animal didn't cross the street because it was too tired." What does "it" refer to: the animal or the street? Humans immediately know: animals get tired, streets don't. Self-attention lets the model figure this out by giving "it" a very high attention weight toward "animal" and a low weight toward "street".
Mathematically, each word's embedding is projected into three vectors: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information do I actually provide?). Attention scores are dot products of queries against keys, scaled by the square root of the key dimension and run through a softmax (which turns raw scores into weights that sum to 1). Those weights are then used to take a weighted sum of the values.
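The query/key/value recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not how production models are implemented: the vectors `Q`, `K`, and `V` are made-up numbers, the helper names are hypothetical, and real Transformers use learned projection matrices, many attention heads, and optimized tensor libraries.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, weight_rows = [], []
    for q in queries:
        # One score per key: how relevant is that token to this query?
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        weight_rows.append(weights)
    return outputs, weight_rows

# Hypothetical 2-dimensional vectors for three tokens (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row is one token's attention pattern
```

Each printed row is one token's attention pattern over all three tokens. Every row sums to 1, so each output vector is a blend of the value vectors.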
Positional Encoding is also added to each word's embedding so the model knows the order of words — otherwise "Dog bites man" and "Man bites dog" would look the same.
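One way to build such a position signal is the sinusoidal scheme from the original Transformer paper, sketched below. The text above does not say which scheme it has in mind, and many later models use learned or rotary position embeddings instead, so treat this as one possible instance. The embedding numbers in `EMB` are made up.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (d_model is assumed to be even)."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding

def add_position(embeddings):
    """Add the position vector to each token embedding, element-wise."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
            for pos, vec in enumerate(embeddings)]

# Hypothetical 4-dimensional word embeddings (made-up numbers).
EMB = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "bites": [0.5, 0.1, 0.0, 0.2],
    "man": [0.3, 0.3, 0.1, 0.0],
}

dog_bites_man = add_position([EMB[w] for w in ["dog", "bites", "man"]])
man_bites_dog = add_position([EMB[w] for w in ["man", "bites", "dog"]])

# "dog" at position 0 and "dog" at position 2 no longer share a vector.
print(dog_bites_man[0] != man_bites_dog[2])
```

Without the added position vectors, both sentences would hand the model the same three embeddings in a different order, and attention alone cannot tell the orderings apart.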
🏗️What Makes an LLM "Large"?
A Large Language Model (LLM) is simply a Transformer (or similar architecture) trained at massive scale. Three numbers tell most of the story:
| Dimension | What it means | Rough scale (modern LLMs) |
|---|---|---|
| Parameters | The adjustable numbers inside the neural network — its "memory" baked in during training | 7 billion → 1 trillion+ |
| Training tokens | The amount of text seen during pre-training (each token ≈ 4 chars) | 1 – 15 trillion tokens |
| Compute | Total arithmetic work done during training, measured in floating-point operations (FLOPs) and paid for in GPU/TPU hours | 10²³ – 10²⁵ FLOPs |
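The three rows are linked. A widely used rule of thumb, which is an outside assumption and not something stated in the table, estimates training compute as roughly 6 FLOPs per parameter per training token. The sketch below applies it to two hypothetical models picked from the ranges above. The function name `training_flops` is illustrative only.

```python
def training_flops(parameters, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * parameters * tokens

# Hypothetical model sizes chosen from the ranges in the table.
small = training_flops(parameters=7e9, tokens=2e12)
large = training_flops(parameters=70e9, tokens=15e12)

print(f"7B parameters, 2T tokens:   {small:.1e} FLOPs")
print(f"70B parameters, 15T tokens: {large:.1e} FLOPs")
```

Both estimates land close to the 10²³ – 10²⁵ FLOPs range quoted in the table, a sanity check that the three numbers are mutually consistent.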
During pre-training, the model is given billions of sentences and learns one simple task over and over: predict the next token. Given "The sky is", what token comes next? "blue", "clear", "dark"? The model adjusts its billions of parameters to make better and better predictions. As a side effect, it learns grammar, facts, reasoning patterns, and even code — all from next-token prediction.
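Next-token prediction can be illustrated with a toy model that only counts which word follows which. This is a deliberately crude stand-in: a real LLM uses a neural network trained by gradient descent and conditions on the whole preceding context, not just the previous word. The three-sentence corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which across the training sentences."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            follow_counts[current][following] += 1
    return follow_counts

def next_token_probs(follow_counts, token):
    """Convert the counts for one token into a probability distribution."""
    counts = follow_counts.get(token, Counter())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}

# Hypothetical three-sentence "training corpus".
corpus = ["the sky is blue", "the sky is clear", "the sky is blue today"]
model = train_bigram(corpus)

probs = next_token_probs(model, "is")
print(max(probs, key=probs.get), probs)
```

After "is", the toy model prefers "blue" because that continuation appeared most often in its training data. The same principle, at a vastly larger scale, drives an LLM's predictions.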
🗝️Key Concepts Glossary
| Term | Plain-language definition |
|---|---|
| Token | The smallest unit of text the model processes. Not exactly a word: punctuation, suffixes, and common sub-words can each become separate tokens, e.g. "unhappiness" → ["un", "happiness"]. Roughly 1 token ≈ 4 characters in English. Because the model sees tokens rather than individual letters, LLMs often miscount the characters in a word. |
| Context Window | The maximum number of tokens the model can "see" at once (input + output combined). GPT-4 Turbo: 128k tokens ≈ 100,000 words — about a full novel. Older models: 2k–4k tokens. Longer windows = more expensive to run but better at long documents. |
| Parameters | The billions of floating-point weights inside the model. These are fixed after training. They encode the model's "knowledge" implicitly — not as facts in a database, but as patterns in a massive web of numbers. |
| Pre-training | The initial large-scale training phase: expose the model to a huge text corpus (web pages, books, code) and train it to predict next tokens. Extremely expensive; done once by the lab that created the model. |
| Fine-tuning | Follow-up training on a smaller, task-specific dataset. Much cheaper. Used to make the model behave helpfully, safely, or specialise in a domain (legal, medical, customer service). |
| Hallucination | When the model generates confident-sounding text that is factually wrong. LLMs are probability machines: they predict plausible-sounding next tokens and do not consult a verified fact store. If the training data was wrong or thin, or the model is uncertain, it will still output something that sounds right. |
Real tokenizers use learned subword vocabularies (BPE or WordPiece; GPT-4 uses BPE via tiktoken). They learn subword splits from the training corpus and map each token to a unique integer ID. The rule of thumb is ~4 characters ≈ 1 token for English text.
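The idea of splitting text into subword pieces and mapping each piece to an integer ID can be sketched with a tiny hand-written vocabulary. This is a simplification under stated assumptions: `VOCAB` is hypothetical, the greedy longest-match rule stands in for a learned BPE merge table, and none of this reflects the actual tiktoken API.

```python
# Hypothetical vocabulary; real ones are learned from data and are far larger.
VOCAB = {"un": 0, "happiness": 1, "happy": 2, " ": 3, "the": 4, "<unk>": 5}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of text into (piece, id) pairs."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                pieces.append((piece, vocab[piece]))
                i = end
                break
        else:
            # Nothing in the vocabulary starts here: emit one unknown character.
            pieces.append((text[i], vocab["<unk>"]))
            i += 1
    return pieces

def estimate_tokens(text):
    """Rule of thumb: about 4 characters per token in English text."""
    return max(1, round(len(text) / 4))

print(tokenize("unhappiness"))
print(estimate_tokens("Attention is all you need"))
```

The model never sees the letters of "unhappiness", only the integer IDs of its pieces, which is one reason character-level questions can trip it up.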
Attention demo: a heatmap of a heuristic "attention matrix" showing how much each word (row) attends to every other word (column). It is a simulation that illustrates the concept, not the output of a trained model.
🌀Why LLMs Hallucinate — and What to Do About It
Hallucination is a direct consequence of how LLMs work. They are trained to produce the most statistically plausible text — not the most factually accurate text. Some causes:
- Training data gaps: If the model never saw reliable information about a niche topic, it will confabulate something plausible-sounding.
- Stale knowledge: Parameters are frozen at training time. Ask about events after the cutoff and the model has no data to draw from.
- Confidence from context: A question phrased with a false premise ("When did Einstein visit Mars?") primes the model to continue the premise rather than correct it.
- No uncertainty signal by default: Unlike a human who might say "I'm not sure", base LLMs have no built-in mechanism to flag low-confidence outputs.
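The last point can be made concrete. Decoding always picks some token, even when the model's next-token distribution is nearly flat. One possible proxy for confidence, offered here as an assumption and not something the list above prescribes, is the entropy of that distribution. The logits below are made-up numbers.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: 0 means certain, higher means more spread out."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token scores for two prompts (made-up numbers).
confident = softmax([8.0, 1.0, 0.5, 0.2])
unsure = softmax([1.0, 1.0, 1.0, 1.0])

for name, probs in [("confident", confident), ("unsure", unsure)]:
    top = max(range(len(probs)), key=probs.__getitem__)
    print(name, "picks token", top, "with entropy", round(entropy_bits(probs), 2), "bits")
```

Both distributions yield a top token, so the generated text looks equally assured in both cases. Only the entropy reveals that the second prediction is close to a coin toss among four options.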