The Core Idea: How a Model Knows It's Wrong

During training, every prediction an AI model makes, whether it's labeling a photo, translating a sentence, or autocompleting your text, is compared with a known correct answer to measure how wrong it was. This measurement is called the loss (or error). A high loss means the model is way off. A loss of zero means the predictions match the correct answers exactly on the examples being measured.
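
As a minimal sketch of what "measuring how wrong" can look like, the snippet below computes mean squared error, one common loss for numeric predictions. The function name `mean_squared_error` and the sample numbers are illustrative assumptions, not taken from any particular library.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences; 0.0 means every prediction matched."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


targets = [3.0, 5.0, 7.0]
good_guess = [3.0, 5.0, 7.0]   # matches the targets exactly
bad_guess = [0.0, 0.0, 0.0]    # far from every target

perfect_loss = mean_squared_error(good_guess, targets)
poor_loss = mean_squared_error(bad_guess, targets)
print(perfect_loss, poor_loss)
```

The further the guesses sit from the targets, the larger the number that comes back, which is all the model needs in order to know it has room to improve.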

Gradient descent is the process a model uses to get that loss as low as possible. The model checks how wrong it is, makes a tiny adjustment to its internal settings (called weights and biases), then checks again. It keeps repeating this until the loss stops improving. Simple in principle — and surprisingly powerful in practice.

To understand how it adjusts those settings, we need one more concept: the gradient.

Analogy: Imagine you're hiking in dense fog on a hilly landscape. You can't see the valley (your goal), but you can feel the slope of the ground under your feet. Gradient descent says: always step in the direction that goes downhill. Keep doing that, step after step, and eventually you'll reach a low point. In AI, "feeling the slope" means computing the gradient, a mathematical description of which direction the loss increases fastest. Stepping against that direction (downhill) reduces the loss.

What Is a Gradient, Really?

The gradient is just a collection of numbers, one for each setting in the model, that tells you: "if you nudge this setting up a tiny bit, does the loss go up or down, and by how much?" It's the slope of the loss landscape at your current position.
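
The "nudge each setting and watch the loss" idea can be tried directly. The sketch below is an illustration under stated assumptions: the toy `loss` function and the helper `numerical_gradient` are made up for this example, and real training frameworks usually compute gradients with backpropagation rather than by nudging each setting.

```python
def loss(weights):
    """Hypothetical toy loss: smallest at weights == [3.0, -1.0]."""
    w0, w1 = weights
    return (w0 - 3.0) ** 2 + (w1 + 1.0) ** 2


def numerical_gradient(loss_fn, weights, nudge=1e-6):
    """Estimate d(loss)/d(weight) for each weight by nudging it slightly."""
    gradient = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += nudge
        down[i] -= nudge
        # Central difference: change in loss divided by change in the weight.
        gradient.append((loss_fn(up) - loss_fn(down)) / (2 * nudge))
    return gradient


grad = numerical_gradient(loss, [0.0, 0.0])
print(grad)  # approximately [-6.0, 2.0]
```

The negative first entry says that raising the first setting lowers the loss, and the positive second entry says that raising the second setting makes it worse.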

Once you know the gradient, the update rule is straightforward: move each setting slightly in the direction that reduces loss. Do that for all settings at once, and the whole model inches closer to making better predictions.
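
A single update can be sketched in a few lines. The toy loss, its hand-written `gradient`, and the `step_size` value of 0.1 are all assumptions chosen for illustration.

```python
def loss(weights):
    """Hypothetical toy loss: smallest at weights == [3.0, -1.0]."""
    w0, w1 = weights
    return (w0 - 3.0) ** 2 + (w1 + 1.0) ** 2


def gradient(weights):
    """Exact slope of the toy loss with respect to each weight."""
    w0, w1 = weights
    return [2.0 * (w0 - 3.0), 2.0 * (w1 + 1.0)]


def gradient_step(weights, step_size):
    """Move every weight a little against its gradient entry."""
    grads = gradient(weights)
    return [w - step_size * g for w, g in zip(weights, grads)]


before = [0.0, 0.0]
after = gradient_step(before, step_size=0.1)
print(loss(before), loss(after))
```

Subtracting the gradient, rather than adding it, is what turns "the direction the loss increases" into a downhill move.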

This is why the algorithm is called gradient descent: you use the gradient to find the downhill direction, then descend the loss landscape toward the minimum loss.

Learning Rate: The Size of Each Step

Here's where things get interesting. You know which direction is downhill — but how big a step should you take? This is controlled by a setting you choose before training begins: the learning rate.

| Learning Rate | What Happens | Problem |
| --- | --- | --- |
| Too small (e.g. 0.00001) | Tiny steps, very slow progress | Training takes forever; may stall before reaching minimum |
| Too large (e.g. 10.0) | Giant leaps across the landscape | Overshoots the valley; loss bounces or even grows |
| Just right (e.g. 0.01) | Steady, confident descent | None: this is what you aim for |
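
The three cases above can be tried on a toy problem. This is a sketch under assumptions: the loss is simply `w * w`, the helper `descend` is hypothetical, and the exact values that count as "too small" or "too large" will differ for a real model.

```python
def descend(learning_rate, steps=100, start=5.0):
    """Run gradient descent on the toy loss w * w and return the final loss."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w * w
        w = w - learning_rate * grad
    return w * w


for lr in (0.00001, 0.01, 10.0):
    print(f"learning rate {lr}: final loss {descend(lr):.3e}")
```

Starting from a loss of 25, the smallest rate barely moves, the middle rate makes clear progress, and the largest rate overshoots so badly that the loss grows on every step.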

Finding a good learning rate is part science, part art. Most practitioners start with a small value like 0.001 and adjust based on how training behaves. Modern training pipelines often use learning rate schedules that start larger and shrink over time: fast progress early on, then smaller, more careful adjustments at the end.
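
One simple schedule is exponential decay, sketched below. The function names and the values `initial_rate=0.1` and `decay=0.99` are assumptions for illustration; real pipelines use a variety of schedules.

```python
def exponential_decay(initial_rate, decay, step):
    """Learning rate after `step` updates: multiplied by `decay` each step."""
    return initial_rate * (decay ** step)


def descend_with_schedule(steps=200, start=5.0):
    """Gradient descent on the toy loss w * w with a shrinking learning rate."""
    w = start
    for step in range(steps):
        lr = exponential_decay(initial_rate=0.1, decay=0.99, step=step)
        w = w - lr * (2.0 * w)          # gradient of w * w is 2 * w
    return w


print(exponential_decay(0.1, 0.99, 0), exponential_decay(0.1, 0.99, 100))
print(descend_with_schedule())
```

Early steps use the full rate and cover ground quickly, while later steps use a fraction of it and settle gently near the minimum.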

Key takeaway: The learning rate is one of the most important choices you make when training a model. Too small and you waste time. Too large and training breaks. Getting it right is often the difference between a model that learns and one that doesn't.

Local Minima: Getting Stuck in a Smaller Dip

Here's a complication in the hiking analogy. Imagine the landscape has not just one valley, but dozens of small dips and hollows scattered across it. If you always step downhill, you might wander into a small dip and stop — thinking you've found the bottom — when a much deeper valley exists just over the next ridge.

In AI training, these small dips are called local minima. The true best answer — the lowest loss possible — is the global minimum. Getting stuck in a local minimum means the model settled for "good enough" rather than "best possible."
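
To make the difference concrete, here is a sketch using a made-up one-dimensional loss with one shallow dip and one deeper valley. The function, the starting points, and the learning rate are all assumptions chosen so that the effect is easy to see.

```python
def loss(w):
    """Hypothetical bumpy loss with a shallow dip and a deeper valley."""
    return w ** 4 - 3.0 * w ** 2 + w


def slope(w):
    """Derivative of the loss above."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(start, learning_rate=0.01, steps=1000):
    """Plain gradient descent from a given starting point."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * slope(w)
    return w


from_right = descend(start=2.0)     # settles in the shallow dip
from_left = descend(start=-2.0)     # settles in the deeper valley
print(from_right, loss(from_right))
print(from_left, loss(from_left))
```

Both runs follow the same rule and both stop where the slope is zero, but only the run that started on the left ends up at the lower loss.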

In practice, large neural networks (especially LLMs) have so many settings that the loss landscape is extraordinarily complex, and researchers have found that poor local minima are rarely a serious problem at scale. A more common obstacle is a saddle point: a place where the gradient is close to zero, so progress stalls, even though the loss still curves downward in some directions. Modern variants of gradient descent, like Adam, add momentum and adaptive step sizes, which help the model keep moving through these flat regions.
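
The sketch below compares plain gradient descent with an Adam-style update near a saddle point. It rests on several assumptions: the toy loss `x**2 - y**2` has its saddle at (0, 0) and no lowest point at all, so "escaping" here only means moving away from the flat region, and the hyperparameters are the commonly cited Adam defaults apart from the learning rate.

```python
import math


def gradient(x, y):
    """Gradient of the toy saddle loss f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y


def plain_descent(x, y, learning_rate=0.01, steps=100):
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


def adam_descent(x, y, learning_rate=0.01, steps=100,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    params = [x, y]
    m = [0.0, 0.0]   # running average of gradients (momentum)
    v = [0.0, 0.0]   # running average of squared gradients
    for t in range(1, steps + 1):
        grads = gradient(*params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]


start = (1.0, 1e-6)   # almost exactly on the flat line y = 0
_, y_plain = plain_descent(*start)
_, y_adam = adam_descent(*start)
print(y_plain, y_adam)
```

Because the adaptive update divides by the recent gradient size, its steps stay roughly the size of the learning rate even where the slope is tiny, so `y_adam` moves well away from the saddle while `y_plain` has barely changed.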

The Training Loop, Step by Step

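Put it all together and here's the full cycle that runs thousands or millions of times during AI training:

1. Make a prediction using the current weights and biases.
2. Measure the loss: how wrong was the prediction?
3. Compute the gradient: for each setting, which direction increases the loss?
4. Adjust every setting a small step in the opposite direction, scaled by the learning rate.
5. Repeat with more examples until the loss stops improving.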

That's the entire secret. A model doesn't think, reason, or memorize rules — it just runs this loop over and over until the predictions get good. The loop is the learning.
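
The cycle this section describes (predict, measure the loss, compute the gradient, adjust, repeat) can be sketched for about the smallest possible model, a straight line `y = w * x + b`. The data, the `train` function, and the hyperparameters are assumptions for illustration; a real neural network has far more settings and relies on a framework to compute gradients.

```python
def train(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by repeating predict, measure, adjust."""
    w, b = 0.0, 0.0
    n = len(xs)
    loss = None
    for _ in range(steps):
        # 1. Predict with the current settings.
        predictions = [w * x + b for x in xs]
        # 2. Measure the error (mean squared error).
        errors = [p - y for p, y in zip(predictions, ys)]
        loss = sum(e * e for e in errors) / n
        # 3. Compute the gradient of the loss for each setting.
        grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2.0 * e for e in errors) / n
        # 4. Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
w, b, final_loss = train(xs, ys)
print(w, b, final_loss)
```

The code never states the rule "multiply by 2 and add 1". The loop recovers values close to it purely by reducing the loss.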

If you want to see how neural networks are structured before they start learning, check out our article on what a neural network is, or dive into Lesson 4: Neural Networks for an interactive walkthrough.

Why Gradient Descent Matters So Much

Gradient descent is not a niche technique. It is the engine of modern AI. Nearly every large language model (LLM), image generator and speech recogniser in use today, along with many recommendation systems, was trained using some form of gradient descent.

The variants have different names — Stochastic Gradient Descent (SGD), Adam, AdamW, RMSProp — but the core loop is always the same: measure the error, compute which direction reduces it, take a step, repeat.

What makes this remarkable is that gradient descent was not invented for AI. It's a general-purpose optimisation technique from mathematics. AI researchers borrowed it, refined it, and scaled it up to models with hundreds of billions of settings — and it still works.

Analogy: Think of training a large language model as hiking a landscape with a trillion hills and valleys, in pitch darkness, using only the feel of the ground underfoot. Gradient descent navigates that seemingly impossible terrain one small step at a time and, after thousands or millions of steps, arrives somewhere useful.

Frequently Asked Questions

Why is it called gradient "descent"?

The word "gradient" refers to the slope of the loss function, a mathematical measure of how steeply the error changes as you adjust the model's settings. "Descent" means you're moving downward along that slope, toward lower loss. Put them together and the name is close to literal: you use the gradient to find the downhill direction and descend the loss landscape, just as a hiker descends a slope.

What is a good learning rate?

There's no universal answer, but a common starting point is somewhere between 0.0001 and 0.01. The right value depends on the model architecture, the dataset, and the specific optimiser you're using. In practice, many practitioners run a quick learning rate finder: training for a few steps across a range of values and picking the one where the loss drops fastest without becoming unstable.
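
A learning rate finder can be sketched in a simplified form. Everything here is an assumption for illustration: the toy loss `w * w`, the helper names, the candidate list, and the rule that a run counts as unstable if it ends with a higher loss than it started with. Finders in real tools often sweep the rate within a single run and inspect the loss curve instead.

```python
def loss_after_training(learning_rate, steps=20, start=5.0):
    """Train briefly on the toy loss w * w and report the resulting loss."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * (2.0 * w)
    return w * w


def find_learning_rate(candidates, start=5.0):
    """Pick the candidate with the lowest short-run loss that stayed stable."""
    starting_loss = start * start
    best_rate, best_loss = None, starting_loss
    for rate in candidates:
        result = loss_after_training(rate, start=start)
        # Runs that end at or above the starting loss never become the best.
        if result < best_loss:
            best_rate, best_loss = rate, result
    return best_rate


candidates = [0.0001, 0.001, 0.01, 0.1, 1.5, 10.0]
best = find_learning_rate(candidates)
print(best)
```

On this toy problem the two largest candidates make the loss grow and are never selected, while the smaller ones all improve it at different speeds.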

Is gradient descent used in deep learning and LLMs?

Absolutely. It underpins all of it. Deep learning is essentially the combination of neural networks (the structure) and gradient descent (the training method), with the gradients computed by backpropagation. LLMs like GPT and Claude are trained by running gradient descent across enormous datasets, typically with variants such as Adam or AdamW that cope with the scale and complexity of billions of parameters. Without gradient descent, modern AI as we know it would not exist.

Ready to see this in action? Lesson 5: Training & Gradient Descent lets you adjust the learning rate, watch the loss curve change in real time, and develop an intuition that no amount of reading can match.

Key takeaway: Gradient descent is the single algorithm behind almost all of modern AI. Master the intuition (measure error, follow the slope downhill, repeat) and you'll understand the engine running inside every neural network and LLM.
