📉 What Does "Learning" Even Mean?

When we train a network, we feed it data and it spits out predictions. At first those predictions are basically random — the weights were initialized randomly, after all. Learning is the process of adjusting those weights so the predictions get better.

But "better" needs a number. That number is called the loss (or cost, or error). It measures the gap between what the network predicted and what the correct answer actually was. The lower the loss, the better the model.

Analogy: Think of training like practicing free throws in basketball. After each shot, you see how far off you were (that's the loss). You then adjust your form slightly (that's updating the weights). You repeat thousands of times until you're making most shots.

Common loss functions:

| Task | Loss Function | Plain English |
| --- | --- | --- |
| Regression (predict a number) | Mean Squared Error (MSE) | Average of squared differences between predicted and actual |
| Binary classification | Binary Cross-Entropy | Penalty for being confidently wrong |
| Multi-class classification | Categorical Cross-Entropy | Log-penalty on the probability assigned to the correct class |
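To make the table concrete, here is a minimal sketch of the three losses in plain Python. The function names, the `eps` clamp, and the sample numbers are assumptions made for illustration; real libraries ship their own, more careful versions.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log-penalty for labels in {0, 1} and predicted P(label = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp so we never take log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """Log-penalty on the probability given to the correct class (one example)."""
    return -math.log(max(probs[true_class], eps))

print(mse([3.0, 5.0], [2.0, 7.0]))                    # (1 + 4) / 2 = 2.5
print(binary_cross_entropy([1], [0.9]))               # confident and right: small
print(binary_cross_entropy([1], [0.01]))              # confident and wrong: large
print(categorical_cross_entropy(2, [0.1, 0.2, 0.7]))  # -log(0.7)
```

Notice how the binary cross-entropy grows sharply as the predicted probability for the true label approaches zero. That is the "penalty for being confidently wrong" from the table.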

⛰️ Hiking Downhill in the Fog

Imagine you're dropped on a hilly landscape in thick fog. You can't see the whole valley. Your goal is to reach the lowest point. What do you do?

You feel the slope under your feet and take a step in the direction that goes downhill. Then feel again. Step again. Repeat. Eventually you settle into a low point.

The Analogy:
The landscape = the loss surface (a plot of loss vs. weight values)
Your position = the current weight values
The slope you feel = the gradient (how steeply the loss changes as you move)
Your step = one weight update
The valley floor = minimum loss = the best weights you can find

Mathematically, the gradient is a vector of partial derivatives — one per weight — each telling you: "if I nudge this weight up a tiny bit, does the loss go up or down, and how steeply?" Gradient descent then moves opposite to that slope:

w ← w − η · ∂L/∂w

where η (eta) is the learning rate, L is the loss, and w is a weight.
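Here is a small sketch of that update rule on a single weight. The loss L(w) = (w − 3)², the starting point, and the learning rate are all made up for the example; the only thing taken from the text is the update w ← w − η · ∂L/∂w.

```python
def loss(w):
    # Toy loss (assumed for illustration): lowest at w = 3.
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss: dL/dw = 2 * (w - 3).
    return 2.0 * (w - 3.0)

def gradient_descent(w, eta, steps):
    """Apply w <- w - eta * dL/dw the given number of times."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

w0 = 0.0
print(grad(w0))                       # negative slope: loss falls as w increases
print(gradient_descent(w0, 0.1, 1))   # one step moves w from 0 to about 0.6
print(gradient_descent(w0, 0.1, 50))  # after 50 steps w is very close to 3
```

The gradient at w = 0 is negative, so stepping opposite to it increases w, which is exactly the direction of the minimum.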

🎚️ The Learning Rate: Step Size Matters

The learning rate (η) controls how large each step is. It's one of the most important hyperparameters you'll tune.

Too Small (e.g. 0.00001): The ball barely moves. Training takes forever, and you might give up before reaching the bottom.
Just Right (e.g. 0.01–0.1): Smooth, steady descent. Converges to a good minimum in reasonable time.
Too Large (e.g. 1.5+): The ball overshoots the valley, bounces to the other side, overshoots again, and may diverge to infinity.
These example values are illustrative; the workable range depends on the model, the data, and the optimizer.
Tip: In practice, researchers often use learning rate schedules: start with a larger rate to explore quickly, then reduce it to fine-tune as you get close. Adaptive optimizers such as Adam also adjust the step size for each weight automatically.
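The three regimes are easy to reproduce on a toy problem. The sketch below assumes the loss L(w) = w² and a starting weight of 1.0, both chosen only for illustration. On this particular loss each update multiplies w by (1 − 2η), so any η above 1 makes the weight grow instead of shrink; other losses have different thresholds.

```python
def run(eta, steps=100, w=1.0):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

for eta in (0.00001, 0.1, 1.5):
    print(f"eta={eta}: w after 100 steps = {run(eta):.6g}")
```

With the tiny rate the weight has barely left 1.0, with 0.1 it is essentially 0, and with 1.5 it has blown up to an enormous magnitude.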

🏔️ Local vs. Global Minimum

Here's the fog problem: if you always step downhill, you might slide into a small dip that isn't the deepest valley. That's a local minimum — loss is lower than the surrounding area, but not the lowest possible.

The true lowest point is the global minimum. For deep neural networks, the loss surface has millions of dimensions and many local minima. Interestingly, research suggests that many local minima in deep nets are "good enough": they reach similar loss values and generalize well. Small models with low-dimensional loss surfaces, however, can genuinely get trapped in a poor one.

Analogy: You're hiking in fog and settle into what feels like the lowest point, but there's actually a deeper valley on the other side of the ridge you can't see. You'd need to go uphill first to find it. Techniques like momentum, random restarts, and stochastic gradient descent help escape local minima.
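Here is a rough sketch of the restart idea. The loss f(x) = x⁴ − 3x² + x is an assumption picked because it has one shallow valley (near x ≈ 1.13) and one deeper valley (near x ≈ −1.30). To keep the run reproducible, the starting points are a fixed grid rather than truly random draws.

```python
def f(x):
    # Toy loss with two valleys (assumed for illustration).
    return x**4 - 3 * x**2 + x

def df(x):
    # Derivative of the toy loss.
    return 4 * x**3 - 6 * x + 1

def descend(x, eta=0.01, steps=500):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= eta * df(x)
    return x

def best_of_restarts(starts):
    """Run descent from every start and keep the finish with the lowest loss."""
    finishes = [descend(x0) for x0 in starts]
    return min(finishes, key=f)

print(descend(2.0))    # settles in the shallow valley on the right
print(descend(-2.0))   # settles in the deeper valley on the left
print(best_of_restarts([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

A single run from x = 2.0 never finds the deeper valley, because every step goes downhill. Trying several starting points and keeping the best result is the simplest way around that.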

🔁 The Training Loop

Every time a neural network trains, it runs this loop — potentially millions of times:

1. Forward Pass: predict
2. Compute Loss
3. Backprop: compute gradients
4. Update weights
5. Repeat (next epoch / batch)

One pass through the entire training dataset is called an epoch. In practice, data is split into smaller mini-batches so weights are updated many times per epoch. This is mini-batch stochastic gradient descent, usually just called SGD.
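The loop above can be sketched in a few lines of plain Python. Everything specific here is an assumption for the example: a one-weight-plus-bias linear model, noise-free toy data generated from y = 2x + 1, MSE loss, and hand-derived gradients standing in for a real backprop implementation.

```python
import random

# Toy dataset (assumed): 20 points on the line y = 2x + 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]

def train(data, eta=0.1, epochs=200, batch_size=4, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # start from arbitrary weights
    data = list(data)
    for _ in range(epochs):              # one epoch = one pass over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # 1. Forward pass: predict
            preds = [w * x + b for x, _ in batch]
            # 2. Compute loss (MSE over the mini-batch)
            errors = [p - y for p, (_, y) in zip(preds, batch)]
            loss = sum(e * e for e in errors) / len(batch)
            # 3. Compute gradients of the loss w.r.t. w and b
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # 4. Update weights
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b, loss

w, b, final_loss = train(data)
print(w, b, final_loss)   # w should end up near 2 and b near 1
```

With 20 examples and a batch size of 4, the weights are updated five times per epoch.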

What is Backpropagation? Backprop is the algorithm that figures out how much each weight contributed to the error. It uses the chain rule from calculus to efficiently compute all those partial derivatives in one backward pass through the network — from the output layer back to the input layer. You don't need to understand every calculus detail; the key insight is that it automatically assigns blame to each weight.
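To see the chain rule at work, here is a sketch on about the smallest network possible: one input, one tanh hidden unit, one output, and a squared-error loss. The architecture, the names, and the numbers are assumptions for illustration. The backward pass is checked against a brute-force numerical estimate of the same gradients.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation
    y_hat = w2 * h          # network output
    return h, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    """Walk backward from the loss, applying the chain rule at each stage."""
    h, y_hat = forward(w1, w2, x)
    dL_dyhat = 2.0 * (y_hat - y)       # loss -> output
    dL_dw2 = dL_dyhat * h              # output -> w2
    dL_dh = dL_dyhat * w2              # output -> hidden activation
    dL_dz = dL_dh * (1.0 - h * h)      # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                 # pre-activation -> w1
    return dL_dw1, dL_dw2

def numerical_grad(w1, w2, x, y, eps=1e-6):
    """Nudge each weight up and down a tiny bit and measure the change in loss."""
    g1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
    return g1, g2

print(backprop(0.5, -1.2, 0.8, 1.0))
print(numerical_grad(0.5, -1.2, 0.8, 1.0))   # should agree closely
```

The numerical version needs two extra loss evaluations per weight, which is hopeless for millions of weights. Backprop gets every gradient from one forward pass and one backward pass.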

Roll Downhill ⛷️: Gradient Descent Simulator

Watch a "ball" descend the loss curve using gradient descent. Experiment with the learning rate to see smooth descent, crawling, or chaotic overshooting. Notice how the starting position can lead to different local minima.

Loss function: f(x) = 0.12·x² + sin(2.4x) + 0.3·cos(5x) — has a global minimum and at least one local minimum.

Slider maps linearly: left ≈ 0.01, middle ≈ 0.60, right ≈ 1.20. Try extremes to see divergence vs. crawling.
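If you'd rather run the experiment offline, here is a minimal text-only sketch of the same idea. The loss function is the one given above and the derivative is worked out by hand from it. The function names, the starting points, and the step count are assumptions and may not match the original simulator.

```python
import math

def f(x):
    return 0.12 * x**2 + math.sin(2.4 * x) + 0.3 * math.cos(5 * x)

def f_prime(x):
    # Hand-derived derivative of f.
    return 0.24 * x + 2.4 * math.cos(2.4 * x) - 1.5 * math.sin(5 * x)

def simulate(x0, eta, steps=2000):
    """Return one (step, x, f(x), f'(x)) row per step of gradient descent."""
    history = []
    x = x0
    for step in range(steps + 1):
        history.append((step, x, f(x), f_prime(x)))
        x = x - eta * f_prime(x)
    return history

for x0 in (-3.0, 3.0):
    step, x, fx, gx = simulate(x0, eta=0.01)[-1]
    print(f"start={x0:+.1f}  step={step}  x={x:.4f}  f(x)={fx:.4f}  f'(x)={gx:.2e}")

# A large learning rate no longer settles; print the first few rows to watch it jump.
for row in simulate(3.0, eta=1.2, steps=5):
    print(row)
```

The two small-learning-rate runs end in different valleys, so the starting position alone decides which minimum you reach.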