📉 What Does "Learning" Even Mean?
When we train a network, we feed it data and it spits out predictions. At first those predictions are basically random — the weights were initialized randomly, after all. Learning is the process of adjusting those weights so the predictions get better.
But "better" needs a number. That number is called the loss (or cost, or error). It measures the gap between what the network predicted and what the correct answer actually was. The lower the loss, the closer the predictions are to the correct answers.
Common loss functions:
| Task | Loss Function | Plain English |
|---|---|---|
| Regression (predict a number) | Mean Squared Error (MSE) | Average of the squared differences between predicted and actual values |
| Binary classification | Binary Cross-Entropy | Log-penalty that grows sharply when the model is confidently wrong |
| Multi-class classification | Categorical Cross-Entropy | Log-penalty on the probability assigned to the correct class |
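To make the table concrete, here is a minimal sketch of the three losses in plain Python. The function names and the toy inputs are made up for illustration, and real libraries add details (batching, numerical tricks) that are skipped here.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Labels are 0 or 1; p_pred is the predicted probability of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so we never take log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """Loss for one example: -log of the probability given to the correct class."""
    return -math.log(max(probs[true_class], eps))

print(mse([3.0, 5.0], [2.0, 7.0]))                    # (1 + 4) / 2 = 2.5
print(binary_cross_entropy([1], [0.9]))               # confident and right: small
print(binary_cross_entropy([1], [0.01]))              # confident and wrong: large
print(categorical_cross_entropy(0, [0.7, 0.2, 0.1]))  # -log(0.7)
```

Notice how the binary cross-entropy jumps when the model puts only 1% probability on the true label. That is the "confidently wrong" penalty from the table.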
⛰️ Hiking Downhill in the Fog
Imagine you're dropped on a hilly landscape in thick fog. You can't see the whole valley. Your goal is to reach the lowest point. What do you do?
You feel the slope under your feet and take a step in the direction that goes downhill. Then feel again. Step again. Repeat. Eventually you settle into a low point.
- Your position = the current weight values
- The slope you feel = the gradient (how steeply the loss changes as you move)
- Your step = one weight update
- The valley floor = minimum loss = the best weights you can find
Mathematically, the gradient is a vector of partial derivatives — one per weight — each telling you: "if I nudge this weight up a tiny bit, does the loss go up or down, and how steeply?" Gradient descent then moves opposite to that slope:
w ← w − η · ∂L/∂w
where η (eta) is the learning rate, L is the loss, and w is a weight.
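Here is a small sketch of that update rule in Python. The loss function is a hypothetical two-weight bowl chosen for illustration, and the gradient is estimated by literally nudging each weight a tiny bit, which mirrors the description above. Real frameworks compute gradients analytically instead.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its minimum at w = [3, -1].
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def numerical_gradient(f, w, h=1e-5):
    """Estimate each partial derivative by nudging one weight at a time."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def gradient_descent(f, w, eta=0.1, steps=100):
    for _ in range(steps):
        grad = numerical_gradient(f, w)
        # The update rule: w <- w - eta * dL/dw, applied to every weight.
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

w_final = gradient_descent(loss, [0.0, 0.0])
print(w_final)  # ends up very close to [3, -1]
```

Starting from `[0.0, 0.0]`, each step moves both weights a little closer to the bottom of the bowl.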
🎚️ The Learning Rate: Step Size Matters
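The learning rate (η) controls how large each step is. It's one of the most important hyperparameters you'll tune. Too small and training crawls; too large and each step overshoots the minimum, so the loss can bounce around or even diverge.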
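A quick way to feel the effect of step size is to run the same descent with different learning rates. The sketch below assumes the simplest possible loss, L(w) = w², and three hand-picked values of η; both are illustrative choices, not recommendations.

```python
def descend(eta, w=5.0, steps=20):
    """Run gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta}: w after 20 steps = {descend(eta):.4g}")
```

With η = 0.01 the weight has barely moved from 5 after 20 steps, with η = 0.4 it is essentially at the minimum (w = 0), and with η = 1.1 every step overshoots so badly that w grows instead of shrinking.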
🏔️ Local vs. Global Minimum
Here's the fog problem: if you always step downhill, you might slide into a small dip that isn't the deepest valley. That's a local minimum — loss is lower than the surrounding area, but not the lowest possible.
The true lowest point is the global minimum. In practice, the loss surface of a deep neural network has millions of dimensions and many local minima. Interestingly, research suggests that many of those local minima are "good enough": they reach a loss close to the best achievable and generalize well. Small models with low-dimensional, bumpy loss surfaces, on the other hand, can genuinely get trapped in a poor one.
🔁 The Training Loop
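Every time a neural network trains, it runs this loop, potentially millions of times:
1. Feed a batch of data through the network to get predictions.
2. Compute the loss by comparing the predictions with the correct answers.
3. Compute the gradient of the loss with respect to each weight.
4. Update each weight with a small step against its gradient.
One pass through the entire training dataset is called an epoch. In practice, the data is split into smaller mini-batches so the weights are updated many times per epoch. This is mini-batch stochastic gradient descent, usually just called SGD.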
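The loop (predict, measure the error, compute gradients, update) can be sketched in a few lines of plain Python. Everything specific here is an assumption for illustration: the "network" is just a line `w * x + b`, the dataset is a noise-free toy generated from y = 2x + 1, and the batch size and learning rate are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy dataset (illustrative): 41 points on the line y = 2x + 1.
xs = [i / 10 for i in range(-20, 21)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0   # the weights being learned
eta = 0.1         # learning rate
batch_size = 8

for epoch in range(200):            # one epoch = one full pass over the data
    random.shuffle(data)            # new mini-batches every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        n = len(batch)
        # 1. Forward pass: predictions for this mini-batch.
        preds = [w * x + b for x, _ in batch]
        # 2-3. Gradients of the batch MSE with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
        grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, batch)) / n
        # 4. Update: step against the gradient.
        w -= eta * grad_w
        b -= eta * grad_b

print(round(w, 3), round(b, 3))  # should land close to 2.0 and 1.0
```

With 41 points and a batch size of 8, each epoch performs six weight updates rather than one, which is the point of mini-batching.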
Watch a "ball" descend the loss curve using gradient descent. Experiment with the learning rate to see smooth descent, crawling, or chaotic overshooting. Notice how the starting position can lead to different local minima.
Loss function: f(x) = 0.12·x² + sin(2.4x) + 0.3·cos(5x) — has a global minimum and at least one local minimum.
Slider maps linearly: left ≈ 0.01, middle ≈ 0.60, right ≈ 1.20. Try extremes to see divergence vs. crawling.
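If you'd rather see the same experiment in code, here is a minimal sketch that runs gradient descent on the loss function above from two starting positions. The derivative is worked out by hand from the formula; the starting points, the small learning rate (0.01) and the step count are illustrative choices.

```python
import math

def f(x):
    return 0.12 * x**2 + math.sin(2.4 * x) + 0.3 * math.cos(5 * x)

def df(x):
    # Derivative of f, term by term from the formula above.
    return 0.24 * x + 2.4 * math.cos(2.4 * x) - 1.5 * math.sin(5 * x)

def descend(x, eta=0.01, steps=2000):
    for _ in range(steps):
        x -= eta * df(x)   # x <- x - eta * f'(x)
    return x

for start in (-1.0, 2.0):
    x_end = descend(start)
    print(f"start={start:+.1f} -> x={x_end:.3f}, f(x)={f(x_end):.3f}")
```

The run that starts at x = -1 settles in a valley near x ≈ -0.63 with a loss of roughly -1.25, while the run that starts at x = 2 gets stuck in a shallower valley near x ≈ 1.88 with a loss of roughly -0.86. Same algorithm, same learning rate, different starting position, different minimum.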