What Does "Deep" Actually Mean?
A neural network is layers of simple mathematical units (neurons) stacked together. When we say deep learning, we just mean there are many layers, sometimes hundreds. Each layer transforms the data slightly, learning a more abstract representation than the one before it.
raw pixels → edges → shapes → facial parts → "a face"
Earlier layers learn low-level features (edges, colours); later layers learn high-level concepts (objects, scenes). The network figures out what to look for; you don't program the rules.
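To make "stacked layers" concrete, here is a minimal sketch in plain Python. The helper names `dense_layer` and `forward` are hypothetical, and the weights are hand-picked for illustration; a real network would learn them from data.

```python
def relu(x):
    """Keep positive values, zero out negatives."""
    return max(0.0, x)


def dense_layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs, adds a bias, applies ReLU."""
    return [
        relu(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the data through every layer in order; each output feeds the next layer."""
    for weights, biases in layers:
        inputs = dense_layer(inputs, weights, biases)
    return inputs


# Illustrative, hand-picked weights (assumption: a trained network learns these).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 neurons
    ([[1.0, 1.0]], [-0.5]),                   # layer 2: 2 inputs -> 1 neuron
]

output = forward([2.0, 1.0], layers)
print(output)  # [2.0]
```

Making the network "deeper" here just means adding more `(weights, biases)` pairs to `layers`.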
An Image Is Just a Grid of Numbers
A computer has no eyes. Every image is stored as a grid (matrix) of pixels. Each pixel is one or more numbers representing brightness or colour:
- Grayscale image: one number per pixel, from 0 (black) to 255 (white).
- Colour (RGB) image: three numbers per pixel: one for Red, one for Green, one for Blue.
- A 224×224 colour image is therefore an array of 3 × 224 × 224 = 150,528 numbers.
To a neural network, an image is just a nested list of numbers such as [[23, 31, 29, 45, …], [18, 22, 40, …], …]. Its job is to learn which number patterns correspond to "cat."
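As a rough illustration of that idea, the sketch below stores a tiny grayscale image and a tiny colour image as nested Python lists. The pixel values are made up for the example and do not come from a real photo.

```python
# A 3x4 grayscale image: one brightness value (0 to 255) per pixel.
gray = [
    [0, 64, 128, 255],
    [23, 31, 29, 45],
    [18, 22, 40, 90],
]

# A 2x2 colour image: each pixel is an (R, G, B) triple.
rgb = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

height, width = len(gray), len(gray[0])
brightest = max(max(row) for row in gray)
red, green, blue = rgb[0][0]  # the top-left pixel is pure red

print(height, width, brightest)  # 3 4 255
print(red, green, blue)          # 255 0 0
```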
| Object | Size | Numbers stored |
|---|---|---|
| Tiny grayscale icon | 28×28 | 784 |
| Thumbnail (colour) | 64×64 | 12,288 |
| ImageNet input (colour) | 224×224 | 150,528 |
| 4K photo (colour) | 3840×2160 | 24,883,200 |
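The counts in the table follow from width × height × channels. A small sketch that reproduces them, where `numbers_stored` is a hypothetical helper name and grayscale is assumed to use 1 channel and colour 3:

```python
def numbers_stored(width, height, channels):
    """Total values needed to store an image: one per pixel per channel."""
    return width * height * channels


# (width, height, channels): 1 channel for grayscale, 3 for RGB colour.
examples = {
    "Tiny grayscale icon": (28, 28, 1),
    "Thumbnail (colour)": (64, 64, 3),
    "ImageNet input (colour)": (224, 224, 3),
    "4K photo (colour)": (3840, 2160, 3),
}

for name, (width, height, channels) in examples.items():
    print(f"{name}: {numbers_stored(width, height, channels):,}")
```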
Convolution: The Sliding Spotlight
The secret weapon of vision AI is the convolutional layer. Instead of connecting every pixel to every neuron (wasteful and slow), a filter (kernel), a tiny 3×3 or 5×5 grid of numbers, slides across the image. At each position it multiplies its values by the pixels underneath and sums them up. The result highlights whatever pattern the filter was tuned to detect.
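The sliding multiply-and-sum can be sketched in a few lines of plain Python. This is a minimal version with no padding, so the output is slightly smaller than the input; `convolve2d` is a hypothetical helper name, and the test image and filter are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the map of weighted sums.

    Like most CNN libraries, this does not flip the kernel (strictly, cross-correlation).
    """
    k = len(kernel)  # kernel is k x k
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        output.append(row)
    return output


# Test image: dark left half, bright right half (a vertical edge in the middle).
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Illustrative filter that responds to dark-to-bright steps from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The output is 0 over the flat regions and 27 where the window straddles the edge, which is the "highlighting" described above.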
A deep CNN stacks many such filters. Early filters detect primitive patterns (edges, blobs of colour). Deeper filters detect increasingly complex patterns: corners → curves → eyes → faces. The network learns the best filter values during training via gradient descent.
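To give a feel for how filter values can be learned, here is a deliberately tiny sketch: gradient descent recovers a 3-value 1D filter from example inputs and outputs. This is a simplification under stated assumptions: real CNNs learn 2D filters across many layers from labelled images, not from a known target filter, and the signal, learning rate, and step count below are made up for the example.

```python
def correlate1d(signal, kernel):
    """Slide a 1D kernel along the signal, taking a weighted sum at each position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [0, 1, 3, 2, 5, 4, 6, 1, 0, 2]    # made-up input
true_kernel = [-1.0, 0.0, 1.0]             # the "unknown" filter to recover
target = correlate1d(signal, true_kernel)  # outputs the learned filter should match

kernel = [0.0, 0.0, 0.0]                   # start with no knowledge
learning_rate = 0.01

for step in range(1000):
    predicted = correlate1d(signal, kernel)
    errors = [p - t for p, t in zip(predicted, target)]
    # Gradient of the mean squared error with respect to each kernel value.
    grads = [
        2 * sum(e * signal[i + j] for i, e in enumerate(errors)) / len(errors)
        for j in range(len(kernel))
    ]
    # Step each value a little in the direction that reduces the error.
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

print([round(w, 3) for w in kernel])
```

After training, the learned values should sit very close to `true_kernel`, even though the loop was never told what the filter was, only what outputs it should produce.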
-1 -1 -1
-1  8 -1
-1 -1 -1
This kernel multiplies the centre pixel by 8 and subtracts its 8 neighbours. Where the centre matches its neighbours (a flat region), the output is near 0. Where it contrasts sharply with them (an edge), the output is large in magnitude. That's how the filter highlights edges.
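The arithmetic is easy to check on two 3×3 patches. In this sketch the patch values are made up and `response` is a hypothetical helper name:

```python
edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def response(patch, kernel):
    """Multiply a 3x3 patch by the kernel element-wise and sum the products."""
    return sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))


flat_patch = [
    [50, 50, 50],
    [50, 50, 50],
    [50, 50, 50],
]
contrast_patch = [
    [50, 50, 50],
    [50, 200, 50],
    [50, 50, 50],
]

print(response(flat_patch, edge_kernel))      # 0    (8*50 - 8*50)
print(response(contrast_patch, edge_kernel))  # 1200 (8*200 - 8*50)
```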
Pooling: Shrink and Keep What Matters
Pooling (usually max pooling) slides a small window across the feature map and keeps only the biggest value in each region. This shrinks the map while preserving the strongest detected patterns, which reduces computation and makes the network tolerant to small shifts in position.
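A minimal max-pooling sketch, assuming non-overlapping 2×2 windows (a common choice, though not the only one). `max_pool` is a hypothetical helper name and the feature map values are made up.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4×4 map shrinks to 2×2, and the strongest responses (6 and 9) survive. If the 9 moved one cell within its window, the pooled output would not change, which is where the tolerance to small shifts comes from.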
Why CNNs Revolutionised Vision AI
Before 2012, the best image classifiers relied on hand-crafted features: experts wrote rules for what to look for. Then AlexNet, a deep CNN, cut the error rate on the ImageNet benchmark nearly in half, overnight. It learned its own features directly from 1.2 million labelled images. The lesson: given enough data and compute, let the network discover the features itself.
Everyday uses today
- Photo tagging: your phone identifies faces and scenes automatically.
- Medical imaging: detecting tumours in X-rays and signs of disease in retinal scans, in some studies at parity with clinicians.
- Self-driving perception: recognising pedestrians, lanes, and traffic signs in real time.
- Quality control: spotting defects on factory lines faster than human inspectors.
- Content moderation: flagging violent or inappropriate imagery at platform scale.
| Year | Model | ImageNet top-5 error |
|---|---|---|
| 2010 | Hand-crafted | 28.2 % |
| 2012 | AlexNet (CNN) | 16.4 % |
| 2014 | VGGNet | 7.3 % |
| 2015 | ResNet | 3.6 % |
| 2017 | SENet | 2.25 % |
Human-level error on this benchmark is roughly 5%.
Paint a digit (0, 1, 3, or 7) on the grid, then hit Recognise. The recogniser compares your drawing to built-in templates using pixel-matching, the same intuition behind a real CNN, just much simpler.
Note: This is a simplified template matcher, not a real neural network. It counts how many pixels in your drawing overlap with each stored template and normalises the score. Real CNNs learn their own multi-layer features from millions of examples, but the core idea (compare learned patterns to the input) is the same.
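A template matcher of this kind might look roughly like the sketch below. The 3×3 templates are hypothetical stand-ins (the demo's actual grids are not shown here), and dividing the overlap by the template's ink count is one assumed way to normalise the score.

```python
# Hypothetical 3x3 templates (1 = ink, 0 = empty); assumed for illustration only.
TEMPLATES = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 0, 1]],
}


def match_score(drawing, template):
    """Fraction of the template's ink pixels that the drawing also fills in."""
    overlap = sum(
        d * t
        for d_row, t_row in zip(drawing, template)
        for d, t in zip(d_row, t_row)
    )
    ink = sum(sum(row) for row in template)
    return overlap / ink


def recognise(drawing):
    """Score the drawing against every template and return the best label."""
    scores = {label: match_score(drawing, t) for label, t in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores


# A slightly sloppy "7": the bottom stroke has drifted one cell to the left.
drawing = [
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
]

best, scores = recognise(drawing)
print(best, scores)
```

Here the drawing covers 4 of the 5 ink pixels of "7" but only 2 of the 3 for "1", so "7" wins. Unlike a CNN, nothing is learned: the templates are fixed by hand.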
Select a filter (kernel), apply it to the 8×8 test image, and see the output side-by-side. Brighter cells = stronger response. The 3×3 kernel values are shown so you can see exactly what the math does.
Math: for each pixel, multiply the 3×3 neighbourhood by the kernel, sum the products, and clamp to the range 0 to 255. For edge/gradient kernels the raw output can be negative, so displayed values are the absolute value, then scaled. Borders use zero-padding.
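Those steps can be sketched as follows. The exact scaling rule is not specified above, so this version assumes the strongest response is stretched to 255; the function name `apply_kernel_for_display` is hypothetical, and a 3×3 image stands in for the 8×8 test image to keep the example short.

```python
def apply_kernel_for_display(image, kernel):
    """Zero-padded 3x3 filtering, then absolute value and scaling into 0..255."""
    h, w = len(image), len(image[0])
    raw = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for i in range(3):
                for j in range(3):
                    yy, xx = y + i - 1, x + j - 1
                    # Zero-padding: pixels outside the image contribute 0.
                    if 0 <= yy < h and 0 <= xx < w:
                        total += kernel[i][j] * image[yy][xx]
            raw[y][x] = abs(total)  # negative responses are shown by magnitude
    peak = max(max(row) for row in raw)
    if peak == 0:
        return raw
    # Assumed scaling: stretch so the strongest response maps to 255, then clamp.
    return [[min(255, round(v * 255 / peak)) for v in row] for row in raw]


edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

# One bright pixel on a black background.
image = [
    [0, 0, 0],
    [0, 10, 0],
    [0, 0, 0],
]

display = apply_kernel_for_display(image, edge_kernel)
print(display)  # [[32, 32, 32], [32, 255, 32], [32, 32, 32]]
```

The centre responds with 8 × 10 = 80, and each neighbour sees the bright pixel through a -1 weight, giving -10, which is displayed as 10 before scaling.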