Deep Learning & Computer Vision (CNNs)

🏗️ What Does "Deep" Actually Mean?

A neural network is layers of simple mathematical units (neurons) stacked together. When we say deep learning, we just mean there are many layers — sometimes hundreds. Each layer transforms the data slightly, learning a more abstract representation than the one before it.

Analogy: Reading a Face. Imagine how your own brain recognises a friend's face. Your eyes first pick up raw light intensities. Then your visual cortex detects edges (where light changes sharply). The next region groups edges into curves and shapes. Higher up, regions recognise eyes, noses, mouths. Finally the very top says "that's Alice." A deep network works in a loosely similar way: layer by layer, from raw pixels to meaningful concepts.

📷 Input (raw pixels) → 〰️ Layer 1 (edges) → ◯ Layer 2 (shapes) → 👁️ Layer 3 (facial parts) → 🙂 Output ("a face")

Earlier layers learn low-level features (edges, colours); later layers learn high-level concepts (objects, scenes). The network figures out what to look for — you don't program the rules.
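
To make "layers stacked together" concrete, here is a minimal sketch of a forward pass through two small layers. The weights, biases, and function names are hypothetical and chosen only so the arithmetic is easy to follow; a real network would learn its numbers during training and would have far more neurons.

```python
def relu(values):
    """Replace negative numbers with zero (a common neuron activation)."""
    return [max(0.0, v) for v in values]

def dense_layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs plus a bias."""
    sums = [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]
    return relu(sums)

def forward(inputs, layers):
    """Pass the data through every layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical two-layer network; a trained network would learn these numbers.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 neurons
    ([[1.0, 1.0]], [-0.5]),                   # layer 2: 2 inputs -> 1 neuron
]
print(forward([2.0, 1.0], layers))  # [2.0]
```

Each layer only sees the output of the layer before it, which is what lets later layers build on simpler features found earlier.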

🔢 An Image Is Just a Grid of Numbers

A computer has no eyes. Every image is stored as a grid (matrix) of pixels. Each pixel is one or more numbers representing brightness or colour:

Grayscale image: one number per pixel, from 0 (black) to 255 (white).
Colour (RGB) image: three numbers per pixel — one for Red, one for Green, one for Blue.
A 224×224 colour image is therefore an array of 3 × 224 × 224 = 150,528 numbers.

💡 Key Insight: When you look at a photo of a cat, you see a cat. The computer sees something like: [[23, 31, 29, 45, …], [18, 22, 40, …], …]. Its job is to learn which number patterns correspond to "cat."
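
As a rough illustration of what the computer "sees", the sketch below stores a tiny image as nested Python lists. The pixel values and variable names are made up for the example; real code would normally load an image file into an array library instead.

```python
# A hypothetical 3x4 grayscale image: one brightness value (0-255) per pixel.
gray = [
    [23, 31, 29, 45],
    [18, 22, 40, 200],
    [15, 19, 210, 255],
]
height, width = len(gray), len(gray[0])
print(height, width)  # 3 4
print(gray[2][3])     # 255 (bottom-right pixel, pure white)

# A hypothetical 2x2 colour image: each pixel holds three numbers (R, G, B).
rgb = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
red, green, blue = rgb[0][0]
print(red, green, blue)  # 255 0 0 (a pure red pixel)
```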

Object	Size	Numbers stored
Tiny grayscale icon	28×28	784
Thumbnail (colour)	64×64	12,288
ImageNet input (colour)	224×224	150,528
4K photo (colour)	3840×2160	24,883,200
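
The counts in the table follow from width × height × channels. The short sketch below reproduces them; `values_stored` is a hypothetical helper name used only for this example.

```python
def values_stored(width, height, channels):
    """Numbers needed to store an image: one per pixel per colour channel."""
    return width * height * channels

# (label, width, height, channels): 1 channel = grayscale, 3 = RGB colour
examples = [
    ("Tiny grayscale icon", 28, 28, 1),
    ("Thumbnail (colour)", 64, 64, 3),
    ("ImageNet input (colour)", 224, 224, 3),
    ("4K photo (colour)", 3840, 2160, 3),
]
for label, width, height, channels in examples:
    print(f"{label}: {values_stored(width, height, channels):,}")
```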

🔍 Convolution: The Sliding Spotlight

The secret weapon of vision AI is the convolutional layer. Instead of connecting every pixel to every neuron (wasteful and slow), a filter (kernel) — a tiny 3×3 or 5×5 grid of numbers — slides across the image. At each position it multiplies its values by the pixels underneath and sums them up. The result highlights whatever pattern the filter was tuned to detect.

Analogy: A Magnifying Stamp. Imagine a small rubber stamp with a pattern (say, a horizontal line). You stamp it at every position on the image and record how well the stamp matches. Where there's a horizontal edge, you get a high score. Where there isn't, you get a low score. That score map is the feature map: a new image that says "here are the horizontal edges."
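
The sliding multiply-and-sum can be sketched in a few lines of plain Python. This is a minimal version under simplifying assumptions: no padding, a stride of one pixel, and a hand-picked horizontal-edge kernel and 5×5 test image that are invented for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 5x5 image: dark rows on top, bright rows below (a horizontal edge).
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9],
]
# Horizontal-edge kernel: responds when the row below is brighter than the row above.
kernel = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]
for row in convolve2d(image, kernel):
    print(row)
# [27, 27, 27]
# [27, 27, 27]
# [0, 0, 0]
```

The high scores sit where the window straddles the dark-to-bright boundary, and the score drops to 0 once the window is entirely inside the bright region.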

A deep CNN stacks many such filters. Early filters detect primitive patterns (edges, blobs of colour). Deeper filters detect increasingly complex patterns — corners → curves → eyes → faces. The network learns the best filter values during training via gradient descent.
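
To give a feel for how filter values can be learned rather than hand-written, here is a deliberately tiny sketch: a two-value 1-D filter fitted by gradient descent on the mean squared error. The signal, the "true" kernel, the learning rate, and the step count are all assumptions made for this toy example; real CNN training uses 2-D filters, many layers, and automatic differentiation.

```python
def conv1d(signal, kernel):
    """Slide a 1-D kernel along the signal (no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def train_kernel(signal, target, kernel_size=2, learning_rate=0.1, steps=500):
    """Fit kernel values by gradient descent on the mean squared error."""
    kernel = [0.0] * kernel_size
    n = len(target)
    for _ in range(steps):
        output = conv1d(signal, kernel)
        errors = [o - t for o, t in zip(output, target)]
        # d(MSE)/d(kernel[j]) = (2/n) * sum_i errors[i] * signal[i + j]
        gradients = [
            (2.0 / n) * sum(errors[i] * signal[i + j] for i in range(n))
            for j in range(kernel_size)
        ]
        kernel = [w - learning_rate * g for w, g in zip(kernel, gradients)]
    return kernel

# Toy data: the target is what a simple difference filter would produce.
signal = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.5, 1.0]
true_kernel = [1.0, -1.0]
target = conv1d(signal, true_kernel)

learned = train_kernel(signal, target)
print([round(w, 3) for w in learned])  # close to [1.0, -1.0]
```

The filter starts at all zeros and is nudged a little on every step in the direction that reduces the error, which is the same idea a CNN applies to all of its filters at once.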

3×3 Edge filter

This kernel has 8 at the centre and −1 in each of the 8 surrounding cells, so it multiplies the centre pixel by 8 and subtracts its 8 neighbours. Where the centre matches its neighbours (flat region), the output is near 0. Where it contrasts sharply (an edge), the output is large in magnitude. That's how the filter highlights edges.
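
A quick worked check of that behaviour, assuming the kernel described above has 8 at the centre and -1 in the eight surrounding cells. The three 3×3 patches are hypothetical pixel values picked for the example.

```python
# Assumed kernel: centre weight 8, each of the 8 neighbours weight -1.
EDGE_KERNEL = [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1],
]

def response(patch, kernel=EDGE_KERNEL):
    """Multiply a 3x3 patch by the kernel cell by cell and sum the products."""
    return sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))

flat = [[50, 50, 50], [50, 50, 50], [50, 50, 50]]          # uniform region
bright_spot = [[50, 50, 50], [50, 200, 50], [50, 50, 50]]  # centre stands out
beside_edge = [[50, 50, 200], [50, 50, 200], [50, 50, 200]]  # bright column on the right

print(response(flat))         # 0
print(response(bright_spot))  # 1200
print(response(beside_edge))  # -450
```

The last value is negative: the sign says whether the centre is brighter or darker than its surroundings, and the magnitude says how strong the contrast is.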

📦 Pooling: Shrink and Keep What Matters

Pooling (usually max pooling) slides a small window across the feature map and keeps only the biggest value in each region. This shrinks the map while preserving the strongest detected patterns, which reduces computation and makes the network more tolerant of small shifts in position.

💡 One-line Summary: Pooling is like taking the most important highlight from each neighbourhood and throwing away the rest. You keep the signal, lose the noise.
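
Max pooling is simple enough to sketch directly. The version below assumes non-overlapping 2×2 windows, a common choice; the feature-map values are invented for the example.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool(feature_map))  # [[4, 2], [2, 8]]

# Same map, but the strong response (4) has moved one pixel up.
shifted = [
    [4, 3, 2, 0],
    [1, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool(shifted))  # [[4, 2], [2, 8]]
```

The 4×4 map shrinks to 2×2, and the small shift inside a window leaves the pooled result unchanged.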

🚀 Why CNNs Revolutionised Vision AI

Before 2012, the best image classifiers relied on hand-crafted features: experts wrote rules for what to look for. Then AlexNet, a deep CNN, brought the ImageNet top-5 error down to 16.4%, roughly 40% lower than the 28.2% of the best hand-crafted system in 2010. It learned its own features directly from 1.2 million labelled images. The lesson: given enough data and compute, let the network discover the features itself.

Everyday uses today

Photo tagging: your phone identifies faces and scenes automatically.
Medical imaging: detecting tumours in X-rays and disease in retinal scans, in some studies matching specialist accuracy on specific tasks.
Self-driving perception: recognising pedestrians, lanes, and traffic signs in real time.
Quality control: spotting defects on factory lines faster than human inspectors.
Content moderation: flagging violent or inappropriate imagery at platform scale.

Year	Model	ImageNet top-5 error
2010	Hand-crafted	28.2 %
2012	AlexNet (CNN)	16.4 %
2014	VGGNet	7.3 %
2015	ResNet	3.6 %
2017	SENet	2.25 %

Human-level error on this benchmark is roughly 5%.

✏️ Draw a Digit and Watch a Tiny Recogniser Guess (Interactive)

Paint a digit (0, 1, 3, or 7) on the grid, then hit Recognise. The recogniser compares your drawing to built-in templates using pixel-matching — the same intuition behind a real CNN, just much simpler.

⚠️ This is a simplified template matcher, not a real neural network. It counts how many pixels in your drawing overlap with each stored template and normalises the score. Real CNNs learn their own multi-layer features from millions of examples — but the core idea (compare learned patterns to the input) is the same.
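
A template matcher of this kind could look roughly like the sketch below. Everything here is an assumption for illustration: the 5×3 templates, the function names, and in particular the normalisation (overlap divided by the number of filled template pixels), since the exact rule used by the demo is not specified.

```python
# Hypothetical 5x3 templates: 1 = painted pixel, 0 = empty.
TEMPLATES = {
    "1": [[0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]],
}

def match_score(drawing, template):
    """Fraction of the template's painted pixels that are also painted in the drawing."""
    overlap = sum(
        d * t
        for d_row, t_row in zip(drawing, template)
        for d, t in zip(d_row, t_row)
    )
    template_pixels = sum(sum(row) for row in template)
    # Assumed normalisation: divide by the template's own pixel count.
    return overlap / template_pixels

def recognise(drawing, templates):
    """Return the label of the best-scoring template."""
    return max(templates, key=lambda label: match_score(drawing, templates[label]))

# A slightly sloppy "7" with one pixel missing from the stem.
drawing = [[1, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 1]]

print(recognise(drawing, TEMPLATES))                     # 7
print(round(match_score(drawing, TEMPLATES["7"]), 3))    # 0.857
print(match_score(drawing, TEMPLATES["1"]))              # 0.2
```

Unlike a CNN, the templates here are fixed by hand and must line up with the drawing pixel for pixel, which is why this approach breaks down as soon as the digit is shifted or drawn in a different style.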

🔬 Convolution Lab (Interactive)

Select a filter (kernel), apply it to the 8×8 test image, and see the output side-by-side. Brighter cells = stronger response. The 3×3 kernel values are shown so you can see exactly what the math does.

[Interactive demo panels: filter selector (Edge Detect shown), input image, 3×3 kernel, output after convolution]

Math: for each pixel, multiply the 3×3 neighbourhood by the kernel and sum the products. For edge/gradient kernels the raw sum can be negative, so the displayed value is its absolute value, scaled and clamped to 0–255. Borders use zero-padding.
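
The steps above can be sketched as follows. This is an approximation of the described math, not the demo's actual code: it takes the absolute value and clamps to 255, and it leaves out the scaling step because the scale factor is not stated. The edge-detect kernel values and the test images are assumptions for the example.

```python
def convolve_for_display(image, kernel):
    """3x3 convolution with zero-padding; output has the same size as the input."""
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for i in range(3):
                for j in range(3):
                    yy, xx = y + i - 1, x + j - 1
                    # Pixels outside the image count as 0 (zero-padding).
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[i][j] * image[yy][xx]
            # Assumed display rule: absolute value, then clamp to 0-255.
            row.append(min(255, abs(total)))
        output.append(row)
    return output

# Assumed edge-detect kernel: centre 8, neighbours -1.
edge_kernel = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]

# Hypothetical 5x5 image: black except for one grey pixel in the middle.
spot = [[0] * 5 for _ in range(5)]
spot[2][2] = 40
for row in convolve_for_display(spot, edge_kernel):
    print(row)
# [0, 0, 0, 0, 0]
# [0, 40, 40, 40, 0]
# [0, 40, 255, 40, 0]
# [0, 40, 40, 40, 0]
# [0, 0, 0, 0, 0]

# A uniform 3x3 image still lights up at the borders because of zero-padding.
flat = [[10] * 3 for _ in range(3)]
print(convolve_for_display(flat, edge_kernel))
# [[50, 30, 50], [30, 0, 30], [50, 30, 50]]
```

In the first image the centre's raw sum is 320, which is clamped to 255, and the neighbours' raw sums of -40 are shown as 40. The second image shows a side effect worth knowing about: padding with zeros makes the image border itself look like an edge.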