🍎 Data Is AI's Food

Imagine trying to learn a new language by reading only five sentences: you'd end up with a very shaky grasp. The same is true for machine learning. More relevant, high-quality examples generally produce a more accurate, more reliable model.

🍔 Analogy: If you feed a chef only burnt meals to taste, they'll learn to cook badly. Feed an AI bad data and it will confidently make bad predictions. This is the famous "garbage in, garbage out" principle.

Two dimensions matter most:

📖 Key Vocabulary: Plain English

Before going further, let's pin down the words you'll hear constantly in AI/ML:

Dataset

The whole collection of examples — like a spreadsheet with many rows.

Sample / Row

One single example in the dataset (one animal, one transaction, one email…).

Feature

A measurable input property of each example — one column used to make a prediction (weight, colour, age…).

Label

The "right answer" column we want to predict (cat vs dog, spam vs ham, house price…).

Training Set

The portion of data the model learns from — it sees the features and the labels.

Test Set

A held-out portion the model never sees during training — used to measure real-world performance honestly.

💡 Rule of thumb: A common split is 80% training / 20% test. Never let the model see test data during training. That leaks the answers and gives you inflated scores that collapse in production.
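The 80/20 split can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the helper name `split_train_test`, its `test_fraction` and `seed` parameters, and the ten placeholder rows are all assumptions made for the example.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction as the test set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test rows become the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Ten placeholder rows (just their indices) stand in for real samples.
rows = list(range(10))
train_rows, test_rows = split_train_test(rows)

print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Shuffling before splitting matters when the data is stored in some order (for example, all cats first). Without it, the test set could contain only one class.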

Dataset anatomy at a glance

Sample #   weight_kg (feature)   has_whiskers (feature)   bark_or_meow (feature)   animal (label)
1          4.2                   yes                      meow                     cat
2          22.0                  no                       bark                     dog
3          3.8                   yes                      meow                     cat
4          15.5                  no                       bark                     dog
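In code, the same four-row dataset might look like the sketch below. Storing each sample as a dict is only one possible choice, and the names `dataset`, `LABEL_COLUMN`, `features` and `labels` are illustrative.

```python
# Each dict is one sample (row); each key is one column.
dataset = [
    {"weight_kg": 4.2,  "has_whiskers": "yes", "bark_or_meow": "meow", "animal": "cat"},
    {"weight_kg": 22.0, "has_whiskers": "no",  "bark_or_meow": "bark", "animal": "dog"},
    {"weight_kg": 3.8,  "has_whiskers": "yes", "bark_or_meow": "meow", "animal": "cat"},
    {"weight_kg": 15.5, "has_whiskers": "no",  "bark_or_meow": "bark", "animal": "dog"},
]

LABEL_COLUMN = "animal"  # the "right answer" we want to predict

# Features: every column except the label.
features = [{k: v for k, v in row.items() if k != LABEL_COLUMN} for row in dataset]
# Labels: only the label column, in the same row order.
labels = [row[LABEL_COLUMN] for row in dataset]

print(features[0])
print(labels)
```

Keeping features and labels in the same row order is what lets a model pair each input with its correct answer.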

🔬 Feature Engineering: Choosing the Right Clues

An algorithm is just a recipe. The ingredients you hand it — the features — determine the quality of the dish. A mediocre algorithm with excellent features will usually beat a fancy algorithm fed useless ones.

🍊 Analogy (Apple vs Orange): Suppose you want to classify fruit. Colour is a great feature: apples are usually red or green, oranges are orange. "The day of the week you bought it" is a terrible feature, because it has nothing to do with what the fruit is. Picking good features is often the bulk of the job.

Practical heuristics:

⚖️ Normalisation: Stop Big Numbers Bullying Small Ones

Imagine two features: age (range 0–100) and annual salary (range 0–200,000). Salary values can be up to 2,000× larger than age values. Many algorithms, especially those that measure distances between samples or add up weighted feature values, will effectively ignore age because salary dominates.

📏 Analogy: It's like comparing a mountain (salary) to a pebble (age) when deciding which is "more important". Normalisation puts everything on the same ruler (say, 0 to 1) so every feature gets a fair vote.
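To see the "bullying" in numbers, here is a small sketch using straight-line (Euclidean) distance. The three people and their values are invented for illustration. The `scale` helper simply divides by the ranges quoted above (0–100 for age, 0–200,000 for salary).

```python
import math

# Hypothetical people as (age in years, annual salary).
alice = (25, 50_000)
bob = (65, 50_500)     # 40 years older than alice, salary almost identical
carol = (26, 52_000)   # one year older than alice, salary 2,000 higher

# Raw distances: the salary gap swamps the age gap.
raw_alice_bob = math.dist(alice, bob)        # about 501.6
raw_alice_carol = math.dist(alice, carol)    # about 2000.0

def scale(person):
    """Rescale to 0-1 using the assumed ranges 0-100 and 0-200,000."""
    age, salary = person
    return (age / 100, salary / 200_000)

# Scaled distances: now the 40-year age gap is what stands out.
scaled_alice_bob = math.dist(scale(alice), scale(bob))
scaled_alice_carol = math.dist(scale(alice), scale(carol))

print(f"raw:    alice-bob {raw_alice_bob:.1f}, alice-carol {raw_alice_carol:.1f}")
print(f"scaled: alice-bob {scaled_alice_bob:.3f}, alice-carol {scaled_alice_carol:.3f}")
```

On the raw numbers, bob looks like alice's nearest neighbour despite the 40-year age gap. After rescaling, carol does.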

Min-max normalisation is the simplest approach:

normalised = (value − min) / (max − min)

Result: every feature lives in [0, 1], regardless of its original scale. Other methods include z-score standardisation (mean = 0, std = 1), but the principle is the same — bring features to a comparable scale.
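Both methods fit in a few lines of standard-library Python. This is a sketch: the function names `min_max` and `z_score` and the sample values are illustrative, and returning zeros when every value is identical is just one reasonable way to avoid dividing by zero.

```python
import statistics

def min_max(values):
    """Rescale values to [0, 1] with (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values equal: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

ages = [0, 25, 50, 100]
salaries = [0, 50_000, 100_000, 200_000]

print(min_max(ages))       # [0.0, 0.25, 0.5, 1.0]
print(min_max(salaries))   # [0.0, 0.25, 0.5, 1.0]
print(z_score(ages))
```

After min-max scaling the two columns are identical, even though the raw salaries were 2,000× larger. In practice the min, max, mean and standard deviation should be computed from the training set only and then reused on the test set, so no information from the test data leaks into training.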

⚠️ Bias in Data: A Quick Warning

If your training data over-represents certain groups or under-represents others, the AI will inherit those skews as if they were ground truth.

⚠️ Watch out: A face-recognition model trained mostly on one demographic will perform poorly, or unfairly, on others. A loan-approval model trained on historical decisions can perpetuate past discrimination. Skewed data → skewed (and sometimes harmful) AI.

The fix isn't purely technical — it requires careful data collection, diversity audits, and human oversight. We'll do a deep-dive on AI fairness and bias in Lesson 10. For now, just remember: the world your data describes is the world your model will assume is normal.
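One simple first check is to count how much of the dataset each group makes up. The sketch below assumes each sample carries a group tag. The group names "A" and "B", the 9-to-1 split, and the 20% warning threshold are all made up for illustration, not recommended values.

```python
from collections import Counter

def group_shares(groups):
    """Return each group's share of the dataset as a fraction of all samples."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical group tag for each of ten training samples.
groups = ["A"] * 9 + ["B"] * 1

shares = group_shares(groups)
# Flag any group below an arbitrary, illustrative 20% share.
under_represented = [g for g, share in shares.items() if share < 0.2]

print(shares)
print("Under-represented:", under_represented)
```

A check like this only reveals imbalance in the groups you thought to tag. It does not replace the careful collection and human oversight described above.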

Feature Detective 🔍 Playground

Which single feature best separates cats from dogs? Pick one from the dropdown, and see how well it draws a boundary between the two classes on a 1-D number line.

Columns: Animal, weight_kg, ear_pointiness (0–10), bark_or_meow (0 = meow, 1 = bark), whisker_len_cm, day_bought (1–7), Label

Each dot is one animal, coloured by class (dog or cat). The dashed line marks the best threshold found.
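The "best threshold" search can be sketched as follows. This is one plausible reading of what the playground does, not its actual code. The weights come from the four-row table earlier in the lesson. The `day_bought` values are invented for illustration.

```python
def best_threshold(values, labels, positive):
    """Find the single cut on one feature that best separates two classes.

    Returns (threshold, accuracy). Candidate cuts sit halfway between
    neighbouring distinct values on the number line.
    """
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = (None, 0.0)
    for t in candidates:
        # Rule: predict `positive` when the value is above the cut.
        hits = sum((v > t) == (lab == positive) for v, lab in zip(values, labels))
        # The mirrored rule (positive below the cut) scores the remainder.
        accuracy = max(hits, len(values) - hits) / len(values)
        if accuracy > best[1]:
            best = (t, accuracy)
    return best

labels = ["cat", "dog", "cat", "dog"]
weight_kg = [4.2, 22.0, 3.8, 15.5]   # from the dataset table
day_bought = [2, 5, 5, 2]            # invented values, unrelated to species

weight_cut, weight_acc = best_threshold(weight_kg, labels, positive="dog")
day_cut, day_acc = best_threshold(day_bought, labels, positive="dog")

print(f"weight_kg:  cut at {weight_cut:.2f}, accuracy {weight_acc:.0%}")
print(f"day_bought: cut at {day_cut:.2f}, accuracy {day_acc:.0%}")
```

On this toy data, a cut on weight separates every cat from every dog, while the best cut on `day_bought` gets only half the animals right, no better than a coin flip.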

Normaliser ⚖️ Playground

Drag the sliders to set raw feature values. Watch how the raw bars (wildly different scales) compare to the normalised bars — all squeezed into 0–1. This is why models often need normalisation.

Raw values (original scale)

Min-max normalised (0 – 1)

Without normalisation, salary dwarfs age and height — the model would treat salary as if it's thousands of times more "important" just because of its scale. After normalisation, all three features compete fairly.