Data & Features in AI — Interactive Guide

🍎Data Is AI's Food

Imagine trying to learn a new language by reading only five sentences — you'd end up with a very shaky grasp. The same is true for machine learning: more relevant, high-quality examples → a smarter, more reliable model.

🍔 Analogy: If you feed a chef only burnt meals to taste, they'll learn to cook badly. Feed an AI bad data and it will confidently make bad predictions. This is the famous "Garbage in, garbage out" principle.

Two dimensions matter most:

Quantity: more training examples generally let the model see more patterns and generalise better, provided the examples are relevant.
Quality: accurate, unbiased, relevant data usually beats a mountain of noisy, mislabelled, or skewed data.

📖Key Vocabulary — Plain English

Before going further, let's pin down the words you'll hear constantly in AI/ML:

Dataset

The whole collection of examples — like a spreadsheet with many rows.

Sample / Row

One single example in the dataset (one animal, one transaction, one email…).

Feature

A measurable input property of each example — one column used to make a prediction (weight, colour, age…).

Label

The "right answer" column we want to predict (cat vs dog, spam vs ham, house price…).

Training Set

The portion of data the model learns from — it sees the features and the labels.

Test Set

A held-out portion the model never sees during training — used to measure real-world performance honestly.

💡 Rule of thumb: A common split is 80% training / 20% test. Never let the model peek at test data during training: that's cheating, and it gives you inflated scores that collapse in production.
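The 80/20 rule of thumb can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the `seed` parameter, and the ten numbered stand-in samples are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in data: ten numbered samples.
samples = list(range(10))
train, test = train_test_split(samples)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters when the rows are stored in some order (for example, all cats first), because an unshuffled split could leave one class out of the test set entirely.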

Dataset anatomy at a glance

Sample #	weight_kg (feature)	has_whiskers (feature)	bark_or_meow (feature)	animal (label)
1	4.2	yes	meow	cat
2	22.0	no	bark	dog
3	3.8	yes	meow	cat
4	15.5	no	bark	dog
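To connect the vocabulary to code, here is one possible way to hold the table above in Python and separate features from the label. The names `dataset`, `FEATURES`, `LABEL`, `X`, and `y` are conventions chosen for this sketch, not something the guide prescribes.

```python
# The four samples from the table, one dict per row.
dataset = [
    {"weight_kg": 4.2,  "has_whiskers": "yes", "bark_or_meow": "meow", "animal": "cat"},
    {"weight_kg": 22.0, "has_whiskers": "no",  "bark_or_meow": "bark", "animal": "dog"},
    {"weight_kg": 3.8,  "has_whiskers": "yes", "bark_or_meow": "meow", "animal": "cat"},
    {"weight_kg": 15.5, "has_whiskers": "no",  "bark_or_meow": "bark", "animal": "dog"},
]

FEATURES = ["weight_kg", "has_whiskers", "bark_or_meow"]  # input columns
LABEL = "animal"                                          # the "right answer" column

# X holds the feature values for each sample; y holds the matching labels.
X = [[row[name] for name in FEATURES] for row in dataset]
y = [row[LABEL] for row in dataset]

print(X[0])  # [4.2, 'yes', 'meow']
print(y)     # ['cat', 'dog', 'cat', 'dog']
```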

🔬Feature Engineering — Choosing the Right Clues

An algorithm is just a recipe. The ingredients you hand it — the features — determine the quality of the dish. A mediocre algorithm with excellent features will usually beat a fancy algorithm fed useless ones.

🍊 Analogy: Apple vs Orange. Suppose you want to classify fruit. Colour is a great feature: apples are typically red or green, oranges are orange. "The day of the week you bought it" is a terrible feature, because it has nothing to do with what the fruit is. Picking good features is often the largest part of the job.

Practical heuristics:

Domain knowledge is king — talk to an expert about what actually drives the outcome.
Correlation check — does changing this feature value actually change the label in a predictable direction?
Irrelevant features add noise — they can even hurt accuracy by giving the model random patterns to overfit on.
Derived features can be powerful: body-mass index, computed from height and weight, can be more informative than either alone.
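Two of the heuristics above lend themselves to a short sketch: deriving a feature and running a rough correlation check. The helper names `add_bmi` and `pearson` and the two example people are hypothetical. The correlation uses the four-row cat/dog table from earlier, which is far too small for a real conclusion and serves only to show the mechanics.

```python
import math

def add_bmi(person):
    """Return a copy of person with a derived 'bmi' feature (kg / m^2)."""
    height_m = person["height_cm"] / 100
    return {**person, "bmi": person["weight_kg"] / height_m ** 2}

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # undefined if either list is constant

# Hypothetical people: same weight, different height, so different BMI.
people = [{"height_cm": 160, "weight_kg": 80}, {"height_cm": 190, "weight_kg": 80}]
bmis = [round(add_bmi(p)["bmi"], 1) for p in people]
print(bmis)  # [31.2, 22.2]

# Correlation check on the four-row cat/dog table (1 = dog, 0 = cat).
weight_kg = [4.2, 22.0, 3.8, 15.5]
is_dog = [0, 1, 0, 1]
print(pearson(weight_kg, is_dog) > 0.9)  # True
```

A correlation near zero does not prove a feature is useless (the relationship may be non-linear), so treat this as a first screen rather than a verdict.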

⚖️Normalisation — Stop Big Numbers Bullying Small Ones

Imagine two features: age (range 0–100) and annual salary (range 0–200,000). Salary's range is 2,000× wider than age's. Many algorithms, especially those that measure distances or sum weighted inputs, can effectively ignore age because salary dominates.

📏 Analogy: It's like comparing a mountain (salary) to a pebble (age) when deciding which is "more important". Normalisation puts everything on the same ruler (say, 0 to 1) so every feature gets a fair vote.
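A quick way to see salary drowning out age is to compute a plain Euclidean distance on raw values. The three people below are invented for illustration, and `euclidean` is a helper written for this sketch.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical people as (age, salary) pairs.
alice = (25, 50_000)
bob = (65, 50_500)    # 40 years older, salary only 500 higher
carol = (25, 52_000)  # same age, salary 2,000 higher

print(round(euclidean(alice, bob), 1))  # 501.6
print(euclidean(alice, carol))          # 2000.0
```

On raw values, Bob looks about four times "closer" to Alice than Carol does, even though he is 40 years older. The 40-year gap contributes almost nothing next to a few hundred units of salary.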

Min-max normalisation is the simplest approach:

normalised = (value − min) / (max − min)

Result: every feature lands in [0, 1] on the data used to compute min and max, regardless of its original scale. Other methods include z-score standardisation (rescaling to mean 0 and standard deviation 1); the principle is the same: bring features to a comparable scale.
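Both methods fit in a few lines. The sketch below applies them to one column of values at a time. The function names, the sample columns, and the choice to map a constant column to 0.0 are assumptions of this example; the population standard deviation is used for the z-score.

```python
import statistics

def min_max(values):
    """Rescale values to [0, 1] using (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant column maps to 0.0
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # assumption: a constant column maps to 0.0
    return [(v - mean) / std for v in values]

# Hypothetical columns on very different scales.
ages = [20, 35, 50, 80]
salaries = [20_000, 55_000, 90_000, 160_000]

print(min_max(ages))      # [0.0, 0.25, 0.5, 1.0]
print(min_max(salaries))  # [0.0, 0.25, 0.5, 1.0]
```

In practice the min, max, mean, and standard deviation are usually computed on the training set only and then reused for the test set, so that no information from the test data leaks into training.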

⚠️Bias in Data — A Quick Warning

If your training data over-represents certain groups or under-represents others, the AI will inherit those skews as if they were ground truth.

⚠️ Watch out: A face-recognition model trained mostly on one demographic is likely to perform worse, or unfairly, on others. A loan-approval model trained on historical decisions can perpetuate past discrimination. Skewed data → skewed (and sometimes harmful) AI.

The fix isn't purely technical — it requires careful data collection, diversity audits, and human oversight. We'll do a deep-dive on AI fairness and bias in Lesson 10. For now, just remember: the world your data describes is the world your model will assume is normal.
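One small, technical first step in an audit is simply counting how each group is represented in the training data. The sketch below is only that first step. The `group` column, the 90/10 toy data, and the `MIN_SHARE` cut-off are hypothetical, and a balanced count does not by itself make a dataset fair.

```python
from collections import Counter

def group_shares(rows, key):
    """Return each group's share of the rows, as a fraction of the total."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical training rows with a placeholder "group" column.
rows = [{"group": "A"}] * 90 + [{"group": "B"}] * 10

shares = group_shares(rows, "group")
print(shares)  # {'A': 0.9, 'B': 0.1}

MIN_SHARE = 0.2  # hypothetical cut-off; the right value depends on the problem
flagged = [group for group, share in shares.items() if share < MIN_SHARE]
print(flagged)  # ['B']
```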

Feature Detective 🔍 Playground

Which single feature best separates cats from dogs? Pick one from the dropdown, and see how well it draws a boundary between the two classes on a 1-D number line.

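[Interactive widget: a table of animals with columns weight_kg, ear_pointiness (0–10), bark_or_meow (0 = meow, 1 = bark), whisker_len_cm, day_bought (1–7), and Label; a dropdown for choosing the feature to inspect; and a 1-D plot where each dot is one animal (dogs and cats marked separately) and a dashed line marks the best threshold found.]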
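A single-feature threshold search like the one the playground describes could look roughly like this. It is a sketch under assumptions, not necessarily how the playground computes its dashed line: it tries the midpoint between each pair of neighbouring values and keeps the cut with the highest accuracy. The `weight_kg` values come from the dataset table earlier in this guide; the `day_bought` values are invented.

```python
def best_threshold(values, labels, positive):
    """Return (threshold, accuracy, direction) for the best single cut.

    direction is '>' if values above the threshold are predicted positive,
    '<=' if values at or below it are.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values to place a threshold")
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate cut halfway between neighbours
        hits = sum((v > t) == (lab == positive) for v, lab in zip(values, labels))
        acc = hits / len(values)
        # The flipped rule gets every sample the opposite way round.
        for direction, score in ((">", acc), ("<=", 1 - acc)):
            if best is None or score > best[1]:
                best = (t, score, direction)
    return best

animal = ["cat", "dog", "cat", "dog"]
weight_kg = [4.2, 22.0, 3.8, 15.5]  # from the dataset table
day_bought = [3, 3, 5, 5]           # hypothetical values for an irrelevant feature

print(best_threshold(weight_kg, animal, "dog")[1])   # 1.0
print(best_threshold(day_bought, animal, "dog")[1])  # 0.5
```

On these four rows, weight separates the classes perfectly while the purchase day does no better than a coin flip, which is the contrast the playground is meant to show.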

Normaliser ⚖️ Playground

Drag the sliders to set raw feature values. Watch how the raw bars (wildly different scales) compare to the normalised bars — all squeezed into 0–1. This is why models often need normalisation.

[Interactive widget: sliders for Age (years, shown at 35), Salary ($, shown at 55,000), and Height (cm, shown at 170), with two bar charts: "Raw values (original scale)" and "Min-max normalised (0–1)".]

Without normalisation, salary dwarfs age and height — the model would treat salary as if it's thousands of times more "important" just because of its scale. After normalisation, all three features compete fairly.