🍎Data Is AI's Food
Imagine trying to learn a new language by reading only five sentences. You'd end up with a very shaky grasp. Machine learning works the same way: more relevant, high-quality examples generally produce a smarter, more reliable model.
Two dimensions matter most:
- Quantity: generally, more training examples let the model see more patterns and generalise better.
- Quality: a smaller set of accurate, unbiased, relevant data usually beats a mountain of noisy, mislabelled, or skewed data.
📖Key Vocabulary — Plain English
Before going further, let's pin down the words you'll hear constantly in AI/ML:
- Dataset: the whole collection of examples, like a spreadsheet with many rows.
- Sample: one single example in the dataset (one animal, one transaction, one email…).
- Feature: a measurable input property of each example, i.e. one column used to make a prediction (weight, colour, age…).
- Label: the "right answer" column we want to predict (cat vs dog, spam vs ham, house price…).
- Training set: the portion of data the model learns from. It sees both the features and the labels.
- Test set: a held-out portion the model never sees during training, used to measure real-world performance honestly.
Dataset anatomy at a glance
| Sample # | weight_kg (feature) | has_whiskers (feature) | bark_or_meow (feature) | animal (label) |
|---|---|---|---|---|
| 1 | 4.2 | yes | meow | cat |
| 2 | 22.0 | no | bark | dog |
| 3 | 3.8 | yes | meow | cat |
| 4 | 15.5 | no | bark | dog |
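To make the vocabulary concrete, here is a minimal Python sketch of the same four-row table. The helper name `split_features_and_label` and the choice to hold out the last row as the test set are assumptions made for illustration, not a recommended split strategy.

```python
# The four samples from the table above, one dict per row.
dataset = [
    {"weight_kg": 4.2, "has_whiskers": "yes", "bark_or_meow": "meow", "animal": "cat"},
    {"weight_kg": 22.0, "has_whiskers": "no", "bark_or_meow": "bark", "animal": "dog"},
    {"weight_kg": 3.8, "has_whiskers": "yes", "bark_or_meow": "meow", "animal": "cat"},
    {"weight_kg": 15.5, "has_whiskers": "no", "bark_or_meow": "bark", "animal": "dog"},
]

LABEL_COLUMN = "animal"


def split_features_and_label(sample):
    """Return (features, label) for one sample (hypothetical helper)."""
    features = {name: value for name, value in sample.items() if name != LABEL_COLUMN}
    return features, sample[LABEL_COLUMN]


# Hold out the last sample as a tiny test set. This is an arbitrary choice for
# illustration; real projects usually shuffle first and hold out a larger share.
training_set = dataset[:3]
test_set = dataset[3:]

features, label = split_features_and_label(dataset[0])
print("features:", features)
print("label:", label)
print("training samples:", len(training_set), "test samples:", len(test_set))
```

During training the model would see both `features` and `label` for every row in `training_set`; at evaluation time it would be given only the features of each row in `test_set`.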
🔬Feature Engineering — Choosing the Right Clues
An algorithm is just a recipe. The ingredients you hand it — the features — determine the quality of the dish. A mediocre algorithm with excellent features will usually beat a fancy algorithm fed useless ones.
Practical heuristics:
- Domain knowledge is king — talk to an expert about what actually drives the outcome.
- Correlation check — does changing this feature value actually change the label in a predictable direction?
- Irrelevant features add noise — they can even hurt accuracy by giving the model random patterns to overfit on.
- Derived features can be powerful — e.g., "body-mass index" derived from height and weight tells you more than either alone.
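As a rough illustration of the derived-feature idea, the sketch below computes body-mass index from weight and height using the standard formula (kilograms divided by metres squared). The two people are invented for this example, and `body_mass_index` is a hypothetical helper name.

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in metres, squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / (height_m ** 2)


# Hypothetical samples, invented for illustration only.
people = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 70.0, "height_m": 1.60},
]

for person in people:
    # Add the derived feature as a new column next to the raw ones.
    person["bmi"] = round(body_mass_index(person["weight_kg"], person["height_m"]), 1)
    print(person)
```

Both people have the same weight, so weight alone cannot tell them apart; the derived column combines the two raw features into a single number that differs between them.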
⚖️Normalisation — Stop Big Numbers Bullying Small Ones
Imagine two features: age (range 0–100) and annual salary (range 0–200,000). Salary values can be up to 2,000 times larger than age values. Many algorithms, especially those that measure distances between samples or compute weighted sums of features, can effectively ignore age because salary dominates.
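To see the scale problem in numbers, here is a small sketch that measures the Euclidean distance between raw (age, salary) pairs. The three people and their values are hypothetical, chosen only to show the effect.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (age in years, annual salary), hypothetical values.
alice = (25, 50_000)
bob = (65, 50_500)    # 40 years older than Alice, salary nearly identical
carol = (26, 52_000)  # 1 year older than Alice, salary 2,000 higher

print("alice-bob:  ", round(euclidean_distance(alice, bob), 1))
print("alice-carol:", round(euclidean_distance(alice, carol), 1))
```

On raw values, Bob comes out closer to Alice than Carol does, even though he is 40 years older: the 40-year age gap contributes far less to the distance than a 2,000 difference in salary.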
Min-max normalisation is the simplest approach:
normalised = (value − min) / (max − min)
Result: every feature lives in [0, 1], regardless of its original scale. Other methods include z-score standardisation (mean = 0, std = 1), but the principle is the same — bring features to a comparable scale.
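The two methods mentioned above can be sketched in plain Python as follows. The function names and the sample values are assumptions for illustration, and the z-score version uses the population standard deviation (`statistics.pstdev`), which is one common convention among several.

```python
import statistics


def min_max_normalise(values):
    """Rescale values to [0, 1] using (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardise(values):
    """Shift and scale values so that mean = 0 and (population) std = 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical feature columns, invented for illustration.
ages = [20, 35, 50, 80]
salaries = [20_000, 45_000, 90_000, 200_000]

print(min_max_normalise(ages))
print([round(v, 3) for v in min_max_normalise(salaries)])
print([round(v, 3) for v in z_score_standardise(ages)])
```

In practice the minimum, maximum, mean, and standard deviation are typically computed on the training set only and then reused to transform the test set, so that no information from the held-out data leaks into training.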
⚠️Bias in Data — A Quick Warning
If your training data over-represents certain groups or under-represents others, the AI will inherit those skews as if they were ground truth.
The fix isn't purely technical — it requires careful data collection, diversity audits, and human oversight. We'll do a deep-dive on AI fairness and bias in Lesson 10. For now, just remember: the world your data describes is the world your model will assume is normal.
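One very simple first step in an audit is to count how each group is represented in the training data. The sketch below assumes a hypothetical `region` column and invented samples; a real audit would look at far more than raw shares.

```python
from collections import Counter


def group_shares(samples, group_key):
    """Return the fraction of samples that fall into each group."""
    counts = Counter(sample[group_key] for sample in samples)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Hypothetical training data: 8 samples from one region, 2 from another.
training_samples = [{"region": "north"}] * 8 + [{"region": "south"}] * 2

shares = group_shares(training_samples, "region")
for group, share in shares.items():
    print(f"{group}: {share:.0%}")
```

A lopsided split like this does not prove the model will be unfair, but it is a cheap signal that one group may be under-represented and deserves a closer look.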
Which single feature best separates cats from dogs? Pick one from the dropdown, and see how well it draws a boundary between the two classes on a 1-D number line.
Candidate features for each animal: weight_kg, ear_pointiness (0–10), bark_or_meow (0 = meow, 1 = bark), whisker_len_cm, and day_bought (1–7). Each animal also has a Label (cat or dog).
Each dot is one animal, coloured by class (dog or cat). The dashed line marks the best threshold found.
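One plausible way to find such a threshold on a single feature is to try the midpoint between each pair of neighbouring values and keep the one that classifies the most animals correctly. This is a sketch under that assumption, not necessarily how the demo computes it. It reuses the four weight_kg values from the dataset anatomy table earlier in the lesson, and `best_threshold` is a hypothetical helper.

```python
def best_threshold(values, labels, positive="dog"):
    """Search midpoints between sorted neighbouring values.

    The rule tested is: predict `positive` when value > threshold.
    Returns (threshold, accuracy) for the best midpoint found, or
    (None, 0.0) if there are fewer than two distinct values.
    """
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = (None, 0.0)
    for threshold in candidates:
        correct = sum(
            (value > threshold) == (label == positive)
            for value, label in zip(values, labels)
        )
        accuracy = correct / len(values)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# weight_kg and animal columns from the four-row table earlier in the lesson.
weights = [4.2, 22.0, 3.8, 15.5]
animals = ["cat", "dog", "cat", "dog"]

threshold, accuracy = best_threshold(weights, animals)
print("threshold:", threshold, "accuracy:", accuracy)
```

With only four samples the result says little about real cats and dogs; the point is the mechanics of drawing one boundary on a 1-D number line. A feature such as day_bought would be expected to give no clean split, which is what makes it a poor clue.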
Drag the sliders to set raw feature values. Watch how the raw bars (wildly different scales) compare to the normalised bars — all squeezed into 0–1. This is why models often need normalisation.
[Interactive chart with two panels: "Raw values (original scale)" and "Min-max normalised (0 – 1)"]
Without normalisation, salary dwarfs age and height — the model would treat salary as if it's thousands of times more "important" just because of its scale. After normalisation, all three features compete fairly.