
Treat data splitting like a workflow decision, not vocabulary.

This route turns train, validation, and test sets into a concrete sequence: split the data, fit a boundary, drag examples around, compare models honestly, and resist the urge to optimize against the final score.

Drag the training examples. Watch the boundary and tables react. Feel why the test set stays untouched.
How to use this page

Jump to the model scene first if you want a hands-on entry point. Drag a pet, switch the active features, then keep scrolling so the validation and test tables explain why that interaction belongs in the middle of the workflow rather than at the end.


Train, Test, and Validation Sets

Interactive walkthrough


The Importance of Data Splitting

In most supervised machine learning tasks, best practice recommends splitting your data into three independent sets: a training set, a validation set, and a test set.

To learn why, let's pretend that we have a dataset of two types of pets: cats and dogs.

Each pet in our dataset has two features: weight and fluffiness.

Our goal is to identify and evaluate suitable models for classifying a given pet as either a cat or a dog. We'll use train, test, and validation splits to do this.

Train, Test, and Validation Splits

The first step in our classification task is to randomly split our pets into three independent sets:

Training Set: The dataset that we feed our model to learn potential underlying patterns and relationships.

Validation Set: The dataset that we use to compare performance across different model types and hyperparameter choices.

Test Set: The dataset that we use to approximate our model's unbiased accuracy in the wild.
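
In code, the split itself is typically just two calls to a splitting utility. Below is a minimal sketch using scikit-learn's train_test_split; the toy pets DataFrame, its column names, and the 60/20/20 proportions are illustrative assumptions, not the data behind the interactive scene.

```python
# A minimal three-way split sketch. The toy data and the 60/20/20
# proportions are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split

pets = pd.DataFrame({
    "weight":     [4.1, 30.2, 3.8, 25.0, 5.5, 18.3, 4.9, 35.1],  # kg
    "fluffiness": [0.9, 0.3, 0.8, 0.4, 0.95, 0.2, 0.7, 0.5],     # 0..1 score
    "label":      ["cat", "dog", "cat", "dog", "cat", "dog", "cat", "dog"],
})

# Carve off the test set first, then split the remainder into train/validation.
trainval, test = train_test_split(pets, test_size=0.2, random_state=0,
                                  stratify=pets["label"])
train, val = train_test_split(trainval, test_size=0.25, random_state=0,
                              stratify=trainval["label"])
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split.
```

Carving the test set off first makes it harder to accidentally reuse those rows while comparing models later.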

The Training Set

The training set is the dataset that we use to train our model. It is this dataset that our model uses to learn any underlying patterns or relationships that will enable it to make predictions later on.

The training set should be as representative as possible of the population that we are trying to model. Additionally, we need to be careful and ensure that it is as unbiased as possible, as any bias at this stage may be propagated downstream during inference.

Building Our Model

Our goal (to determine whether a given pet is a cat or a dog) is a binary classification task, so we will use a simple but effective model appropriate for this task: logistic regression.

Logistic regression will learn a decision boundary to best separate the cats from dogs in our training data, using the selected feature (None, Weight, Fluffiness, or both Weight and Fluffiness).

Select the feature to visualize the corresponding logistic regression model's decision boundary. Drag each animal in the training set to a new position to see how the boundary updates.
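
If you prefer code to dragging, a fit along these lines captures what the scene does. This is a sketch under the assumptions of the earlier snippet (the train split and its column names), not the page's actual implementation.

```python
# Fit the scene's model: logistic regression on the selected features.
# Assumes `train` from the split sketch above.
from sklearn.linear_model import LogisticRegression

features = ["weight", "fluffiness"]  # swap in ["weight"] or ["fluffiness"] to mimic the toggles
model = LogisticRegression()
model.fit(train[features], train["label"])

# The decision boundary is where the model is exactly 50/50, i.e. where
# coef_ . x + intercept_ = 0; with both features it is a line in the
# weight-fluffiness plane.
print(model.coef_, model.intercept_)
```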

The Validation Set

We can build four different logistic regression models (one for each feature combination), so how do we decide which model to select?

We could compare the accuracy of each model on the training set, but if we use the exact same dataset for both training and model selection, we risk choosing a model that overfits the training data and won't generalize well.

This is where the validation set comes in: it acts as an independent, unbiased dataset for comparing the performance of the different algorithms trained on our training set.

Select a feature to view the model's performance on the validation set in the table below. Drag the pets across the line to see how the model feedback updates.
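
Concretely, model selection with the validation set can be a short loop: fit every candidate on the training set, score each one on the validation set, and keep the best. The sketch below continues the hypothetical snippets above; the "no features" model is approximated with a majority-class baseline.

```python
# Compare the four candidates on the validation set only.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

candidates = {
    "none":       (DummyClassifier(strategy="most_frequent"), ["weight"]),  # features are ignored
    "weight":     (LogisticRegression(), ["weight"]),
    "fluffiness": (LogisticRegression(), ["fluffiness"]),
    "both":       (LogisticRegression(), ["weight", "fluffiness"]),
}

best_name, best_acc = None, -1.0
for name, (model, feats) in candidates.items():
    model.fit(train[feats], train["label"])
    acc = accuracy_score(val["label"], model.predict(val[feats]))
    print(f"{name:>10}: validation accuracy = {acc:.2f}")
    if acc > best_acc:
        best_name, best_acc = name, acc
# The test set is never touched anywhere in this loop.
```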

The Testing Set

Once we have used the validation set to determine the algorithm and parameter choices that we would like to use in production, the test set is used to approximate the model's true performance in the wild. It is the final step in evaluating our model's performance on unseen data.

We should never, under any circumstance, look at the test set's performance before selecting a model.

Peeking at our test set performance ahead of time is a form of overfitting, and will likely lead to unreliable performance expectations in production. It should only be checked as the final form of evaluation, after the validation set has been used to identify the best model.
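
Continuing the hypothetical sketch, the test set enters exactly once, after the validation set has picked the winner, and only for a read-only measurement.

```python
# One final evaluation on the held-out test set.
from sklearn.metrics import accuracy_score

final_model, final_feats = candidates[best_name]  # already fitted above
test_acc = accuracy_score(test["label"], final_model.predict(test[final_feats]))
print(f"Estimated accuracy in the wild: {test_acc:.1%}")
# Whatever this number turns out to be, we report it; we do not go back
# and re-tune against it.
```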

Summary

You may have noticed that the test accuracy of the "just fluffiness" model was higher than that of the "both features" model, even though the validation set selected the latter as the best. Validation performance will not always match test performance exactly, and that is not a bad thing. Remember that the test performance is not a number to optimize over; it is a metric for assessing future performance. It allows us to estimate, with confidence, that our model can distinguish between cats and dogs with 87.5% accuracy.

Key Takeaways

It is best practice in machine learning to split our data into the following three groups:

Training Set: For training of the model.

Validation Set: For unbiased model comparison and hyperparameter tuning.

Test Set: For final evaluation of the model.

Adhering to this setup helps ensure that we have a realistic understanding of our model's performance, and that we (hopefully) built a model that generalizes well to unseen data.


Engineering note

Validation is not a consolation prize for the training set. It is the buffer that lets you compare models and hyperparameters without teaching yourself to overtrust the final test number.


🐾

Animal icons by Adrien Coquet (cat) and Mauricio Brito (dog).