Engineering Sandbox rollout · data essays
This route gets better once you stop thinking about dimensionality reduction as a purely symbolic trick. Drag the points, rotate the view, and watch which directions keep the variation that matters before you move into the higher-dimensional examples.
Start in 2D and physically move the points until the first principal component feels inevitable. Then rotate the 3D cloud and only afterwards jump into the UK food example, where the meaning comes from the separation you have already learned to see.
Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easier to explore and visualize.
First, consider a dataset in only two dimensions, like (height, weight). This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value. The axes don't actually mean anything physical; they're linear combinations of height and weight, called principal components, chosen so that the first axis captures as much of the variation as possible.
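For readers who want to see the arithmetic behind the new coordinate system, here is a minimal sketch in Python with NumPy. It assumes the usual covariance-eigenvector formulation of PCA; the function name `pca_2d` and the (height, weight) numbers are made up for illustration and are not taken from the visualization.

```python
import numpy as np

def pca_2d(points):
    """Return (components, scores) for an (n, 2) array of points.

    components: rows are the unit vectors PC1 and PC2.
    scores: each point's coordinates in the PC system.
    """
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)            # move the cloud to the origin
    cov = np.cov(centered, rowvar=False)     # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    components = eigvecs[:, order].T
    scores = centered @ components.T         # new (x, y) value per point
    return components, scores

# Hypothetical (height cm, weight kg) pairs, for illustration only.
data = [(150, 50), (160, 58), (170, 65), (180, 77), (190, 84)]
components, scores = pca_2d(data)
```

Each row of `scores` is one point's new (x,y) value. The sign of each component is arbitrary, so a flipped axis describes the same coordinate system.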
Drag the points around in the following visualization to see the PC coordinate system adjust.
PCA is useful for eliminating dimensions. Below, we've plotted the data along a pair of lines: one composed of the x-values and another of the y-values.
If we're going to only see the data along one dimension, though, it might be better to make that dimension the principal component with most variation. We don't lose much by dropping PC2 since it contributes the least to the variation in the data set.
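How much is lost by dropping PC2 can be put into numbers. The sketch below is one way to do it, assuming the same made-up (height, weight) values as a stand-in for the draggable points; the helper names `explained_variance_ratio` and `keep_first_component` are hypothetical.

```python
import numpy as np

def explained_variance_ratio(points):
    """Fraction of total variance carried by each principal component."""
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centered data give the PC variances:
    # variance_i = s_i**2 / (n - 1), sorted from largest to smallest.
    s = np.linalg.svd(centered, compute_uv=False)
    variances = s**2 / (len(X) - 1)
    return variances / variances.sum()

def keep_first_component(points):
    """Project the points onto PC1 and map them back to the original axes."""
    X = np.asarray(points, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]                      # unit vector along PC1
    scores = centered @ pc1          # one number per point
    return mean + np.outer(scores, pc1)

# Hypothetical (height cm, weight kg) pairs, for illustration only.
data = np.array([[150, 50], [160, 58], [170, 65], [180, 77], [190, 84]], dtype=float)
ratio = explained_variance_ratio(data)
approx = keep_first_component(data)
```

If `ratio[1]` is small, the points in `approx` sit close to the originals even though each one is now described by a single number.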
With three dimensions, PCA is more useful, because it's hard to see through a cloud of data. In the example below, the original data are plotted in 3D, but you can project the data into 2D through a transformation no different than finding a camera angle: rotate the axes to find the best angle. To see the "official" PCA transformation, click the "Show PCA" button. The PCA transformation ensures that the horizontal axis PC1 has the most variation, the vertical axis PC2 the second-most, and a third axis PC3 the least. Obviously, PC3 is the one we drop.
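The "camera angle" idea can be sketched as a rotation followed by dropping the last coordinate. The example below assumes a synthetic 3D cloud generated from a fixed random seed, not the data in the visualization, and uses the singular value decomposition as one common way to obtain the PCA rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cloud: stretched differently along each axis, then rotated
# by a random orthogonal matrix so the structure is hidden from view.
cloud = rng.normal(size=(200, 3)) * [5.0, 2.0, 0.5]
mix, _ = np.linalg.qr(rng.normal(size=(3, 3)))
cloud = cloud @ mix

def pca_rotate(X):
    """Rotate centered data so the axes are PC1, PC2, PC3."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T           # rows of vt are the PC directions

rotated = pca_rotate(cloud)
view_2d = rotated[:, :2]             # keep PC1 and PC2, drop PC3
axis_variances = rotated.var(axis=0, ddof=1)
```

Because the singular values come back sorted from largest to smallest, `axis_variances` is in decreasing order, which matches the PC1, PC2, PC3 ordering described above.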
What if our data have way more than 3 dimensions? Like, 17 dimensions?! In the table is the average consumption of 17 types of food in grams per person per week for every country in the UK.
The table shows some interesting variations across different food types, but overall differences aren't so notable. Let's see if PCA can eliminate dimensions to emphasize how countries differ.
Here's the plot of the data along the first principal component. Already we can see something is different about Northern Ireland.
Now look at the first and second principal components together: Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish and alcoholic drinks. It's a good sign that the structure we've visualized reflects a big fact of real-world geography: Northern Ireland is the only one of the four countries not on the island of Great Britain. (If you're confused about the differences among England, the UK and Great Britain, see: this video.)
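The same steps carry over to a table with 4 rows and 17 columns. The sketch below shows the shape of the computation only: it fills the table with random placeholder numbers from a fixed seed, not the actual consumption figures, so its output says nothing about the real countries.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder values only: 4 rows (countries) x 17 columns (food types).
# These are random numbers, NOT the consumption figures from the table.
placeholder = rng.uniform(50, 1500, size=(4, 17))

centered = placeholder - placeholder.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

scores = u * s                # row i holds country i in PC coordinates
pc1_pc2 = scores[:, :2]       # the two numbers plotted per country

# Centering removes one degree of freedom, so 4 rows can span at most
# 3 directions: no more than 3 components carry any variance.
n_nonzero = int(np.sum(s > 1e-9 * s[0]))
```

With only four observations, the 17 dimensions collapse to at most three principal components, which is part of why a one- or two-dimensional plot can capture so much of the difference between the countries.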
Replica notes and parity details live separately in docs.