Unsupervised Pattern Discovery with PCA and t-SNE

Compressing high-dimensional education and air-pollution data to reveal hidden structure

Overview

We wanted to see whether squeezing many measurements down to a few dimensions could reveal natural groupings in the data.

Methodology

flowchart LR
  A[Raw Data] --> B[Scale / Standardize]
  B --> C["Reduce: PCA / t-SNE"]
  C --> D["Cluster: K-Means / Hierarchical"]
  D --> E["Evaluate: Elbow / Silhouette"]
  E --> F[Interpret Clusters]
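
The flow above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_blobs`), the component count and the range of `k` are arbitrary choices, and only the K-Means plus silhouette branch of the diagram is shown. It is not the project's actual code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real tables (assumption): 300 rows, 8 features.
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=0)

# Scale -> reduce
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Cluster -> evaluate: fit K-Means for several k and score each with silhouette.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    silhouette_by_k[k] = silhouette_score(X_reduced, labels)

# The k with the highest silhouette is a candidate, not a verdict.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, silhouette_by_k)
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. The chosen `k` still needs the final "Interpret Clusters" step to confirm that the groups mean something.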

The Data

We cleaned both datasets (education and air pollution), dropping ID columns that carry no analytical information and filling in missing readings.
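
A minimal sketch of that cleaning step, assuming pandas and median imputation. The write-up does not say which fill strategy was used, so the median is an assumption, and the column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the real column names are not given in the write-up.
raw = pd.DataFrame({
    "record_id": [101, 102, 103, 104],
    "pm25": [12.0, np.nan, 30.0, 18.0],
    "no2": [40.0, 35.0, np.nan, 45.0],
})

def clean(df, id_columns):
    """Drop identifier columns, then fill numeric gaps with each column's median."""
    features = df.drop(columns=id_columns)
    return features.fillna(features.median(numeric_only=True))

cleaned = clean(raw, ["record_id"])
print(cleaned)
```

The median is less sensitive to extreme readings than the mean, which is one reason it might be preferred for skewed measurements.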

Exploratory Analysis

Before reducing anything, we checked how each measurement was distributed and which ones moved together.
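
Both checks can be sketched with pandas. The data here is synthetic, the column names `x1` to `x3` are placeholders, and the 0.8 correlation cut-off is an arbitrary illustration rather than a threshold taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic example (assumption): x2 is built to track x1, x3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "x1": base,
    "x2": 2.0 * base + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# Distribution check: location, spread, and skew of every column.
summary = df.describe().T[["mean", "std", "min", "max"]]
summary["skew"] = df.skew()

# Co-movement check: list feature pairs whose |correlation| exceeds the cut-off.
corr = df.corr()
cols = list(corr.columns)
strong_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > 0.8
]
print(summary)
print(strong_pairs)
```

Strongly correlated pairs are a hint that PCA has redundancy to remove, since such columns carry overlapping information.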

Dimensionality Reduction (PCA)

PCA recombined the columns into a handful of summary scores that still carry most of the information.
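
One common way to decide how many summary scores to keep is to look at cumulative explained variance. The sketch below assumes scikit-learn, synthetic data with two hidden factors, and a 90% variance target. All three are illustrative assumptions, not values reported by the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 6 observed columns driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Fit all components, then find the smallest count reaching the target.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
target = 0.90  # hypothetical threshold
n_keep = int(np.searchsorted(cumulative, target) + 1)

# Project the rows onto the retained components.
component_scores = PCA(n_components=n_keep).fit_transform(X_scaled)
print(n_keep, cumulative.round(3))
```

Each retained component is a weighted combination of the original columns, and `explained_variance_ratio_` reports the share of total variance each one carries.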

Visualization & Clusters (t-SNE)

t-SNE projected the data onto a two-dimensional map so we could see which records group together.
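
A minimal sketch of producing such a map with scikit-learn's `TSNE`. The data is synthetic, and the perplexity, initialization, and seed are illustrative assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 8 features, 4 blobs.
X, _ = make_blobs(n_samples=200, n_features=8, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Embed into 2-D; perplexity roughly sets how many neighbors each point attends to.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X_scaled)

# Each row of `embedding` is an (x, y) position on the flat map.
print(embedding.shape, bool(np.isfinite(embedding).all()))
```

Plotting the two columns of `embedding` as a scatter plot gives the map. Distances between well-separated groups on a t-SNE map are not directly meaningful, so the layout is best read for which points sit together rather than how far apart the groups are.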

Key Takeaways

PCA compressed the data efficiently, and t-SNE exposed four distinct pollution profiles, but clear clusters appeared only where the data had genuine structure.

Attribution

This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.