← All Projects

Genomic Data Clustering: Decoding the Genetic Code

Using unsupervised learning to rediscover that DNA reads in three-letter codons

Overview

We wanted to see whether a computer could spot hidden patterns in raw DNA without being told what to look for.

Methodology

flowchart LR
  A[Raw Data] --> B[Scale / Standardize]
  B --> C["Reduce: PCA / t-SNE"]
  C --> D["Cluster: K-Means / Hierarchical"]
  D --> E["Evaluate: Elbow / Silhouette"]
  E --> F[Interpret Clusters]

The Data

The input is a single bacterial genome stored as a plain text file of DNA letters.

Approach: From Letters to Features

We turned the DNA text into numbers by counting how often short letter combinations appear, then compressed those counts into a 2-D map.

Clusters Discovered

When we looked at three-letter words, the data fell into clear, separate groups all on its own.

Interpretation

The three-letter groups the model found match the codons that biology already knows DNA is read in.

Key Takeaways

Simple counting plus clustering let a machine rediscover the structure of the genetic code on its own.

Tech Stack

Attribution

This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.