Genomic Data Clustering: Decoding the Genetic Code

Overview

We wanted to see whether a computer could spot hidden patterns in raw DNA without being told what to look for.

DNA stores all the information a living cell needs, but a raw sequence is just an unbroken string of A, C, G, and T letters.
With no spaces between words, it is unclear what the meaningful units of the genetic code actually are.
Objective: let an unsupervised model find structure in the genome of the bacterium Caulobacter crescentus (ccrescentus).
Success means surfacing biological organization purely from letter patterns, with no labels or prior domain knowledge.

Methodology

flowchart LR
  A[Raw Data] --> B[Scale / Standardize]
  B --> C["Reduce: PCA / t-SNE"]
  C --> D["Cluster: K-Means / Hierarchical"]
  D --> E["Evaluate: Elbow / Silhouette"]
  E --> F[Interpret Clusters]

The Data

The input is a single bacterial genome stored as a plain text file of DNA letters.

The ccrescentus genome is supplied in FASTA format, the standard text encoding used in bioinformatics.
We read the file, stripped out headers and unwanted characters, and stored the genome as one continuous string.
The only alphabet present is the four DNA bases A, C, G, and T, with no natural word boundaries.
This single long sequence becomes the raw material for building numerical features.

Approach: From Letters to Features

We turned the DNA text into numbers by counting how often short letter combinations appear, then compressed those counts into a 2-D map.

Split the genome into fragments and counted the frequency of every possible word for word lengths of 1, 2, 3, and 4 letters.
Built a frequency matrix (a DataFrame) where each column is a k-mer and each value is its count per fragment.
Standardized every column with StandardScaler so all features share a common scale for these distance-based methods.
Applied PCA to reduce the high-dimensional frequency tables to the first two principal components for visualization.
Tried multiple word lengths because, without domain knowledge, the most meaningful unit was not known in advance.

Clusters Discovered

When we looked at three-letter words, the data fell into clear, separate groups all on its own.

Only the 3-letter word frequencies produced clearly identifiable point clouds in the first two principal components.
Six distinct outer poles emerged, plus a central group of points sitting near the middle.
Ran K-means clustering on the standardized 3-letter frequency tables with n_clusters set to 7.
The seventh, central cluster captures fragments far from the six outer centroids, hinting at genes that carry no information.
1- and 4-letter words showed no comparable structure, isolating 3 letters as the meaningful length.

Interpretation

The three-letter groups the model found match the codons that biology already knows DNA is read in.

These 3-letter words are exactly what biologists call codons, the units the cell reads to build proteins.
Six clean clusters correspond to reading-frame and strand structure inherent in how codons are arranged.
The central cluster aligns with non-coding stretches that do not translate into a clear codon signal.
An unsupervised model independently recovered a known biological fact straight from raw letter statistics.

Key Takeaways

Simple counting plus clustering let a machine rediscover the structure of the genetic code on its own.

Unsupervised learning (K-means) and dimensionality reduction (PCA) revealed real biology with no labels.
Feature engineering mattered most: counting 3-letter k-mers was the key that exposed the codon structure.
Standardizing features before PCA and K-means is essential because both are distance-based algorithms.
The workflow validates a known discovery and shows how to probe an unknown sequence for hidden units.
Built with: pandas, numpy, tqdm, scikit-learn (PCA, KMeans, StandardScaler), matplotlib

Tech Stack

pandas — data wrangling and tabular manipulation
numpy — fast numerical arrays
scikit-learn — modeling, pipelines, and evaluation
matplotlib — plotting

Attribution

This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.