Overview
We wanted to sort countries into meaningful groups based on how developed and well-off they are.
- Governments and NGOs need data-driven ways to identify which countries most need development aid.
- Goal: group 167 countries by socio-economic and health indicators using unsupervised clustering.
- No labels exist, so clusters must be discovered purely from patterns in the data.
- Compare multiple algorithms to find groupings that are distinct and actionable.
Methodology
flowchart LR A[Raw Data] --> B[Scale / Standardize] B --> C["Reduce: PCA / t-SNE"] C --> D["Cluster: K-Means / Hierarchical"] D --> E["Evaluate: Elbow / Silhouette"] E --> F[Interpret Clusters]
The Data
Each country is described by nine numbers covering its economy, trade, and population health.
- 167 countries, 10 columns (9 numeric features plus country name), with no missing values or duplicates.
- Features include child mortality, exports, imports, health spend, income, inflation, life expectancy, fertility, GDP per capita.
- Child mortality ranges widely from 2.6 to 208 deaths per 1000 live births (mean approx. 38).
- Most variables are right-skewed with outliers; life expectancy is the only left-skewed feature.
- Features were standardized before clustering since distance-based methods are scale-sensitive.


Exploratory Analysis
Richer countries clearly tend to be healthier, and poverty tracks closely with high child mortality.
- Strong positive correlation between GDP per capita and income, as expected.
- Life expectancy rises with GDP per capita: people live longer in richer countries.
- Life expectancy is strongly negatively correlated with child mortality.
- High-fertility countries tend to have larger populations and lower per-capita income.
- Exports and imports span a huge range, with maxima near 200% of GDP.


Clusters Discovered
Several methods all pointed to roughly three groups: rich, poor, and a large middle.
- K-Means elbow plot dipped steadily from 2 to 8 with no clear elbow; silhouette score peaked at K=3.
- K-Means gave a skewed split: a tiny 3-country high-income cluster versus 100+ in the largest group.
- K-Medoids, GMM each found 3 clusters: high income, low income, and a large 'everything else' group.
- Hierarchical (complete linkage) dendrogram cut at distance approx. 9 yielded 4 clusters.
- DBSCAN returned 4 clusters, isolating extreme outliers (cluster -1) from compact core groups.


Interpretation & Recommendations
The cleanest grouping separates struggling countries that need aid from prosperous trade hubs.
- High-income cluster: 3 outlier trade hubs (Luxembourg, Malta, Singapore) with the highest import/export ratios.
- Low-income cluster: highest child mortality, trade deficits, inflation, and lowest GDP and net income.
- Hierarchical clustering isolated Nigeria alone, driven by its extreme 104% inflation rate.
- K-Medoids recommended as the practical choice: its extreme clusters are the most distinct from each other.
- Aid and development efforts should prioritize the low-income cluster's high-mortality, low-GDP nations.


Key Takeaways
Comparing five clustering methods produced a robust, three-tier view of global development.
- Five algorithms (K-Means, K-Medoids, GMM, Hierarchical, DBSCAN) converged on a rich/poor/middle structure.
- Best algorithm depends on use case, but K-Medoids gave the most distinct, interpretable clusters.
- Scaling and outlier awareness were essential since distance metrics drive every method.
- A few extreme economies (tiny trade hubs, high-inflation Nigeria) repeatedly split off as their own groups.
- Built with: pandas, numpy, matplotlib, seaborn, scikit-learn, scikit-learn-extra, scipy
More Visualizations




Tech Stack
- pandas — data wrangling and tabular manipulation
- numpy — fast numerical arrays
- scikit-learn — modeling, pipelines, and evaluation
- seaborn — statistical visualization
- matplotlib — plotting
- scipy — scientific computing
Attribution
This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.