Book Recommendation System

Overview

I built a system that suggests books a reader is likely to enjoy based on how people have rated books in the past.

With online retail growing, personalized recommendations drive discovery, engagement, and sales for book retailers.
Most user-book pairs are unrated, so the goal is to predict which unseen books a user would rate highly.
Objective: build and compare recommenders that turn sparse rating data into a ranked top-N book list per user.
Also address the cold-start problem of recommending to brand-new users with no rating history.

Methodology

flowchart LR
  A["User-Item Ratings"] --> B[EDA & Filtering]
  B --> C["Approaches: Popularity / Collaborative Filtering / SVD"]
  C --> D["Evaluate: RMSE / Precision@K"]
  D --> E[Top-N Recommendations]

The Data: Users, Books & Ratings

I started with over a million book ratings and cleaned them down to the meaningful ones before modeling.

Merged ratings and book datasets into 1,149,780 observations across 7 columns of user, book, and rating fields.
Ratings use a 1-10 scale; a rating of 0 dominated the raw data (~700K rows), so it was treated as missing and dropped.
After removing the 0 ratings, 433,671 genuine ratings remained from 77,805 users across 185,973 unique books.
The user-book matrix is extremely sparse: only 433,671 of a possible ~14.5 billion user-book pairs (about 0.003%) are filled in.
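
The cleaning and sparsity figures above follow from simple arithmetic. The sketch below reproduces them in plain Python (the project itself used pandas); the variable names and the tiny `toy_ratings` sample are illustrative assumptions, not taken from the project code.

```python
# Counts reported in this section.
total_rows = 1_149_780      # merged observations before cleaning
kept_ratings = 433_671      # ratings left after dropping the 0s
n_users = 77_805
n_books = 185_973

# Rows removed by cleaning (the section attributes these to rating 0).
dropped_rows = total_rows - kept_ratings
possible_pairs = n_users * n_books        # cells in the full user-book matrix
density = kept_ratings / possible_pairs   # fraction of cells that are filled

print(f"rows dropped: {dropped_rows:,}")
print(f"possible user-book pairs: {possible_pairs:,}")
print(f"matrix density: {density:.4%}")

# The same filter on a made-up sample of (user_id, book_id, rating) rows.
toy_ratings = [(1, "a", 0), (1, "b", 8), (2, "a", 10), (2, "c", 0)]
explicit = [row for row in toy_ratings if row[2] != 0]
```

A density this low is why the later models work from the observed ratings only rather than from a dense user-book table.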

Exploratory Analysis

I explored how ratings are spread out and found that a few popular books get most of the attention.

After cleaning, rating 8 was most common (~100K), followed by ratings 10 and 7 (~80K each); ratings 1-4 were rare.
The most-rated book (bookid 0316666343) was rated by 707 users, most of whom gave it an 8, 9, or 10.
Power user 11676 rated 8,524 books, showing a small set of very active raters drives much of the data.
User-book interaction counts are heavily right-skewed: very few books have many ratings, most have few.
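
The skew described above shows up as soon as ratings are counted per book and per user. This is a minimal sketch on made-up rows, using only the standard library; the sample data and variable names are assumptions for illustration.

```python
from collections import Counter
from statistics import median

# Made-up (user_id, book_id, rating) rows standing in for the cleaned data.
ratings = [
    (1, "A", 8), (2, "A", 9), (3, "A", 10), (4, "A", 8), (5, "A", 7),
    (1, "B", 6), (2, "B", 8),
    (1, "C", 9),
    (3, "D", 5),
    (1, "E", 10),
]

# How many ratings each book received and how many each user gave.
ratings_per_book = Counter(book for _, book, _ in ratings)
ratings_per_user = Counter(user for user, _, _ in ratings)

most_rated_book, top_count = ratings_per_book.most_common(1)[0]
median_count = median(ratings_per_book.values())

print(f"most-rated book: {most_rated_book} ({top_count} ratings)")
print(f"median ratings per book: {median_count}")
```

A maximum far above the median is the signature of a right-skewed distribution: one book collects most of the ratings while the typical book has very few.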

Recommender Approaches

I tried four methods, from a simple popularity ranking to advanced models that learn hidden user and book patterns.

Model 1 - Rank-based: recommends the highest-rated books by average rating, subject to a minimum-interactions threshold; it needs no user history, so it also covers cold-start users.
Model 2 - User-user collaborative filtering: cosine similarity with KNN to find like-minded users (scikit-surprise KNNBasic).
Model 3 - Item-item collaborative filtering: similarity computed between books rather than between users.
Model 4 - Matrix factorization (SVD): learns latent user and book features; the collaborative-filtering and SVD models were tuned via GridSearchCV.
Evaluated with RMSE plus precision@k and recall@k, using rating 7 as the relevance threshold.
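
Model 1 above is simple enough to sketch in a few lines. The version below is a plain-Python illustration, not the project's code: the function name `top_n_popular`, the threshold value, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict

def top_n_popular(ratings, n=5, min_interactions=2):
    """Rank books by mean rating, keeping only books with enough ratings."""
    by_book = defaultdict(list)
    for _user, book, rating in ratings:
        by_book[book].append(rating)

    scored = [
        (book, sum(vals) / len(vals), len(vals))
        for book, vals in by_book.items()
        if len(vals) >= min_interactions
    ]
    # Highest mean first; ties broken by rating count, then book id (assumed rule).
    scored.sort(key=lambda t: (-t[1], -t[2], t[0]))
    return [book for book, _mean, _count in scored[:n]]

# Made-up (user_id, book_id, rating) rows.
ratings = [
    (1, "A", 8), (2, "A", 9), (3, "A", 10),
    (1, "B", 10),                 # a single rating: removed by the threshold
    (2, "C", 7), (3, "C", 8),
]
recommended = top_n_popular(ratings, n=2, min_interactions=2)
print(recommended)
```

The threshold keeps a book with one perfect rating from outranking a book that many readers rated highly, and because nothing here depends on the target user, the same list can be served to someone with no history.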

Results & Recommendations

Tuning improved every model, and item-based filtering gave the most accurate rating predictions.

Baseline user-user CF: RMSE 1.84 with ~0.81 precision and recall; tuning cut RMSE to 1.68 and raised F1 from 0.81 to 0.86.
Item-item CF was strongest on accuracy: RMSE improved from 1.62 to 1.58 with F1 around 0.80 after tuning.
Matrix factorization (SVD) F1 beat the user-user baseline but tuning yielded only marginal further gains.
Spot-checked individual predictions (e.g. user 1326, book 12344: predicted ~7.99 vs actual 8) and used corrected ratings to break ties in the ranking.
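
The precision, recall, and F1 figures above depend on how "relevant" and "recommended" are defined. The sketch below follows a convention commonly used with scikit-surprise (its FAQ has a similar example): a book is relevant if its true rating meets the threshold and recommended if its estimated rating does. Treating the threshold as inclusive (rating >= 7) is an assumption, as are the function names and the sample predictions.

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=7):
    """Average per-user precision@k and recall@k.

    predictions: iterable of (user_id, true_rating, estimated_rating).
    """
    per_user = defaultdict(list)
    for user, true_r, est in predictions:
        per_user[user].append((est, true_r))

    precisions, recalls = [], []
    for rows in per_user.values():
        rows.sort(key=lambda r: r[0], reverse=True)   # best estimates first
        top_k = rows[:k]
        n_relevant = sum(true_r >= threshold for _, true_r in rows)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(
            est >= threshold and true_r >= threshold for est, true_r in top_k
        )
        # Assumed convention: a ratio with a zero denominator counts as 0.
        precisions.append(n_hits / n_recommended if n_recommended else 0.0)
        recalls.append(n_hits / n_relevant if n_relevant else 0.0)

    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Made-up predictions: (user_id, true_rating, estimated_rating).
preds = [
    ("u1", 8, 7.5), ("u1", 6, 7.2), ("u1", 9, 6.0),
    ("u2", 10, 8.1), ("u2", 7, 7.4),
]
p, r = precision_recall_at_k(preds, k=2, threshold=7)
print(p, r, f1_score(p, r))
```

Averaging per user rather than over all predictions keeps very active raters from dominating the metric, which matters in data where a few users contribute thousands of ratings.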

Key Takeaways

Combining a simple popularity baseline with personalized models gives reliable book recommendations even from very sparse data.

Rank-based recommendations are a cheap, robust fallback for new users facing the cold-start problem.
Once a user has rating history, personalized collaborative filtering and matrix factorization can tailor recommendations in a way a popularity ranking cannot.
Hyperparameter tuning with GridSearchCV lowered RMSE for both collaborative-filtering models and raised user-user F1 from 0.81 to 0.86; gains for SVD were marginal.
Corrected ratings that weight the number of raters produce more trustworthy top-N rankings than raw averages.
Built with: pandas, numpy, matplotlib, seaborn, scikit-learn, scikit-surprise

Tech Stack

pandas — data wrangling and tabular manipulation
numpy — fast numerical arrays
scikit-learn — modeling, pipelines, and evaluation
seaborn — statistical visualization
matplotlib — plotting
scikit-surprise — collaborative-filtering recommenders

Attribution

This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.