Audio MNIST: Spoken-Digit Recognition with a Neural Network

Classifying spoken digits 0-9 from raw .wav audio using MFCC features and a Keras ANN

Overview

We trained a neural network to take a short recording of someone saying a digit from zero to nine and predict which digit was spoken.

Methodology

flowchart LR
  A[".wav Audio Clips"] --> B["MFCC Feature Extraction: 40 features per clip"]
  B --> C["ANN: Dense layers"]
  C --> D[Dense + Softmax]
  D --> E[Train w/ Early Stopping]
  E --> F["Evaluate: Accuracy and Confusion Matrix"]

The Data (Audio)

The dataset consists of thousands of short audio clips of people saying digits. We began by inspecting the clips as waveforms.
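As a rough illustration of what "looking at a clip" involves, the sketch below reads a .wav file into an array of samples that could then be plotted as a waveform. It is not the project's loading code: the helper names `read_wav_mono` and `make_demo_wav` are hypothetical, the 16-bit mono PCM format and 8 kHz sample rate are assumptions, and a synthetic tone stands in for a real recording.

```python
import io
import wave

import numpy as np


def read_wav_mono(source):
    """Read a 16-bit mono PCM .wav; return (samples scaled to [-1, 1], sample rate)."""
    with wave.open(source, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        sr = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    return samples, sr


def make_demo_wav(sr=8000, seconds=0.5, freq=440.0):
    """Build a synthetic tone as an in-memory .wav (a stand-in for a real clip)."""
    t = np.arange(int(sr * seconds)) / sr
    pcm = (0.5 * np.sin(2 * np.pi * freq * t) * 32767).astype("<i2")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sr)
        wf.writeframes(pcm.tobytes())
    buf.seek(0)
    return buf


samples, sr = read_wav_mono(make_demo_wav())
print(f"{len(samples)} samples at {sr} Hz ({len(samples) / sr:.2f} s)")
```

With a real dataset, `read_wav_mono` would be given a file path instead of the in-memory buffer, and `samples` could be passed to any plotting library to view the waveform.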

Feature Extraction (MFCC)

Rather than feeding the raw waveform to the model, we summarized each clip as 40 MFCC-based features that capture the overall shape of its sound.
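To make the idea of "40 numbers per clip" concrete, here is a minimal NumPy sketch of an MFCC computation: frame the signal, take a power spectrum, pool it with mel-spaced triangular filters, take the log, apply a DCT, and average each coefficient over time. This is an illustration under stated assumptions, not the project's code, which more likely called an audio library. The function name `mfcc_summary`, the frame size, hop, number of mel bands, and the choice to average over time are all assumptions.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lo, ctr, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs[None, :] - lo) / (ctr - lo)
    falling = (hi - fft_freqs[None, :]) / (hi - ctr)
    return np.maximum(0.0, np.minimum(rising, falling))


def mfcc_summary(signal, sr, n_mfcc=40, n_mels=64, n_fft=512, hop=160):
    """Return one vector of n_mfcc values: each MFCC averaged over all frames."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < n_fft:  # pad very short clips up to one full frame
        signal = np.pad(signal, (0, n_fft - len(signal)))

    # Slice the clip into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)

    # Power spectrum per frame, pooled into mel bands, then log-compressed.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

    # Unnormalized DCT-II across mel bands; keep the first n_mfcc coefficients.
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    coeffs = log_mel @ dct_basis.T  # shape (n_frames, n_mfcc)

    return coeffs.mean(axis=0)


# Synthetic one-second tones stand in for real recordings.
sr = 8000
t = np.arange(sr) / sr
features_low = mfcc_summary(np.sin(2 * np.pi * 300 * t), sr)
features_high = mfcc_summary(np.sin(2 * np.pi * 1500 * t), sr)
print(features_low.shape)
```

Averaging over frames discards timing information but gives every clip a fixed-length input regardless of its duration, which is what a plain dense network needs.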

Model Architecture

We used a simple feed-forward neural network (ANN) with several dense layers to predict the digit from the 40 features.
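A network of this kind might be defined in Keras roughly as follows. This is a hedged sketch rather than the architecture actually trained: the function name `build_digit_ann`, the two hidden layers, their sizes (128 and 64), the ReLU activations, the Adam optimizer, and the loss are assumptions chosen for illustration. Only the 40-feature input and the 10 digit classes come from the project description.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_digit_ann(n_features=40, n_classes=10):
    """Small dense classifier; hidden layer sizes are illustrative guesses."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        # One output per digit; softmax turns scores into class probabilities.
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        # Assumes integer labels 0-9 rather than one-hot vectors.
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_digit_ann()
model.summary()
```

Training would then be a call to `model.fit` on the feature matrix and integer digit labels.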

Results

Even after a short training run, the network correctly identified the spoken digit most of the time.
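For reference, accuracy and a confusion matrix can be computed from predicted and true labels as sketched below. The labels here are made up purely to show the mechanics and are not the project's results. The helper names `confusion_matrix` and `accuracy` are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes=10):
    """cm[i, j] counts clips whose true digit is i and predicted digit is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def accuracy(cm):
    """Fraction of clips on the diagonal, i.e. predicted correctly."""
    return np.trace(cm) / cm.sum()


# Made-up labels for illustration only: one "9" is misread as "4".
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9])
y_pred = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 4])

cm = confusion_matrix(y_true, y_pred)
print(f"accuracy: {accuracy(cm):.3f}")
```

Off-diagonal entries show which digits get confused with which, something a single accuracy figure hides.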

Key Takeaways

Summarizing each clip with the right features (MFCCs) let a small, simple network recognize spoken digits well.

Attribution

This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.