Car Breakdown Prediction Model

Context

This was a group project for my Artificial Intelligence course, where I worked in a team of three. We were given a real-world problem: using vehicle data to predict failures before they happen. The goal was to help a second-hand car dealer decide whether to resell a car, service it, or send it to auction. To do this we had to answer the question: "What is the risk that this car will break down within the next 30 days?"

What We Built

We developed a complete machine learning workflow in a Jupyter Notebook that included:

Exploratory Data Analysis (EDA): A deep dive into the dataset to find patterns and outliers.
Custom Data Pipeline: A structured way to process raw data based on our EDA findings.
Model Comparison: We built and tested both Random Forest and Gradient Boosting models.
Feature Engineering: We refined our data inputs to help the models perform better.
Evaluation Framework: A final comparison focused on minimizing false negatives, so that we missed as few potential breakdowns as possible.
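
To give a feel for the model comparison step, here is a minimal sketch rather than our actual notebook code. It assumes synthetic data from make_classification as a stand-in for the vehicle dataset, with roughly 10% breakdown cases, and it scores both models on recall and on the raw count of false negatives. The step names, dictionary keys, and hyperparameters are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic, imbalanced data standing in for the real vehicle records.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The two model families compared in the project; settings here are illustrative.
candidates = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    # Same preprocessing for every candidate, so the comparison is fair.
    pipeline = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("model", model)]
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[name] = {
        "recall": recall_score(y_test, y_pred),
        "false_negatives": int(fn),  # breakdowns the model failed to flag
    }

for name, metrics in results.items():
    print(name, metrics)
```

Reporting the false-negative count next to recall keeps the comparison tied to the dealer's question: how many cars that will break down slipped through unflagged.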

Why We Built It

A second-hand car dealer takes on a significant risk every time they put a car on the lot. If they sell a car that breaks down a week later, they lose money on the warranty and damage their reputation. We chose this approach because we wanted to give the dealer a "red flag" system. Instead of guessing which cars were reliable, we wanted a data-driven way to catch the lemons before they ever reached a customer.

How We Built It

The biggest challenge was moving from a high Kaggle score to a model that actually worked for our specific goal. We spent a lot of time on feature engineering after realizing that our initial models were missing too many breakdown cases. Because our priority was minimizing false negatives, we had to carefully tune our models to be more sensitive to the warning signs of a failure.
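
One common way to make a classifier more sensitive is to lower the probability threshold at which a car gets flagged. The sketch below illustrates that idea on synthetic data. It is an assumption about how such tuning can be done, not a record of the exact steps we took, and the helper scores_at and the threshold values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic, imbalanced data standing in for the real vehicle records.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Predicted probability of the positive class (breakdown) for each test car.
breakdown_proba = model.predict_proba(X_test)[:, 1]


def scores_at(threshold):
    """Return (recall, precision) when flagging cars at or above the threshold."""
    y_pred = (breakdown_proba >= threshold).astype(int)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    return recall, precision


# Lowering the threshold flags more cars: recall cannot go down,
# but precision usually does, so the dealer services more healthy cars.
for threshold in (0.5, 0.3, 0.1):
    recall, precision = scores_at(threshold)
    print(f"threshold={threshold:.1f} recall={recall:.2f} precision={precision:.2f}")
```

The trade-off is explicit here: every step down in the threshold catches at least as many real breakdowns, at the cost of more false alarms.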

My Personal Contribution

While we made all our research and strategy decisions as a team, I took the lead on the technical side:

Primary Developer: I was responsible for writing most of the code once the team agreed on our direction.
Pipeline Construction: I built the data pipeline to ensure our training and testing data remained consistent.
Model Implementation: I handled the setup and tuning for both the Random Forest and Gradient Boosting algorithms.
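
As an illustration of what keeping training and testing data consistent means in practice, here is a minimal sketch using scikit-learn's ColumnTransformer. The column names (mileage_km, age_years, fuel_type) and the tiny data frames are hypothetical and not taken from our dataset. The key point is that the preprocessing is fitted on the training data only and then reused unchanged on the test data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns and values, for illustration only.
train = pd.DataFrame(
    {
        "mileage_km": [120000, 45000, None, 210000],
        "age_years": [8, 3, 5, 12],
        "fuel_type": ["diesel", "petrol", "petrol", "diesel"],
    }
)
test = pd.DataFrame(
    {
        "mileage_km": [90000, None],
        "age_years": [6, 10],
        "fuel_type": ["electric", "diesel"],  # "electric" never appears in train
    }
)

# Numeric columns: fill gaps with the training median, then standardize.
numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["mileage_km", "age_years"]),
        # Unseen categories become all-zero rows instead of raising an error.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
    ]
)

# Fit on training data only, then apply the same fitted transform to test data.
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)

print(X_train.shape, X_test.shape)
```

Because the medians, scaling factors, and category list all come from the training set, both outputs have the same columns in the same order, and nothing from the test set leaks into the preprocessing.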

Tech Stack

Language: Python
Libraries: Pandas, NumPy, Matplotlib, Scikit-learn
Models: Random Forest, Gradient Boosting
Tools: Jupyter Notebook, Git

Key Takeaways

This project taught me that you have to truly understand your data before you start training. I learned that a high accuracy score can be a trap. If the model has a high score but still misses the specific cases you care about, it isn't the right tool for the job. In the future, I would spend even more time on the EDA phase since that is where the most important insights usually hide.
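
A small worked example makes the accuracy trap concrete. The numbers are made up for illustration: out of 100 cars, 5 break down, and a "model" that never predicts a breakdown still scores 95% accuracy while catching none of them.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 5 cars break down (1), 95 do not (0).
y_true = np.array([1] * 5 + [0] * 95)

# A useless model that always predicts "no breakdown".
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints: accuracy=0.95, recall=0.00
```

This is why we judged our models on false negatives and recall rather than on accuracy alone.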
