Car Breakdown Prediction Model
Context
This was a group project for my Artificial Intelligence course, where I worked in a team of three. We were given a real-world problem: using vehicle data to predict failures before they happen. The goal was to help a second-hand car dealer decide whether to resell a car, service it, or send it to auction. To do this, we had to answer the question: "What is the risk that this car will break down within the next 30 days?"
What We Built
We developed a complete machine learning workflow in a Jupyter Notebook that included:
- Exploratory Data Analysis (EDA): A deep dive into the dataset to find patterns and outliers.
- Custom Data Pipeline: A structured way to process raw data based on our EDA findings.
- Model Comparison: We built and tested both Random Forest and Gradient Boosting models.
- Feature Engineering: We refined our data inputs to help the models perform better.
- Evaluation Framework: A final comparison focused on minimizing false negatives, so that we missed as few potential breakdowns as possible.
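The comparison step in this workflow can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the data is a synthetic stand-in generated with `make_classification`, the median imputer is a placeholder for the real preprocessing, and the hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the vehicle data (assumption): about 10% of cars break down.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical model settings; the real project tuned these separately.
models = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # Wrapping preprocessing and model in one Pipeline means the imputer is
    # fitted on the training split only and reused unchanged on the test split.
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("model", model),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # False negatives are breakdowns the model failed to flag.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[name] = {
        "recall": recall_score(y_test, y_pred),
        "false_negatives": int(fn),
        "true_positives": int(tp),
    }

for name, scores in results.items():
    print(name, scores)
```

Both models are scored on recall and the raw false-negative count rather than accuracy, since a missed breakdown is the costly mistake for the dealer.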
Why We Built It
A second-hand car dealer takes on a significant risk every time they put a car on the lot. If they sell a car that breaks down a week later, they lose money on the warranty and damage their reputation. We chose this approach because we wanted to give the dealer a "red flag" system. Instead of just guessing which cars were reliable, we wanted a data-driven way to catch the lemons before they ever reached a customer.
How We Built It
The biggest challenge was moving from a high Kaggle score to a model that actually worked for our specific goal. We spent a lot of time on feature engineering after realizing that our initial models were missing too many breakdown cases. Because our priority was minimizing false negatives, we had to carefully tune our models to be more sensitive to the warning signs of a failure.
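One common way to make a classifier more sensitive is to lower the probability threshold at which a car is flagged. The sketch below illustrates that idea only; it is an assumption about one possible tuning approach, not a record of what the team did. The data is synthetic, and `evaluate_threshold` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vehicle data (assumption): about 10% breakdowns.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
# Column 1 holds the predicted probability of the positive (breakdown) class.
breakdown_proba = model.predict_proba(X_val)[:, 1]


def evaluate_threshold(y_true, proba, threshold):
    """Flag a car when its predicted breakdown probability reaches the threshold."""
    y_pred = (proba >= threshold).astype(int)
    return {
        "threshold": threshold,
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "false_negatives": int(np.sum((y_true == 1) & (y_pred == 0))),
    }


# Sweep from the default 0.5 cut-off down to a very cautious 0.1.
thresholds = [0.5, 0.4, 0.3, 0.2, 0.1]
sweep = [evaluate_threshold(y_val, breakdown_proba, t) for t in thresholds]
for row in sweep:
    print(row)
```

Lowering the threshold can only flag more cars, so recall never goes down as the threshold drops. The trade-off is more false alarms, which shows up as lower precision.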
My Personal Contribution
While we made all our research and strategy decisions as a team, I took the lead on the technical side:
- Primary Developer: I was responsible for writing most of the code once the team agreed on our direction.
- Pipeline Construction: I built the data pipeline to ensure our training and testing data remained consistent.
- Model Implementation: I handled the setup and tuning for both the Random Forest and Gradient Boosting algorithms.
Tech Stack
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Scikit-learn
- Models: Random Forest, Gradient Boosting
- Tools: Jupyter Notebook, Git
Key Takeaways
This project taught me that you have to truly understand your data before you start training. I learned that a high accuracy score can be a trap. If the model has a high score but still misses the specific cases you care about, it isn't the right tool for the job. In the future, I would spend even more time on the EDA phase since that is where the most important insights usually hide.
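The accuracy trap is easy to see with a small worked example. The numbers below are hypothetical, chosen only to illustrate the point: a lot of 1,000 cars where 50 break down, scored by a model that never flags anything and by a more sensitive one.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Share of actual breakdowns that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical model A: predicts "no breakdown" for every car.
never_flags = {"tp": 0, "tn": 950, "fp": 0, "fn": 50}

# Hypothetical model B: catches 40 of the 50 breakdowns, with 100 false alarms.
sensitive = {"tp": 40, "tn": 850, "fp": 100, "fn": 10}

for name, counts in [("never_flags", never_flags), ("sensitive", sensitive)]:
    acc = accuracy(**counts)
    rec = recall(counts["tp"], counts["fn"])
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}")

# never_flags: accuracy=0.95, recall=0.00
# sensitive: accuracy=0.89, recall=0.80
```

Model A wins on accuracy while missing every single breakdown, which is exactly the case a dealer cannot afford.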