Yeah, this is my ML project on PCOS detection. You know how it is: in our world, girls face so many issues, especially when it comes to health. One of the most painful things they deal with is period pain, and when PCOS is added to that, it becomes even worse. Some can barely handle the pain, they go through a lot, but no one really sees that.
So I just thought: if I were a doctor, maybe I could help by giving them free treatment. But I'm not, I'm just an engineer. Still, I felt like I should do something from my side, give back whatever I know. That's why I thought of doing this project.
This project won't give a proper medical result or a final diagnosis, but it can help girls know whether there is a chance they might have PCOS. It's just for awareness. If it shows a possible risk, they can go to a doctor and confirm it. Sometimes, just knowing early can help, right?
So yeah, this project is for that. I've divided it into six small sections, and I'll explain everything clearly, step by step. You don't need a big technical background to understand this; I'll explain it in my own way, just like how I talk. So keep reading and let's go through it together.
Let’s start with the first part of our project – Data Preprocessing.
Yeah, in this part, as usual, I've done all the basic things like importing the required libraries and the dataset. After that, I cleaned up the data by dropping unnecessary columns and renaming the columns, because some of the names were inconsistent and didn't look professional. So I made them neat and clean.
Then I converted the categorical data into numerical form, which is important because machine learning models don't understand text. I also replaced the "Not a Number" (NaN) values with the column medians, and converted the data types wherever needed.
One more small step I did was removing spaces from column names, just to make it look cleaner and easier to work with in the upcoming steps.
These are all simple but important steps in data preprocessing, and I’ve explained everything clearly with code snippets. Just scroll down to see a bit of the code and output. And if you want the full code, it’s available on my GitHub page — free to use for any learning or educational purpose.
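Just to give you a feel for it, here's a rough sketch of those steps in pandas. The file name and column names below are placeholders (your copy of the dataset will have its own), so take it as a pattern, not my exact notebook code.

```python
import pandas as pd

# Load the dataset (the file name here is just a placeholder)
df = pd.read_csv("pcos_data.csv")

# Drop columns we don't need (example names, adjust to your dataset)
df = df.drop(columns=["Sl. No", "Patient File No."], errors="ignore")

# Tidy up the column names: strip spaces and make them consistent
df.columns = df.columns.str.strip().str.replace(" ", "_")

# Convert categorical (text) columns into numeric codes
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Replace NaN values with the median of each numeric column
df = df.fillna(df.median(numeric_only=True))
```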
Now we move to the second step of our project — Exploratory Data Analysis (EDA).
So yeah, EDA is all about understanding the data more deeply before building the model. In this step, I just explored the data to get some insights — like which features are important, how they are connected, and how they’re distributed.
First, I did a categorical feature analysis using bar plots. These helped me see how the categories relate to the target column. I also grouped similar categories to make the data clearer and more organized.
Then I moved on to numerical feature analysis. For that, I used histograms to check the distribution of the values. This gave me an idea about how the data is spread — whether it's skewed, balanced, or has any patterns.
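If you want to try this yourself, here's roughly how those plots can be made with seaborn and matplotlib, continuing from the cleaned df above. The column names ("Cycle_Regularity", "PCOS") are just examples, not necessarily the exact ones in my dataset.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot: how a categorical feature relates to the target column
sns.countplot(data=df, x="Cycle_Regularity", hue="PCOS")
plt.title("Cycle regularity vs PCOS")
plt.show()

# Histograms: how each numerical feature is distributed
df.hist(figsize=(14, 10), bins=20)
plt.tight_layout()
plt.show()
```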
After that, I did a correlation analysis using a heatmap. This part was really useful. It helped me see how strongly the features are related to each other and to the target column. Because later, when we decide which columns go into X (inputs) and which column is Y (output) for training and testing, we need to know which features actually matter, and correlation helps a lot with that.
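A quick sketch of that part too, again continuing from the same df and assuming the target column is called "PCOS":

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of how strongly the features are related to each other
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")
plt.title("Feature correlation heatmap")
plt.show()

# Correlation of every feature with the target, sorted
print(df.corr(numeric_only=True)["PCOS"].sort_values(ascending=False))
```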
With real-world data, things can behave differently, so doing this step properly is really important.
Now we come to the third step, Feature Engineering. Or in simple terms, let's just call it Feature Selection. Because yeah, not all features in our dataset are useful; we need to pick only the important ones.
Think about it… our data might have a lot of columns — maybe 20, 30, or even more. But when a real person uses this model, they won’t know every detail. No one knows their every health metric, right? Most people just know their age, weight, height, marital status, pregnancy status, and some details like their period cycle. These are common and easy-to-know features.
So I selected only the meaningful features from the data — these become our X values (input). And the target column (whether they might have PCOS or not) becomes our Y value (output).
After that, I split the dataset into training and testing sets using a 70:30 ratio — 70% of the data is used to train the model, and 30% is used to test how well the model is working.
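In code, that whole step is only a few lines. The feature names below are placeholders for the kind of easy-to-know inputs I mean, and the stratify part is just a small extra I'd suggest so both sets keep a similar PCOS ratio:

```python
from sklearn.model_selection import train_test_split

# Easy-to-know input features (placeholder names) become X
features = ["Age", "Weight", "Height", "BMI", "Marital_Status",
            "Pregnant", "Cycle_Length", "Cycle_Regularity"]
X = df[features]
y = df["PCOS"]  # target: possible PCOS or not

# 70:30 split for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```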
And that’s how Feature Engineering works in this project. Simple, right?
Now for the fourth step, Model Building. Okay, this is where things got a little crazy, in a good way!
You know, when we build a machine learning model, we can’t just pick one and say “Yeah, this is the one.” Because sometimes that “one” will completely mess up — either it’ll underfit or overfit or just act too smart without giving real results. So, like choosing the right outfit for the right event, I tried on a few. Actually, seven different models. 😅
I started off with good old Logistic Regression. It worked fine, gave decent accuracy, but... it kinda panicked when I gave it categorical data. It just couldn’t handle it well, and the model started overfitting. I thought, okay, let’s swipe left and move on.
Next, I tried KNN — K-Nearest Neighbors. It looked nice on paper, but when I actually used it, it was super slow, and again, overfitting! I didn't have the patience to deal with that much lag, so I moved again.
Then came SVC — Support Vector Classifier. This one had potential, but the training time was a killer and the overfitting still didn’t stop haunting me.
So, I turned to the Decision Tree. And finally, I felt something clicking! It was straightforward and gave good results. But the problem? It’s like a friend who always chooses one fixed path and refuses to explore. So it was time to meet its smarter cousin...
Random Forest — this one felt like a game-changer. Instead of just one path, it uses multiple decision trees and then mixes their answers. I was finally getting some stable, strong results. But being the curious mind I am, I thought, “Why stop here?”
So I introduced my model to XGBoost — basically, this model is a boosting genius. It takes errors seriously and keeps learning. And yeah, results were nice. But then… I thought, “Wait, why not take the best of both worlds?” — Random Forest + Boosting? That’s where XGBRF came in. This model boosted my Random Forest and honestly, I was impressed. The accuracy, the stability — it felt right.
And when I thought I had found the one, I stumbled on CatBoost Classifier. I was reading an article on Medium, and it popped up. I tried it just for fun, and surprise — even better handling of categorical data, and it matched the performance of XGBRF.
So yeah, I finally narrowed it down to XGBRF and CatBoost. These two models truly understood the vibe of my dataset. They weren’t overconfident, didn’t overfit, and stayed smart with training speed.
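If you want to replay this little model audition yourself, here's a sketch of how all of them can be trained and compared in one loop. These are mostly default settings, not my exact tuning, and comparing train vs test accuracy is a quick way to spot the overfitting I was talking about:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier

# All the models I auditioned (default-ish settings, not my exact tuning)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "XGBRF": XGBRFClassifier(eval_metric="logloss"),
    "CatBoost": CatBoostClassifier(verbose=0),
}

# Train each model and compare train vs test accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```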
Also, if you’re wondering what this whole “boosting” and “bagging” thing is — don’t worry. I’ve already written about it in my earlier blogs in the simplest way possible. If you go and check those out, all this will make sense like pieces of a puzzle falling into place. 🧩
Next comes the fifth step, Model Evaluation. So this part really shocked me. I had tried all those fancy models with full confidence: Random Forest, XGBRF, everything boosted and bagged nicely. And just out of curiosity (literally after reading one article 😅), I threw CatBoost into the mix. I didn't expect much, honestly. It was like casually inviting someone to a party you already planned, but guess what? CatBoost stole the show!
When I started evaluating the models using a confusion matrix, I couldn’t believe my eyes. All the models did decently well, sure. But CatBoost? It just stood out. Smooth handling, strong accuracy, and minimal errors. It matched, and in some cases even outperformed, my well-trained XGBRF model. And here’s the twist — CatBoost was the simplest model I applied.
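Here's roughly how that evaluation looks, continuing from the models trained in the sketch above:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Compare the two finalists on the unseen test data
for name in ["XGBRF", "CatBoost"]:
    preds = models[name].predict(X_test)
    print(f"--- {name} ---")
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
```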
So yeah, no drama — I finalized CatBoost as the model for my project. Sometimes the unexpected ones just fit best. 💫
Yeah, we finally made it to the end — model deployment! I converted my final CatBoost model into a .joblib file, plugged it into a neat little Flask app, and boom! I deployed it for free using Render. And guess what? It’s live now — ready to predict, support, and maybe even change someone’s life a little. 💡
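To give you an idea of what that looks like, here's a bare-bones sketch: one line to save the model, and a tiny Flask app that loads it and answers predictions. The route, field names, and file names are my illustrative choices here, not the exact structure of the app in the repo.

```python
import joblib
from flask import Flask, request, jsonify

# In the training notebook: save the final CatBoost model
# joblib.dump(catboost_model, "pcos_model.joblib")

# In the Flask app: load the saved model and serve predictions
app = Flask(__name__)
model = joblib.load("pcos_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [age, weight, height, ...]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"pcos_risk": int(prediction)})

if __name__ == "__main__":
    app.run()
```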
You can check it out right away — I’ve shared my GitHub link above, which includes the full code, project structure, and all the study materials I’ve used and built through this blog series. Use it however you want — it’s all there to help you learn and grow.
And hey, remember where we started? From zero — learning what machine learning even means, understanding math (even though it made us sleepy 😅), exploring supervised and unsupervised learning, and now? You’ve seen how an actual ML project works — end to end.
It’s okay if it takes you a month or two. Learn at your own pace, bit by bit. And one day, I promise, you’ll build something even better than this. Just don’t stop. Keep going.
And yeah — get ready, because in the next blog, we’ll move to our next big learning journey. More real-world projects, deep dives, and some seriously cool stuff.
Until then, happy learning, and don’t forget — your spark is enough to light up something big. 🔥
Contact me (Click Here)