This project predicts whether a person has diabetes from key health metrics such as glucose level and BMI. It covers data preprocessing, feature selection, model training, evaluation, and prediction with three scikit-learn ensemble classifiers: AdaBoost, Gradient Boosting, and Random Forest.
- Overview
- Dataset
- Features
- Libraries Used
- Data Preprocessing
- Model Training and Evaluation
- Making Predictions
- Results
- Conclusion
- Usage
- Contributing
- License
The dataset contains health information about individuals, including:
- Glucose levels
- BMI
- Gender
- Age
- Hypertension
- Heart disease
- Smoking history
- HbA1c level
- Diabetes status (target variable)
The primary features used for prediction in this project are:
- Glucose levels
- BMI
The project uses the following language and libraries:
- Python
- Pandas
- NumPy
- Scikit-learn
Data preprocessing consists of the following steps:
- Importing Libraries: Import necessary libraries for data manipulation and machine learning.
- Loading Dataset: Load the dataset into a pandas DataFrame.
- Checking for Null Values: Identify and handle any missing values.
- Checking for Duplicate Values: Identify and remove duplicate entries.
- Checking Data Types: Ensure all data types are correct for analysis.
- Generating Statistical Summaries: Summarize the data using descriptive statistics.
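The preprocessing steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the small in-memory DataFrame stands in for the real dataset, the column names `bmi` and `blood_glucose_level` are assumptions, and dropping rows is only one possible way to handle missing values.

```python
import pandas as pd

# Made-up stand-in for the real dataset; column names are assumptions.
# With the real data this step would typically be a pd.read_csv call.
df = pd.DataFrame({
    "bmi": [25.19, 27.32, 27.32, None, 31.50],
    "blood_glucose_level": [140, 80, 80, 155, 200],
    "diabetes": [0, 0, 0, 0, 1],
})

# Check for null values, then drop the affected rows (one possible strategy).
print(df.isnull().sum())
df_new = df.dropna()

# Check for duplicate rows, then remove them.
print(df_new.duplicated().sum())
df_new = df_new.drop_duplicates().reset_index(drop=True)

# Check data types and generate descriptive statistics.
print(df_new.dtypes)
print(df_new.describe())
```

The name `df_new` is used here only so that the result lines up with the DataFrame referenced in the training code.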
Split the data into training and testing sets:
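from sklearn.model_selection import train_test_split
# Keep only the predictive features by dropping every other column, including the target
# (df_new is assumed to be the cleaned DataFrame from the preprocessing steps)
X = df_new.drop(['gender', 'age', 'hypertension', 'heart_disease', 'smoking_history', 'HbA1c_level', 'diabetes'], axis=1)
# Target variable: diabetes status
y = df_new['diabetes']
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)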
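To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up data (the arrays and the `_demo` names are hypothetical and not part of the project). It shows that 100 rows split 80/20 and that repeating the call with the same seed gives the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with 2 features each, plus alternating 0/1 labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_train_demo.shape, X_test_demo.shape)

# Same seed, same split: the test rows are identical on a second call.
_, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test_demo, X_test_again)
print(same_split)
```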
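Train and evaluate an AdaBoostClassifier:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
# Fit on the training split with default hyperparameters
abc = AdaBoostClassifier()
abc.fit(X_train, y_train)
# Score on the held-out test split; accuracy_score returns a fraction between 0 and 1
abc_pred = abc.predict(X_test)
abc_accuracy = accuracy_score(y_test, abc_pred)
print(f"AdaBoostClassifier Accuracy: {abc_accuracy}")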
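Train and evaluate a GradientBoostingClassifier:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Fit on the training split with default hyperparameters
gc = GradientBoostingClassifier()
gc.fit(X_train, y_train)
# Score on the held-out test split
gc_pred = gc.predict(X_test)
gc_accuracy = accuracy_score(y_test, gc_pred)
print(f"GradientBoostingClassifier Accuracy: {gc_accuracy}")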
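Train and evaluate a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Fit on the training split with default hyperparameters
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# Score on the held-out test split
rf_pred = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print(f"RandomForestClassifier Accuracy: {rf_accuracy}")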
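Since all three classifiers follow the same fit, predict, and score pattern, the comparison can also be written as a single loop. The sketch below is only an illustration: it runs on synthetic data from `make_classification` rather than the diabetes dataset, and the fixed `random_state` values are an addition for repeatability, so its accuracies will not match the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-feature binary classification data (a stand-in, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# Fit each model on the training split and score it on the test split.
accuracies = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name} Accuracy: {accuracies[name]:.2%}")
```

Keeping the models in a dictionary makes it easy to add or remove a classifier without duplicating the evaluation code.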
Use the trained GradientBoostingClassifier to make predictions on new data:
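import numpy as np
# One sample, with values in the same column order as X_train (assumed here to be BMI, then glucose level)
input_data = np.array([[25.19, 140]])
# predict returns an array with one predicted label per input row
prediction = gc.predict(input_data)
print(f"Prediction: {prediction}")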
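When the model was trained on a DataFrame, passing new data as a DataFrame with the same column names avoids relying on positional order. The following is a hedged sketch on synthetic data: the column names `bmi` and `blood_glucose_level`, the made-up labeling rule, and the `_demo` variables are all assumptions for illustration. It also shows `predict_proba`, which returns class probabilities instead of a hard label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic training data; the label rule below is invented for the demo only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "bmi": rng.uniform(18, 40, size=200),
    "blood_glucose_level": rng.uniform(80, 300, size=200),
})
y_demo = (X_demo["blood_glucose_level"] > 180).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)

# New sample as a one-row DataFrame with the same column names as the training data.
new_patient = pd.DataFrame([{"bmi": 25.19, "blood_glucose_level": 140}])

label = model.predict(new_patient)[0]            # predicted class label
proba = model.predict_proba(new_patient)[0, 1]   # estimated probability of class 1
print(f"Predicted label: {label}, probability of class 1: {proba:.3f}")
```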
Accuracy on the test set:
- AdaBoostClassifier Accuracy: 94.57%
- GradientBoostingClassifier Accuracy: 94.57%
- RandomForestClassifier Accuracy: 92.59%
GradientBoostingClassifier and AdaBoostClassifier tied for the highest test accuracy at 94.57%, ahead of RandomForestClassifier at 92.59%. These results suggest that glucose level and BMI alone carry enough signal to predict diabetes status with reasonable accuracy on this dataset.
To run the project locally:
- Clone the repository:
git clone https://github.com/Ehtisham33/Diabetes-Prediction.git
- Navigate to the project directory:
cd Diabetes-Prediction
- Install the required libraries:
pip install -r requirements.txt
- Run the project:
python main.py
Contributions are welcome! Please create a pull request or open an issue to discuss any changes.
This project is licensed under the MIT License.