In this project, we use machine learning to predict whether a patient has diabetes.
According to the World Health Organisation, over 460 million people suffer from diabetes globally, and about 70 percent of them, especially in Africa, do not know they have it. While diabetes is an easily preventable disease and does not have to be fatal when diagnosed early, so many people in Africa do not know their health status that 77% of recorded diabetes-associated deaths occur in Africa. Some of the reasons associated with this include:
- Lack of early information about the disease
- Busy schedules, so people do not take time out to get tested
- A shortage of doctors, nurses and other health practitioners compared with the WHO-recommended doctor-to-patient ratio of 1:600
- Lack of access to health facilities for people in remote places
We propose a machine learning solution that predicts over the web whether an individual has diabetes, based on symptoms they are experiencing (such as polyphagia, polydipsia and weakness) and demographic information (such as age and gender). Because the solution is packaged as a web app, it bridges the accessibility gap.
In the future, it can be repackaged as an API and used over USSD to further improve accessibility.
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment,
│                         e.g. generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── notebooks      <- Jupyter notebooks.
│   │
│   └── scripts        <- Scripts to download or generate data
│       ├── make_dataset.py
│       │
│       ├── modelling  <- Scripts to train models and then use trained models
│       │                 to make predictions
│       │
│       ├── preparation <- Scripts to turn raw data into features for modeling
│       │
│       └── test
│
└── config.txt         <- tox file with settings for running tox; see tox.readthedocs.io
The dataset was obtained from the Pima Indian diabetes dataset that is freely available on Kaggle. The data has 17 columns, namely:
The dataset also has n_rows rows and is publicly available; please refer to [Kaggle](link).
For this project, the road map was to:
- Perform minimal feature engineering to determine which features to include in the model pipeline
- Try out different machine learning models: decision trees, support vector machines, etc. (Weiss et al.)
- Deploy the model
During preprocessing, rows with missing values were removed. Categorical values were also transformed using a one-hot encoder.
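As a minimal sketch of this step (on a hypothetical mini-sample; the actual column names come from the Kaggle dataset), dropping missing rows and one-hot encoding with pandas could look like:

```python
import pandas as pd

# Hypothetical mini-sample; the real columns come from the Kaggle dataset.
df = pd.DataFrame({
    "age": [45, 33, None, 51],
    "gender": ["Male", "Female", "Male", "Female"],
    "polydipsia": ["Yes", "No", "Yes", "No"],
    "class": ["Positive", "Negative", "Positive", "Negative"],
})

df = df.dropna()  # remove rows with missing values
df = pd.get_dummies(df, columns=["gender", "polydipsia"])  # one-hot encode
```

The same effect can be achieved with scikit-learn's `OneHotEncoder` inside a pipeline, which is preferable when the encoder has to be reused at inference time.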
Features were engineered so that the features that best describe the dataset were used. New columns such as policy duration were added. Polynomial relationships between the columns were also computed, after which a correlation was run between all the columns, and the columns most correlated with the target were kept. The columns used for model training were as listed.
Some charts were produced to better understand the data. These include the age distribution of the users.
A bar chart was also plotted to see their best products.
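A headless sketch of such EDA charts; the ages and class counts below are hypothetical stand-ins, not the project's data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the Age column and the class counts.
rng = np.random.default_rng(1)
ages = rng.integers(20, 80, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=12)          # age distribution histogram
ax1.set_title("Age distribution")
ax2.bar(["Positive", "Negative"], [120, 80])  # class-balance bar chart
ax2.set_title("Class balance")
```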
For our project we tried:
- Logistic Regression: as the baseline
- Support Vector Machine: to visualize the kernel support and feature engineering
- XGBoost: for state-of-the-art results on tabular data
We also performed a grid search on our models to select the best hyperparameters, and evaluated them using cross-validation.
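The grid-search-with-cross-validation step can be sketched with scikit-learn's `GridSearchCV`; the data and parameter grid below are illustrative assumptions, not the grids actually used in the notebooks:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    cv=5,                                   # 5-fold cross-validation
    scoring="f1",
)
grid.fit(X, y)
```

`grid.best_params_` and `grid.best_score_` then hold the winning hyperparameters and their cross-validated score.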
For the evaluation of our models, the best metric is []. Looking at the confusion matrix, there is a greater penalty on false negatives compared to false positives: for every false negative, the company would have to [...], which is more costly than [...] for a false positive. We therefore evaluate our models primarily by the false-negative rate, followed by the F1-score, which seeks to maintain a good balance between precision and recall.
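These quantities fall straight out of the confusion matrix; with hypothetical predictions purely for illustration:

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Hypothetical labels/predictions to illustrate the metrics.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# sklearn's binary confusion matrix ravels as (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)  # tp / (tp + fn): penalises false negatives
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
```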
From the experiments carried out, using the best hyperparameters found with our grid search, the results showed that the best model for our dataset is [name].
Here we will discuss the various technologies and techniques used to deploy the model.
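The README does not pin down the web framework, so as one illustrative sketch only (Flask, with a toy rule standing in for the trained model; both the framework choice and the rule are assumptions, not the project's actual code), a prediction endpoint could look like:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Toy stand-in for the trained pipeline; a real app would load a
# pickled model instead. The rule below is purely illustrative.
def predict_one(features):
    return "Positive" if features.get("polydipsia") == "Yes" else "Negative"

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body of feature name -> value.
    features = request.get_json()
    return jsonify({"prediction": predict_one(features)})
```

A client would then POST the symptom/demographic fields as JSON and read the prediction from the response body.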
To get started and set up the project in your local environment, please install the packages listed in requirements.txt.
You can install them from the terminal using
- pip: `pip install -r requirements.txt`
- or conda: `conda install --file requirements.txt`
- Download Jupyter Notebook or JupyterLab. For Linux or Mac users: `pip install notebook`
  For Windows users, you can download it from the Jupyter homepage.
- Clone the repo: `git clone https://github.com/Ajalamarvellous/autolearn.git`
- Install the necessary packages: `pip install -r requirements.txt`
- Open your JupyterLab or Notebook
- Go to the folder 📂 where you just downloaded the project
- Open the Untitled.ipynb notebook 📔 there, and you are ready to rumble.
To try out the deployed page, you can visit it here.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
- Ajala, Marvellous - @madeofajala - [email protected]
Project Link: https://github.com/ajalamarvellous/Diabetes-risk-factor