gimseng / 99-ml-learning-projects
A list of 99 machine learning projects for anyone interested in learning from coding and building projects
License: MIT License
This issue is especially for Hacktoberfest participants
How different algorithms give different results when implemented on a single dataset
Implement different ML algorithms, such as logistic regression, random forest, and XGBoost, on the Employee Attrition dataset.
Random forest, feature extraction, SVM, logistic regression, etc.
Predict employee attrition from the given data about an employee's past history. This dataset is a modified version of the IBM Employee Analytics Dataset.
Implement different data preprocessing techniques and algorithms; feel free to use your creativity. Add your solution with an explanation in comments, using the names of the models as the filename, and include a short description of your solution in the solution readme, covering the techniques you used and the model accuracy.
Perform CI (pytest) for code reviews. I am not entirely sure how to do that for Jupyter notebooks (.ipynb), but from reading around the internet it seems that converting the notebook to a .py file and catching errors there is the way to go.
Goal: Learn how to use linear regression
Packages needed: sklearn.linear_model.LinearRegression
Idea and task:
Obtain an interesting dataset with a fairly linear relationship, e.g. GDP vs. Happiness Index, crop yield vs. rainfall, etc.
Apply Linear Regression.
Bonus and extra credit: understanding outliers, multi-dimensional regressions.
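To make the task concrete, here is a minimal sketch of the exercise using scikit-learn, with a made-up crop-yield-vs-rainfall dataset standing in for real data (all numbers are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: crop yield (tons/ha) vs. rainfall (mm); the true slope
# and noise level are arbitrary choices for illustration
rng = np.random.default_rng(0)
rainfall = rng.uniform(200, 1000, size=50).reshape(-1, 1)
yield_t = 0.005 * rainfall.ravel() + rng.normal(0, 0.3, size=50)

model = LinearRegression()
model.fit(rainfall, yield_t)
print(model.coef_[0], model.intercept_)
print(model.score(rainfall, yield_t))  # R^2 on the training data
```

In a real solution one would of course load an actual dataset and inspect it before fitting.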
For .py files, we need a requirements file and instructions on how to install dependencies. We may also need a virtual environment or a Dockerfile for deployment.
For Jupyter notebooks, should we have some sort of standardised header code to install a standard version of the libraries? Also, if we recommend Google Colab, we don't have to worry about Jupyter notebook versions and dependencies; otherwise, we should think about that.
An in-depth exercise to explore and learn the different aspects/hyperparameters of decision trees, preferably using scikit-learn.
A basic understanding of decision trees is assumed, though this exercise is supposed to go into more detail on how to use and optimize them.
I'm agnostic about the data source, as long as it's useful for learning/teaching the method.
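As a sketch of what the exercise might cover, the snippet below varies `max_depth` on the Iris dataset, which is just a stand-in since the issue leaves the data source open:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Hyperparameters worth exploring: max_depth, min_samples_leaf, criterion
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=2, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_te, y_te))
```

A fuller exercise could also sweep `min_samples_leaf` and `criterion`, and plot test accuracy against depth to show under/overfitting.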
Hacktoberfest has made participation in the event opt-in for maintainers.
https://hacktoberfest.digitalocean.com/hacktoberfest-update
@gimseng To get contributions for hacktoberfest you would need to add a 'hacktoberfest' topic in the about section.
Refer to the above guide for more info.
Anyone interested in maintaining and developing this repo as part of the core team? Please comment below.
For now, the projects are assigned numbers on a first come, first served basis. If we could categorize them now, it would be much easier once we have a lot of projects. For example, Linear Regression should come before Titanic. The least we could do is classify projects as Machine Learning, Deep Learning, NLP, etc.
Is your feature request related to a problem? Please describe.
CIFAR10 is one of the basic datasets in machine learning. If you have already worked with the MNIST dataset, CIFAR is where to move next to get familiar with images that have 3 channels (colour images).
Describe the solution you'd like
I want to include an exercise, along with a basic solution implemented in PyTorch.
Please let me know whether I should make a PR for this.
Work with TensorFlow and image data, and implement different models (e.g. VGG16, VGG19, ResNet).
The dataset contains parasitized and uninfected cells from thin blood smear slide images of segmented cells. Here, a VGG16 model is used to classify the cells as infected or uninfected.
Tensorflow/Keras, Transfer Learning
This dataset is simple and interesting enough to learn to implement different CNN architectures
The Malaria dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells from the thin blood smear slide images of segmented cells.
https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria
I have a solution using the VGG19 model in TensorFlow, and will be happy to create a pull request and then implement other models on this dataset.
Learn the gradient descent algorithm using numpy and matplotlib.
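A minimal version of the exercise might look like the following, fitting a line to synthetic data with hand-written gradient descent (the data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

# Synthetic data for y = 3x + 0.5 with a little noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * x + b - y             # residuals
    w -= lr * 2 * np.mean(err * x)  # dL/dw for mean squared error
    b -= lr * 2 * np.mean(err)      # dL/db
print(w, b)
```

The exercise would then use matplotlib to plot the loss curve and the fitted line, which this sketch omits.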
This exercise helps classify pictures as cat or non-cat with the help of a 2-layer neural network.
Must know the basics of logistic regression.
This is the solution to the Neural Networks exercise on Coursera. It uses a custom dataset.
I have the solution notebook for this exercise. I will be happy to create pull request to include the exercise.
Found this exercise here
Since this repo is aimed at people trying to learn machine learning, I think it would be helpful if subtasks were added in exercise 001, especially in regards to data analysis and feature engineering.
For example:
Since it can be a little overwhelming at the start, providing some sort of outline could be helpful.
If the above steps are too specific, it could be a little more broad to allow the person to think by themselves.
A bonus section could also be added, for those who want to go the extra mile.
Create more specific fork-clone-... steps for our project. A step-by-step guide on how to do that and contribute a new exercise and/or solution will be useful to new git/GitHub users.
Hello, I have an automation project: LinkedIn Automation. It automatically logs in and goes to the My Network page, then starts making connections with the suggested users. It runs in a while loop and makes connections at timed intervals.
Could I add it to 99 Machine Learning Projects?
A good checklist of todos relating to ideas/suggestions will be useful.
Part 1:
Apply different decision trees to train a model for detecting breast cancer using the breast-cancer-wisconsin-diagnostic dataset (scikit-learn 7.2.7, Breast cancer wisconsin (diagnostic) dataset).
The goal is to predict whether a breast cancer is malignant or benign.
Part 2:
Apply various transformations, imputers, and encoders/scalers using Pipelines with DecisionTreeClassifier. Work with grid search to find the best parameters. The goal is to predict whether income exceeds $50K/yr based on census data.
DecisionTreeClassifier
Pipeline
SimpleImputer
StandardScaler
OneHotEncoder
ColumnTransformer
GridSearchCV
Part 1:
569 instances with 30 numeric attributes. Class distribution: 212 - Malignant, 357 - Benign
Follow the link below for the full description of the dataset.
https://scikit-learn.org/stable/datasets/#breast-cancer-wisconsin-diagnostic-dataset
Part 2:
income.csv is used for the training set:
32,561 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
income_test.csv is used for testing and reporting scores:
15,315 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
Goal is to predict whether income exceeds $50K/yr based on census data.
Link: http://archive.ics.uci.edu/ml/datasets/Adult
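The Part 2 workflow described above can be sketched roughly as follows. The tiny DataFrame here is a made-up stand-in for the census data, with illustrative column names and values:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for income.csv (columns and labels are invented)
df = pd.DataFrame({
    "age": [25, 38, None, 52, 46, 29, 61, 33],
    "hours-per-week": [40, 50, 45, 60, 40, 38, 20, 55],
    "workclass": ["Private", "Private", "State-gov", "Private",
                  "Self-emp", "Private", "Self-emp", "State-gov"],
})
y = [0, 1, 0, 1, 1, 0, 0, 1]

numeric = ["age", "hours-per-week"]
categorical = ["workclass"]

# Impute + scale numerics, one-hot encode categoricals
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipe = Pipeline([("pre", pre), ("clf", DecisionTreeClassifier(random_state=0))])
grid = GridSearchCV(pipe, {"clf__max_depth": [2, 3, None]}, cv=2)
grid.fit(df, y)
print(grid.best_params_)
```

With the real income.csv, one would expand the parameter grid and evaluate on income_test.csv.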
This exercise was assigned in the machine learning course at Aristotle University of Thessaloniki, and the solution was my submission for it.
Is your feature request related to a problem? Please describe.
It seems like a very time-consuming task for one person to write the exercise statement, write the code, test the code, and polish up all of the above.
Describe the solution you'd like
I am very impressed by the quality of freeCodeCamp, and recently delved into their process for creating an exercise. It is very similar to what we are doing.
It involves a few stages of creating an exercise. Roughly: someone starts an exercise with some code; someone else jumps in afterwards to polish the instruction text; someone later on tests and tries to break the code, giving feedback to the first two stages. Repeat until convergence.
Check out their project board: https://github.com/orgs/freeCodeCamp/projects/10
I could implement this in our project board. On top of that, we could follow the discord model of https://www.reddit.com/r/learnmachinelearning/comments/hthfds/completed_3_projects_with_100_data_scientists/ to have more in-depth discussions among those involved in this step, with a channel dedicated to each particular exercise.
I realize that someone who's not in the loop who stumbles across this repo might not know what to do with it. Are they supposed to:
(a) fork it and just read through python codes or
(b) actively contribute exercises or
(c) be maintainers(?)
(d) do nothing?
The Naive Bayes algorithm is one of the simplest yet most powerful ML algorithms out there, often used as a baseline for text-based classification. Understanding how this algorithm works will help in understanding:
The objective of this exercise is to implement the Naive Bayes algorithm using Python 3 and numpy. The dataset to be used is Haberman's Survival Dataset.
You must have a basic understanding of what machine learning is, what supervised and unsupervised learning are, what classification is, etc. Knowledge of Python 3 and numpy is also required.
Haberman's Survival Dataset is a dataset containing cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. Click here to learn more about this dataset.
To understand the Naive Bayes Algorithm, check out this wonderful blog by ShatterLine
Check out this amazing medium article by Pratik Mirjapure to understand the Haberman's Survival Dataset better.
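A from-scratch Gaussian Naive Bayes along the lines the exercise asks for might look like this; the toy data below is synthetic, not the Haberman file itself:

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class mean, variance, and prior -- all a Gaussian NB needs."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return stats

def predict_gnb(stats, X):
    scores = []
    for c, (mu, var, prior) in stats.items():
        # log P(x|c) under independent Gaussians, plus log prior
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        scores.append(ll + np.log(prior))
    return np.array(list(stats))[np.argmax(scores, axis=0)]

# Two well-separated synthetic classes as a sanity check
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
stats = fit_gnb(X, y)
print((predict_gnb(stats, X) == y).mean())
```

The actual exercise would load Haberman's data and evaluate on a proper train/test split.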
Learn the concept of KNN algorithm for classification.
This exercise helps in classifying the likelihood that a person will take out an insurance policy for his/her vehicle, based on different factors including age, gender, having health insurance, etc.
Must know basic statistics, a programming language, and the concepts of scikit-learn.
This data is available on Kaggle, a task open for anyone interested to take up.
I have a solution notebook for this using KNN available on my Kaggle profile and would love to share my contribution by creating a pull request.
This task is available here
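A rough sketch of the kNN workflow, using synthetic data as a stand-in for the Kaggle insurance dataset (which is not bundled here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the insurance features
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling matters for kNN, since it is distance-based
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))
```

A fuller solution would also sweep `n_neighbors` and compare distance metrics.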
One of the most important components of a neural network is the learning algorithm it uses.
For most beginners, it is like a black box.
I am planning to contribute code for these learning algorithms. These will include:
Learn the basics of SVM and how to optimize/fine-tune its hyperparameters.
Data were extracted from images taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400×400 pixels. Due to the object lens and the distance to the investigated object, grayscale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.
The dataset is available in the UCI repository.
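A hedged sketch of SVM hyperparameter tuning; synthetic data stands in for the banknote dataset, and the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 4-feature data, mirroring the banknote dataset's 4 wavelet features
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# C and gamma are the two hyperparameters that matter most for an RBF SVM
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```

With the real UCI data, one would also scale the features before fitting the SVM.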
Goal: Learn one of the most important parts of doing an ML project: data wrangling. Then perform basic classification-based supervised learning.
What and why: Predict the survival of the passengers of the Titanic. We will be doing this to get hands-on experience with classification problems and an introduction to solving various ML problems. It might even give you a head start on Kaggle.
Apply Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes and much more
Bonus and extra credits : Understanding regularization, model selection, confusion matrix and cross validation.
Describe the bug
Data paths are not correct for Google Colab use. Users either have to download the data and then upload it to their Google Drive, or the data sources should point to the correct data in our GitHub repo.
Proposed Solution
Either point all our data sources to our GitHub links, and/or provide docs on how to upload files to Google Drive and mount Google Drive in Google Colab.
In this exercise, one learns how to analyse the sentiment of text data using various conventional and advanced algorithms, along with text preprocessing techniques.
This exercise goes from basic methods to advanced ones, so there are no hard prerequisites. It is recommended, though, that one knows the basic ML workflow to grasp things conveniently.
Data has been taken from various sources and public datasets available.
I will update this later.
I have a solution and I will be adding a PR for the same. However, one is welcome:
I will update this later with various references.
As stated in the title: if this is copied/scraped from somewhere, provide the source. At a minimum, provide a description.
@Rajwrita Maybe you would be able to help. Thanks !
Learn different methods of regularizing models. This could be as basic as L1/L2 (ridge/lasso) regularization, more sophisticated schemes in other methods (like the regularization hyperparameters in SVM), or even dropout in neural networks.
Feel free to use any data and models as you see fit.
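For the L1/L2 case, a small comparison like the following could seed the exercise; the data is synthetic, with only two informative features by construction:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Toy data where only 2 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, 100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # Lasso drives irrelevant coefficients to exactly zero; Ridge only shrinks them
    print(type(model).__name__, np.round(model.coef_, 2))
```

Printing the three coefficient vectors side by side makes the sparsity-vs-shrinkage contrast visible at a glance.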
The notebook seems to have some missing code at the end regarding distributing positive and negative reviews.
@tejasvi541 Could you take a look, since you last worked on it? Please fork the current/updated master file and add your code to the end, if you have it. Thanks!
For people who want not just to create a machine learning model but also an interactive dashboard to go with it, this exercise might be a great starting point!
I will provide a sample machine learning project along with a small tutorial about how to convert it to an interactive project using streamlit and heroku.
Working knowledge of python and sklearn should suffice as a good prerequisite.
Here is a link to the Motivation Generator Dashboard that I created!
https://mymotivationalapp.herokuapp.com/
This kind of application will be the result of this exercise. What do you guys think?
Following PR #47, one should provide a visualization of the data. I have rephrased my previous suggestions below:
Add, in either or both of the exercise and solution, a plot of the data before running the model, preferably right after the data is loaded. It is useful to list a few data points (in pandas, that's the head method) and/or plot a graph. That way, learners understand how/why we are using linear regression, instead of just blindly running the model.
After training (20,000 epochs!), plot the data with the model predictions from linear regression. Optionally, as a bonus, plot a few intermediate predictions, say one after 1,000 epochs, one after 10,000, and one after 20,000. This illustrates how the fit gets better and better (and if it doesn't, we should have stopped much earlier in the training process).
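A minimal sketch of the suggested plotting flow, with synthetic data, a closed-form fit standing in for the trained model, and an arbitrary output file name:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical 1-D data, plotted before and after fitting
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = 1.5 * x + 2 + rng.normal(0, 1, 40)

model = LinearRegression().fit(x.reshape(-1, 1), y)

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")  # inspect the data before modelling
xs = np.linspace(0, 10, 100)
ax.plot(xs, model.predict(xs.reshape(-1, 1)), "r-", label="fit")
ax.legend()
fig.savefig("linreg_fit.png")
```

In the actual exercise, the intermediate-epoch snapshots could be added as extra `ax.plot` calls with different labels.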
Need a first pass/draft of a code of conduct. It will probably be modified appropriately once the core team is more organized.
Expanding on issue #26, it may be worth adding multiple solutions written in both TF/Keras and PyTorch. I would be happy to work on solutions to the current exercises.
Add more detailed descriptions and provide credit to the source of the housing dataset in exercise 002. Please do so in /exercise/readme.md and/or in the exercise notebook.
OpenAI has generated lots of hype with GPT-3. In some ways it's like a GAN, but for NLP: generating 'fake' text from primes and prompts.
The goal of the exercise is to teach learners how to use the model, rather than how to train/build it.
Pick some prompts/primes from recent controversial topics and generate text. For example, it could be about masks, police brutality, etc.
Some familiarity with using pretrained models. Reading GPT-2 docs will be useful.
No data needed as we are using a generative pretrained model.
A template for readme page for each exercise and solution which would also contain links to quality resources on the topic in the solution readme. This would make learning more systematic and help people revise a topic before starting the project.
Learn the kNN algorithm for supervised classification. Preferably use the kNN implementation from scikit-learn.
Some basics of kNN will be assumed. If scikit-learn is used, knowing how to install the scikit-learn library is assumed.
I'm agnostic about which dataset to use, so anything suggested in a textbook exercise/blog is good.
I'm currently doing Andrew Ng's famous ML course on Coursera (https://www.coursera.org/learn/machine-learning/home/welcome) and found this neat GitHub repo which provides Jupyter notebooks for students who prefer to use Python for the programming assignments rather than Octave/MATLAB (since Python is much more applicable these days, though the course is unparalleled in teaching the theory/fundamentals, in my opinion). The Jupyter notebooks also include excellent documentation and notes from the lectures.
https://github.com/dibgerge/ml-coursera-python-assignments
Since this is such a popular course, I was going to suggest including it as a submodule to this repo, and I can contribute my Python solutions to the programming assignments for other students to refer to. What do you guys think?
In particular, help a learner to learn full-batch gradient descent, mini-batch stochastic gradient descent, etc.
It could be done on linear regression, a simple neural network, etc.
It should focus on understanding the differences between the gradient descent approaches: their pros and cons, how fast or slowly they learn, how dataset size matters, etc.
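A possible starting point for such an exercise: the same numpy update rule run full-batch and mini-batch on synthetic data (the batch size, learning rate, and epoch count are illustrative):

```python
import numpy as np

# Synthetic 1-D regression data: y = 4x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 4 * X[:, 0] + rng.normal(0, 0.1, 200)

def sgd(batch_size, epochs=50, lr=0.1):
    """Gradient descent on MSE; batch_size == len(X) gives full-batch GD."""
    w = 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))  # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * np.mean((w * X[b, 0] - y[b]) * X[b, 0])
            w -= lr * grad
    return w

print(sgd(len(X)))  # full-batch: one smooth update per epoch
print(sgd(32))      # mini-batch: noisier steps, but more updates per epoch
```

Extending this with per-step loss logging would let learners plot and compare the two trajectories directly.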
1.a. Curate a list of GitHub-hosted code examples/exercises. Off the top of my head, I can think of joelgrus and ageron. Both are fantastic resources.
b. Since these other resources are great open-source code, perhaps we could incorporate some of them (with appropriate credit and source links, of course).
2.a. Those textbook-to-GitHub-hosted-code examples are typically curated and maintained by the writers (with some contributions from everyone). We, however, are mainly community-driven, so there might be a difference in how we approach exercises. Perhaps a feature (or a bug?) for us is that we might start with a solution that is very rough, poorly styled, and full of bugs, but the hope is that we can slowly polish it as a community, learn good practice and style, and eventually reach a great code solution. It is a bit of a learning journey through the code, rather than trying to get perfect code from the get-go.
b. Moreover, I'd really like this project to have a large pool of contributors, from different levels and backgrounds. Since none of us is a master of most things, this could give the project a large scope. I envision that we might branch into a few main umbrellas of either topics or coding expertise, such that it will transcend any particular textbook or field.
It seems we have a TensorFlow solution; it would be nice to have a PyTorch implementation of a simple neural network for survival classification.
Starting with random forest, AdaBoost, etc., up to the popular LightGBM and XGBoost, expose learners to a variety of bagging/boosting/ensemble/stacking techniques.
This could be a series of 4-5 exercises.
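As a seed for such a series, a quick comparison like this could open the first exercise; the dataset is synthetic and the model settings are scikit-learn defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# One synthetic dataset, several bagging/boosting ensembles
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
            AdaBoostClassifier(random_state=0),
            GradientBoostingClassifier(random_state=0)):
    print(type(clf).__name__, cross_val_score(clf, X, y, cv=3).mean())
```

Later exercises in the series could swap in LightGBM and XGBoost, which live in their own packages.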
As the number of exercises grows, we might want to restructure things a little to categorize them properly. For example, maybe intermediate exercises should take numbers 300 and above, and advanced/state-of-the-art ones 900 and above, or something like that. Then within each range of numbers, different topics could take different numberings.
Some exercises and solutions based on using Hugging Face would be cool. This is an advanced project. For example, use the library to perform BERT or GPT-2 analysis on some fun NLP-related data.
Learn to preprocess images, use a new mobile neural network architecture, learn tensorflow.
Apply a MobileNetV2 model to detect whether someone is wearing a mask from live video.
Python
Mask images from GitHub.
I have the solution in scikit-learn, and will be happy to share it.