99-ml-learning-projects's Issues

[IMP] Implement Basic ML Algorithms on an Employee Attrition Dataset

This issue is especially for Hacktoberfest participants

Learning Goals

How different algorithms give different results when implemented on a single dataset

Exercise Statement

Implement different ML algorithms, such as logistic regression, random forest, and XGBoost, on the Employee Attrition dataset.

Prerequisites

Random-forest model, feature extraction, SVM, logistic Regression, etc

Data source/summary:

Predict employee attrition from data about each employee's past history. This dataset is a modified version of the IBM Employee Analytics Dataset.

(Optional) Suggest/Propose Solutions

Implement different data preprocessing techniques and algorithms. Feel free to use your creativity. Add your solution with an explanation in comments, name the file after the models used, and include in the solution readme a short description of your solution, such as the techniques you used and the model accuracy.

Create unit test / CI

Set up CI (pytest) for code reviews. I am not entirely sure how to do that for Jupyter notebooks (.ipynb), but from reading around the internet, it seems that converting the notebook to a .py file and catching any errors is the way to go.
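One way the notebook check could look is sketched below. It is not an existing CI setup for this repo: it uses only the standard library, treats the .ipynb file as JSON, and compiles each code cell to catch syntax errors. The function name `notebook_syntax_errors` and the demo notebook are hypothetical, and dedicated tools such as nbconvert may be a better fit in practice.

```python
import json


def notebook_syntax_errors(notebook_json):
    """Return one message per code cell that fails to compile."""
    notebook = json.loads(notebook_json)
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a list of lines or one string.
        if isinstance(source, list):
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            compile("\n".join(lines), f"<cell {index}>", "exec")
        except SyntaxError as exc:
            errors.append(f"cell {index}: {exc.msg} (line {exc.lineno})")
    return errors


# Hypothetical two-cell notebook: the second cell has an unclosed parenthesis.
demo = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["%matplotlib inline\n", "x = 1\n"]},
        {"cell_type": "code", "source": "print(x"},
    ]
})
print(notebook_syntax_errors(demo))
```

A pytest test could then read each notebook in the repo and assert that the returned list is empty. This only catches syntax errors; runtime errors would need the notebook to be executed.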

Idea for exercise (linear regression)

Goal: Learn how to use linear regression

Packages needed: sklearn.linear_model.LinearRegression

Idea and task:

  • Obtain an interesting dataset with a roughly linear relationship, e.g. GDP vs. Happiness Index or crop yield vs. rainfall.

  • Apply Linear Regression.

  • Bonus and extra credits : Understanding outliers, multi-dimensional regressions
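As a starting point for the task above, here is a minimal sketch of fitting `sklearn.linear_model.LinearRegression`. The data is synthetic (a noisy line with slope 3 and intercept 2, chosen for illustration) and stands in for whichever real dataset the exercise ends up using.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a real dataset: y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(X, y)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2 on training data:", model.score(X, y))
```

The fitted slope and intercept should land close to the values used to generate the data, which gives learners a quick sanity check before they move to real data.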

Dependencies and libraries

For .py files, we need a requirements file and instructions on how to install from it. We may also need a virtual environment or a Dockerfile.

For Jupyter notebooks, should we have some sort of standardised header code that installs standard versions of the libraries? Also, if we recommend Google Colab, then we don't have to worry about Jupyter notebook versions and dependencies; otherwise, we should think about that.

[EXE] An exercise to learn decision tree

Learning Goals

An in-depth exercise to explore and learn the different aspects/hyperparameters of decision trees. Preferably using scikit-learn.

Prerequisites

A basic understanding of decision trees, though this exercise is supposed to go into more detail on how to use and optimize a decision tree.
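To illustrate what exploring a hyperparameter might look like, here is a minimal sketch that varies `max_depth` on scikit-learn's bundled breast cancer dataset. The dataset and the choice of `max_depth` are assumptions made for the example; the issue itself leaves both open.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

results = {}
for depth in (1, 2, 4, 8, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing the train and test columns is a simple way to see where deeper trees start to overfit. The same loop could be repeated for `min_samples_leaf` or `criterion`.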

Data source/summary:

I'm agnostic to data source, as long as it's useful to learn/teach the method.

Recruiting for core

Is anyone interested in maintaining and developing this repo as part of the core team? Please comment below.

[EXE] CIFAR10 Machine Learning Project

Is your feature request related to a problem? Please describe.
CIFAR10 is one of the basic datasets in machine learning. If you have already worked with the MNIST dataset, CIFAR10 is where to move next to get familiar with images that have 3 channels (coloured images).

Describe the solution you'd like
I want to include an exercise and a basic solution implemented in PyTorch.
Please let me know whether I should make a PR for this.

[EXE] Malaria Detection

Learning Goals

Work with TensorFlow and image data, and implement different models on it (e.g. VGG16, VGG19, ResNet).

Exercise Statement

The dataset contains segmented cell images, parasitized and uninfected, from thin blood smear slides. Here a VGG16 model is used to classify the cells as infected or uninfected.

Prerequisites

TensorFlow/Keras, transfer learning

Data source/summary:

This dataset is simple and interesting enough for learning to implement different CNN architectures.
The Malaria dataset contains a total of 27,558 cell images, with equal numbers of parasitized and uninfected cells, from thin blood smear slide images of segmented cells.
https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria

(Optional) Suggest/Propose Solutions

I have a solution using the VGG19 model in TensorFlow and will be happy to create a pull request. I will then implement other models on this dataset.

[EXE] Logistic Regression as a Neural Network- Basic Deep Learning Algorithms

Learning Goals

Learn the gradient descent algorithm using numpy and matplotlib.

Exercise Statement

This exercise classifies pictures as cat or non-cat using a neural network with 2 layers.

Prerequisites

Must know the basics of logistic regression.

Data source/summary:

This is the solution to the Neural Networks Exercise in Coursera. It is a custom dataset.

(Optional) Suggest/Propose Solutions

I have the solution notebook for this exercise. I will be happy to create a pull request to include the exercise.

(Optional) Further Links/Credits to Relevant Resources:

Found this exercise here

[IMP] Adding subtasks in exercise 001

Since this repo is aimed at people trying to learn machine learning, I think it would be helpful if subtasks were added in exercise 001, especially in regards to data analysis and feature engineering.

For example:

  • data analysis
    • plot the survival rate of males and females
    • Survival rate based on age
    • etc..
    • finally, what observations do you make of the analysis?
  • feature engineering
    • dealing with missing values
    • dealing with categorical values
    • etc..
  • building the model
    • choose a model
    • fit the data
    • get the accuracy metrics
    • bonus: test on multiple models and compare accuracy

Since it can be a little overwhelming at the start, providing some sort of outline could be helpful.
If the above steps are too specific, it could be a little more broad to allow the person to think by themselves.

A bonus section could also be added, for those who want to go the extra mile.

more details on git docs

Create more specific fork-clone-... steps for our project. A step-by-step guide on how to do that and contribute a new exercise and/or solution will be useful to new git/GitHub users.

LinkedIn Automation Bot

Hello, I have an automation project, a LinkedIn Automation Bot. It logs in automatically and goes to the My Network page. Then it starts making connections with the suggested users. It runs in a while loop and makes connections at intervals.
Could I add it to 99 Machine Learning Projects?

[EXE] LinkedIn Automation Bot

Hello @gimseng, I have an automation project, a LinkedIn Automation Bot. It logs in automatically and goes to the My Network page. Then it starts making connections with the suggested users. It runs in a while loop and makes connections at intervals.
Could I add it to 99 Machine Learning Projects?

[EXE] pt1: Simple Decision Tree exercise, pt2: Pipelines

Learning Goals

Part 1:

  • Work with scikit-learn library, train-test set split, report different scores.
  • Decision Trees.

Part 2:

  • Work with Pipelines (with DecisionTrees), imputers, scalers and encoders.
  • Grid Search.

Exercise Statement

Part 1:
Apply different Decision Trees to train a model for detecting breast cancer using the breast-cancer-wisconsin-diagnostic-dataset (scikit-learn 7.2.7. Breast cancer wisconsin (diagnostic) dataset).
The goal is to predict whether breast cancer is Malignant or Benign.

Part 2:
Apply various transformations, imputers, encoders, and scalers using Pipelines with DecisionTreeClassifiers. Use grid search to find the best parameters. The goal is to predict whether income exceeds $50K/yr based on census data.
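A minimal sketch of how the Part 2 pieces could fit together is shown below. The DataFrame is a made-up stand-in for income.csv (its column names and values are illustrative only), and the parameter grid is an arbitrary small example rather than a recommended search space.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the census data; columns and values are made up.
df = pd.DataFrame({
    "age": [25, 38, np.nan, 52, 46, 31, 60, 23, 44, 35, 29, 57],
    "hours_per_week": [40, 50, 45, 60, 40, 35, 20, 30, 55, 40, 45, 50],
    "workclass": pd.Series(
        ["Private", "Private", "Gov", np.nan, "Self", "Private",
         "Gov", "Private", "Self", "Gov", "Private", "Self"],
        dtype=object,
    ),
    "high_income": [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
X = df.drop(columns="high_income")
y = df["high_income"]

# Numeric columns: fill missing values, then standardise.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill missing values, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "hours_per_week"]),
    ("cat", categorical, ["workclass"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step names prefix the parameter names: "<step>__<parameter>".
search = GridSearchCV(
    model,
    param_grid={"tree__max_depth": [1, 2, 3], "tree__min_samples_leaf": [1, 2]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the imputers and encoders inside the pipeline means they are refit on each cross-validation training fold, so no information leaks from the validation fold into preprocessing.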

Prerequisites

DecisionTreeClassifier
Pipeline
SimpleImputer
StandardScaler
OneHotEncoder
ColumnTransformer
GridSearchCV

Data source/summary:

Part 1:
569 instances with 30 numeric attributes. Class distribution: 212 - Malignant, 357 - Benign
Follow the link below for the full description of the dataset.
https://scikit-learn.org/stable/datasets/#breast-cancer-wisconsin-diagnostic-dataset

Part 2:
income.csv is used as the training set.
32561 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).

income_test.csv is used for testing and reporting scores.
15315 instances with 14 attributes, 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).

Goal is to predict whether income exceeds $50K/yr based on census data.
Link: http://archive.ics.uci.edu/ml/datasets/Adult

(Optional) Further Links/Credits to Relevant Resources:

This exercise was assigned in the machine learning course at Aristotle University of Thessaloniki, and the solution was my submission for it.

[FEA] Organise the exercise creation workflow

Is your feature request related to a problem? Please describe.

It seems very time consuming for one person to write the exercise statement, write the code, test the code, and polish all of the above.

Describe the solution you'd like
I am very impressed by the quality of freeCodeCamp and recently delved into their process for creating an exercise. It is very similar to what we are doing.

It involves a few stages of creating an exercise. Roughly, someone starts an exercise with some code. Someone else can jump in afterwards and polish the instruction text. Someone later on tests and tries to break the code, and gives feedback on the first two stages. Repeat until convergence.

Check out their project board: https://github.com/orgs/freeCodeCamp/projects/10

I could implement this in our project board. On top of that we could follow the discord model of https://www.reddit.com/r/learnmachinelearning/comments/hthfds/completed_3_projects_with_100_data_scientists/ to have more in depth discussions among those involved in this step, with a channel dedicated to a particular exercise.

How should learners use this project?

I realize that someone who's not in the loop who stumbles across this repo might not know what to do with it. Are they supposed to:

(a) fork it and just read through the Python code or

(b) actively contribute exercises or

(c) be maintainers(?)

(d) do nothing?

[EXE] Implementing naive bayes algorithm from scratch

Learning Goals

The Naive Bayes algorithm is one of the simplest yet most powerful ML algorithms out there, often used as a baseline for text-based classification. Understanding how this algorithm works will help in understanding:

  • Conditional probability
  • Bayes theorem
  • Laplace smoothing

Exercise Statement

The objective of this exercise is to implement the Naive Bayes algorithm using Python 3 and NumPy. The dataset to be used is Haberman's Survival Dataset.
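To make the learning goals concrete, here is a minimal from-scratch sketch of Naive Bayes with Laplace smoothing. It is an assumption-laden illustration, not the proposed solution: it uses plain Python rather than NumPy, it handles categorical features only (Haberman's numeric columns would first need binning, or a Gaussian variant), and the toy rows and labels are made up.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.n_features = len(rows[0])
        # value_counts[(class, feature index)][value] -> count
        self.value_counts = defaultdict(Counter)
        self.feature_values = [set() for _ in range(self.n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.value_counts[(label, j)][value] += 1
                self.feature_values[j].add(value)
        return self

    def log_posteriors(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior P(class)
            for j, value in enumerate(row):
                seen = self.value_counts[(label, j)][value]
                k = len(self.feature_values[j])
                # Laplace smoothing keeps unseen values from zeroing the product.
                score += math.log((seen + self.alpha) / (count + self.alpha * k))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_posteriors(row)
        return max(scores, key=scores.get)


# Made-up toy data: (age group, number of positive nodes) -> outcome.
rows = [("young", "few"), ("young", "few"), ("old", "many"),
        ("old", "many"), ("old", "few")]
labels = ["survived", "survived", "died", "died", "survived"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict(("young", "few")))
print(model.predict(("old", "many")))
```

Working in log space avoids underflow when many small conditional probabilities are multiplied, and the `alpha` term is where Bayes' theorem and Laplace smoothing meet in the code.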

Prerequisites

You must have a basic understanding of what machine learning is, what supervised and unsupervised learning are, what classification is, etc. Knowledge of Python 3 and NumPy is also required.

Data source/summary:

Haberman's Survival Dataset is a dataset containing cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. Click here to learn more about this dataset.

Further Links:

[EXE] Health Insurance Cross Sell Prediction

Learning Goals

Learn the concept of KNN algorithm for classification.

Exercise Statement

This exercise classifies how likely a person is to buy an insurance policy for his/her vehicle, based on factors including age, gender, having health insurance, etc.

Prerequisites

Must know basic statistics, a programming language, and the basic concepts of scikit-learn.

Data source/summary:

This data is available on Kaggle, as a task open to anyone interested in taking it up.

(Optional) Suggest/Propose Solutions

I have a solution notebook for this using KNN available on my Kaggle profile and would love to share my contribution by creating a pull request.

(Optional) Further Links/Credits to Relevant Resources:

This task is available here

[EXE] Implement Different learning algorithm from scratch with visualization

One of the most important components of a neural network is the learning algorithm it uses.
For most beginners, it is like a black box.
I am planning to contribute code for these learning algorithms. These will include:

  1. Gradient Descent
  2. Momentum based Gradient Descent
  3. Nesterov Accelerated Gradient Descent
  4. AdaGrad
  5. RMSProp
  6. Adam

It will be a .ipynb file with a complete implementation from scratch, with proper visualization and documentation, so that beginners can get a better understanding of these concepts.
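As a rough idea of what such from-scratch implementations could look like, the sketch below runs each listed update rule on a one-dimensional toy objective, f(x) = (x - 3)^2. The objective, learning rates, and step counts are illustrative assumptions, and visualization is left out.

```python
import math


def grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def gradient_descent(x, lr=0.1, steps=200):
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def momentum(x, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + lr * grad(x)  # accumulate a velocity
        x -= v
    return x


def nesterov(x, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point.
        v = beta * v + lr * grad(x - beta * v)
        x -= v
    return x


def adagrad(x, lr=0.5, eps=1e-8, steps=200):
    g2 = 0.0
    for _ in range(steps):
        g = grad(x)
        g2 += g * g  # running sum of squared gradients
        x -= lr * g / (math.sqrt(g2) + eps)
    return x


def rmsprop(x, lr=0.05, rho=0.9, eps=1e-8, steps=200):
    g2 = 0.0
    for _ in range(steps):
        g = grad(x)
        g2 = rho * g2 + (1 - rho) * g * g  # moving average of squared gradients
        x -= lr * g / (math.sqrt(g2) + eps)
    return x


def adam(x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


for optimizer in (gradient_descent, momentum, nesterov, adagrad, rmsprop, adam):
    print(f"{optimizer.__name__:>16}: x = {optimizer(0.0):.4f}")
```

All six should end near the minimum at x = 3. RMSProp with a constant learning rate tends to hover around the minimum rather than settle exactly on it, which is itself a useful thing for the notebook to visualize.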

[EXE] Logistic Regression from scratch

Learning Goals

  • Implement logistic regression from scratch using numpy

Exercise Statement

  • The exercise will focus on implementing a logistic regression model using a single sigmoid neuron.
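A single sigmoid neuron trained by gradient descent could be sketched as follows. This is an illustration under assumptions, not the proposed notebook: the data is two synthetic, well-separated clusters standing in for the banknote features, and the learning rate and epoch count are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)   # predicted probabilities
        error = p - y            # gradient of the log loss w.r.t. the pre-activation
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b


def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)


# Synthetic, well-separated stand-in for the banknote features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = train_logistic_regression(X, y)
accuracy = (predict(X, w, b) == y).mean()
print("training accuracy:", accuracy)
```

The `error = p - y` line is the whole backward pass for one sigmoid neuron with log loss, which is why this model is a common first step toward neural networks.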

Prerequisites

  • Basic knowledge of numpy, pandas and matplotlib
  • Anaconda installed to run Jupyter notebook

Data source/summary:

  • Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A wavelet transform tool was used to extract features from the images.

  • The dataset is available in the UCI repository

(Optional) Suggest/Propose Solutions

  • I have written an IPython notebook that implements it from scratch.

(Optional) Further Links/Credits to Relevant Resources:


Beginner's ML coding exercises

Goal: To learn one of the most important parts of an ML project, data wrangling, and then apply basic supervised classification algorithms.

What and Why: To predict the survival of the passengers of the Titanic. This gives hands-on learning experience with classification problems and an introduction to solving various ML problems. It might even give you a head start on Kaggle.

Apply Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes and much more

Bonus and extra credits : Understanding regularization, model selection, confusion matrix and cross validation.

[BUG] For google colab use, data paths not correct

Describe the bug
Data paths are not correct for Google Colab use. Users either have to download the data and then upload it to their Google Drive, or they should point the data source to the correct data in our GitHub repo.

Proposed Solution
Point all our data sources to our GitHub links and/or provide docs on how to upload files to Google Drive and mount Google Drive in Google Colab.
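For the first option, one possible helper is sketched below. It rewrites a GitHub "blob" page URL into the matching raw.githubusercontent.com URL, which Colab can read directly. The function name and the example owner, repository, and file path are hypothetical, and the sketch assumes a branch name without slashes.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url):
    """Convert a github.com 'blob' page URL into its raw-content URL."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected path shape: <owner>/<repo>/blob/<branch>/<path to file>
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, branch, *file_path = parts
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, branch, *file_path]
    )


# Hypothetical owner, repository, and file path, for illustration only.
page_url = "https://github.com/example-user/example-repo/blob/master/data/train.csv"
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

A notebook could then load the file straight from `raw_url`, for instance with `pandas.read_csv`, without any upload step.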

[EXE] idea for exercise 005 : Sentiment analysis

Learning Goals

  • learn preprocessing of text data and various important processes such as tokenization and stemming
  • learn regular expressions through stopwords removal from text
  • learn logistic regression
  • learn naive bayes
  • learn application of neural networks on text data
  • learn vectorization techniques and mathematical implementations through numpy
  • learn and use libraries such as pandas, numpy, nltk, tensorflow, keras and scikit-learn.
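The preprocessing goals above (tokenization, stopword removal, stemming) can be sketched with the standard library alone. The stopword list and the suffix-stripping stemmer here are deliberately tiny, made-up stand-ins; the exercise itself would more likely use nltk for both.

```python
import re

# Tiny illustrative stopword list; a real exercise would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [crude_stem(token) for token in remove_stopwords(tokenize(text))]


print(preprocess("This movie was surprisingly moving, and the ending is great!"))
```

The crude stemmer turns "moving" into "mov", which shows why stems are not always real words and why a proper stemmer or lemmatizer is worth learning next.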

Exercise Statement

In this exercise, one would be able to know how to analyse sentiments of text data using various conventional and advanced algorithms along with textual data processing techniques.

  • apply logistic regression for sentiment analysis
  • apply naive bayes for sentiment analysis
  • apply word2vec algorithms

Prerequisites

This exercise goes from basic methods to advanced ones, so there are no hard prerequisites. It is recommended, though, that one knows the basic ML workflow in order to follow along comfortably.

Data source/summary:

Data has been taken from various sources and public datasets available.

I would update this later.

(Optional) Suggest/Propose Solutions

I have a solution and I would be adding a PR for the same. However, one is welcome:

  • to enhance the existing solution
  • to add other methods and algorithms for sentiment analysis

(Optional) Further Links/Credits to Relevant Resources:

I would update this later with various references.

[EXE] Learn regularizations

Learning Goals

Learn different methods of regularizing models. This could be as basic as L1 and L2 (Lasso and ridge) regularization, or more sophisticated ones in other methods (like regularization hyperparameters in SVM), or even dropout in neural networks.

Feel free to use any data and models as you see fit.
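As one possible starting point, the sketch below shows the shrinking effect of L2 (ridge) regularization using its closed-form solution in NumPy. The synthetic data, the true weights, and the lambda values are illustrative assumptions; L1 (Lasso) has no closed form and is not covered here.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression (no intercept): (X^T X + lam I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Synthetic data with a known weight vector, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(0, 0.1, size=50)

norms = {}
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_fit(X, y, lam)
    norms[lam] = float(np.linalg.norm(w))
    print(f"lambda={lam:>5}: ||w|| = {norms[lam]:.3f}")
```

The weight norm falls as lambda grows, which is the core trade-off the exercise is meant to explore: a larger penalty means smaller weights and a less flexible model.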

[IMP] 005 sentiment analysis missing codes

The notebook seems to be missing some code at the end, in the part about the distribution of positive and negative reviews.

@tejasvi541 Could you take a look, since you last worked on it? Please fork the current/updated master file and add your code to the end, if you have it. Thanks!

[EXE] New idea for exercise on creating an interactive ML application

Learning Goals

For people who want to not just create a machine learning model but also create an interactive dashboard to go with it, this exercise might be a great starting point!

Exercise Statement

I will provide a sample machine learning project along with a small tutorial on how to convert it into an interactive project using Streamlit and Heroku.

Prerequisites

Working knowledge of python and sklearn should suffice as a good prerequisite.

Example

Here is a link to my Motivation Poster Generator Dashboard that I created!
https://mymotivationalapp.herokuapp.com/

This kind of application will be the result of this exercise. What do you guys think?

[IMP] Data visualization of Exercise 002

Following PR #47, one should provide visualization of the data. I rephrased my previous suggestions to below:

  1. Add to the exercise, the solution, or both a plot of the data before running the model, preferably right after the data has been loaded. It is useful to list a few data points (in pandas, that's the head method) and/or plot the graph. That way, learners understand how/why we are using linear regression, instead of just blindly running the model.

  2. After training (20,000 epochs!), plot the data with the model predictions from linear regression. As an optional bonus, one could plot a few predictions, say one after 1,000 epochs, one after 10,000 epochs and one after 20,000 epochs. This illustrates how the fit gets better and better (if it doesn't, then we should have stopped much earlier in our training process).

Create code of conduct

Need a first pass/draft of code of conduct. Probably will be modified appropriately once the core team is more organized

[IMP] Add Pytorch solutions

Expanding on issue #26, it may be worth adding multiple solutions written using both tf/keras and PyTorch. I would be happy to work on solutions to current exercises.

[EXE] GPT-2 Use

Learning Goals

OpenAI has generated lots of hype with GPT-3. In some ways, it's like a GAN but for NLP, generating 'fake' texts from some primes and prompts.

The goal of the exercise is to teach learners how to use the model, rather than how to train/build it.

Exercise Statement

Pick some prompts/primes from recent controversial topics and generate texts. For example, it could be about masks, police brutality, etc.

Prerequisites

Some familiarity with using pretrained models. Reading GPT-2 docs will be useful.

Data source/summary

No data needed as we are using a generative pretrained model.

[DOC] Create a detailed format for the project

Create a template for the readme page of each exercise and solution. The solution readme would also contain links to quality resources on the topic. This would make learning more systematic and help people revise a topic before starting the project.

[EXE] Learning KNN supervised classification

Learning Goals

Learn kNN algorithm for supervised classifications. Preferably use the kNN package from scikit-learn.

Prerequisites

Some basic knowledge of kNN is assumed. If scikit-learn is used, knowing how to install the scikit-learn library is also assumed.

Data source/summary:

I'm agnostic about which dataset to use, so anything suggested from a textbook exercise/blog is good.
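A minimal sketch of what the exercise could cover is given below, using scikit-learn's `KNeighborsClassifier`. The iris dataset and the values of k are assumptions chosen for brevity, since the issue leaves the dataset open.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)
    print(f"k={k}: test accuracy = {scores[k]:.3f}")
```

Varying `n_neighbors` is the natural first experiment. Feature scaling and the choice of distance metric would be reasonable follow-up tasks.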

Python versions of programming assignments for Andrew Ng's ML Coursera course

I'm currently doing Andrew Ng's famous ML course on Coursera (https://www.coursera.org/learn/machine-learning/home/welcome) and found this neat GitHub repo, which provides Jupyter Notebooks for students who prefer to use Python for the programming assignments rather than Octave/MATLAB (since Python is much more applicable these days, but the course is unparalleled in teaching the theory / fundamentals in my opinion). The Jupyter Notebooks also include excellent documentation and notes from the lectures.

https://github.com/dibgerge/ml-coursera-python-assignments

Since this is such a popular course, I was going to suggest including it as a submodule to this repo, and I can contribute my Python solutions to the programming assignments for other students to refer to. What do you guys think?

[EXE] A simple exercise to understand stochastic gradient descent

In particular, help a learner understand full gradient descent, mini-batch stochastic gradient descent, etc.

It could be on linear regression, some simple neural network, etc.

But it should focus on understanding the differences between the gradient descent approaches: their pros and cons, how fast or slow learning is, how they depend on dataset size, etc.
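One way to put the variants side by side is sketched below: a single function for linear regression whose `batch_size` argument switches between full-batch, mini-batch, and stochastic gradient descent. The synthetic data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np


def gradient_descent(X, y, batch_size, lr=0.05, epochs=50, seed=0):
    """Linear regression by gradient descent with a configurable batch size.

    batch_size == len(X) gives full-batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    anything in between is mini-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient of the MSE
            w -= lr * grad
            updates += 1
    mse = float(np.mean((X @ w - y) ** 2))
    return w, mse, updates


# Synthetic regression data with known weights, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

for name, batch_size in [("full batch", 200), ("mini-batch", 20), ("stochastic", 1)]:
    w, mse, updates = gradient_descent(X, y, batch_size)
    print(f"{name:>10}: updates={updates:>5}  final MSE={mse:.4f}")
```

For the same number of epochs, smaller batches make many more (noisier) updates. Plotting the loss per update and per epoch for each setting would make the trade-offs visible.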

Links and relations to other code-based open source learning project

1. a. Curate a list of GitHub-hosted code examples/exercises. Off the top of my head, I can think of joelgrus and ageron. Both are fantastic resources.

b. Since these other resources are great open source code, perhaps we could incorporate some of them (with appropriate credit and source links, of course).

2. How do we distinguish our project from the other projects in 1? Some suggestions/ideas:

a. Those textbook-to-GitHub-hosted code examples are typically curated and maintained by the writers (with some contributions from everyone). We, however, are mainly community-driven, so there might be a difference in how we approach exercises. Perhaps a feature (or a bug?) for us is that we might start with a solution that is very rough, poorly styled and full of bugs, but the hope is that we can slowly polish it as a community, learn good practice and style, and eventually reach a great code solution. It is a little bit like a code-learning journey rather than trying to get the perfect code from the get-go.

b. Moreover, I'd really like this project to have a large pool of contributors, with people from different levels and backgrounds. Since none of us is a master of most things, this could give the project a large scope. I envision that we might branch into a few main umbrellas of either topics or coding expertise, such that the project will transcend a particular textbook or a particular field.

Titanic exercise with PyTorch

It seems we have a TensorFlow solution. It would be nice to also have a PyTorch implementation of a simple neural network for survival classification.

[EXE] Ensemble and stacking methods

Learning Goals

Starting with random forest, AdaBoost, etc., up to the popular LightGBM and XGBoost, expose learners to a variety of bagging/boosting/ensemble/stacking techniques.

This could be a series of 4-5 exercises.

Create category for different levels and topics?

As the number of exercises grows, we might want to restructure things a little to categorize them properly. For example, intermediate exercises could take numbers from 300 up, and advanced/state-of-the-art ones from 900 up, or something similar. Then, within each range of numbers, different topics could take different numberings.

[EXE] Exercise to use huggingface

Some exercises and solutions based on huggingface would be cool. This is an advanced project. For example, use the library to perform BERT or GPT-2 analysis on some fun NLP-related data.

Create cheatsheets

  • Python refresher
  • Numpy and Pandas cheatsheet
  • Sklearn cheatsheet
  • PyTorch cheatsheet
  • TensorFlow cheatsheet

[EXE] Mask Detector in Live Cameras

Learning Goals

Learn to preprocess images, use a new mobile neural network architecture, and learn TensorFlow.

Exercise Statement

Apply the MobileNetV2 model to detect whether someone is wearing a mask in live video.

Prerequisites

Python

Data source/summary:

Mask images from GitHub.

(Optional) Suggest/Propose Solutions

I have the solution in scikit-learn and will be happy to share it.

(Optional) Further Links/Credits to Relevant Resources:

