Georgia Tech Junior Design Project

HTML 2.11% CSS 11.69% JavaScript 48.73% Python 37.46%

team-2130-machine-learning-roulette's Introduction

Machine Learning Roulette - Team 2130

Our project is a website that evaluates how selected machine learning algorithms perform on a dataset (in CSV form) that the user uploads. On the front end, the website accepts a dataset, lets the user select one or more models and their parameters, and displays statistical model quality results. On the back end, we run the selected models, as configured, on the uploaded dataset and return aggregated quality metrics for the website to render in an informative way.

Team members

Hyelin Lee: [email protected] (Role: Summarizer)

Yuanzhi (David) Liu: [email protected] (Role: Opinion Seeker)

Ruokun (Tommy) Niu: [email protected] (Role: Information Giver)

Harrison L O'Neal: [email protected] (Role: Clarifier)

Junqi (Jacky) Xu: [email protected] (Role: Initiator, Information Giver)

Haoran (Marty) Zhao: [email protected] (Role: Information Seeker)

Release Note

Version 0.4.0

New Features

Display metrics for the machine learning model training result (including accuracy, prior probability, mean, standard deviation, etc.)
Separate login page and register page. After a new user registers, they are redirected to the upload page automatically.
Set up database procedure to store history data
Default training percentage set to 70%
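
As a rough illustration of the metrics and the 70% default listed above, here is a minimal sketch using only the Python standard library. The function names (`split_rows`, `accuracy`, `class_priors`, `column_summary`) are hypothetical and not taken from the project's code, and the split shown here does no shuffling.

```python
import statistics
from collections import Counter

DEFAULT_TRAIN_FRACTION = 0.7  # the 70% default from the release note


def split_rows(rows, train_fraction=DEFAULT_TRAIN_FRACTION):
    """Split rows into (train, test) in order, without shuffling."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def class_priors(labels):
    """Prior probability of each class, estimated from label frequencies."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def column_summary(values):
    """Mean and sample standard deviation of one numeric column."""
    return {"mean": statistics.mean(values), "std": statistics.stdev(values)}
```

Note that `statistics.stdev` computes the sample standard deviation and needs at least two values; whether the project reports the sample or population figure is not stated in the release note.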

Bug Fixes

Disabled the model selection once the user has uploaded their dataset

Version 0.3.0

New Features

Implemented Naive Bayes
Implemented Hierarchical Clustering
Implemented Decision Tree
Database setup
Supported y-label for accuracy comparison

Bug Fixes

Changed the order of the frontend upload stages. The user now chooses the ML model first and then uploads their dataset. A y-label is required for some ML models and optional for others, so choosing the model first makes the logic clearer: the frontend can then decide whether a y-label is mandatory.
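
The y-label decision described above could be sketched as a simple lookup. This is only an illustration: the model keys are hypothetical, and the required/optional split is an assumption based on Naive Bayes and Decision Tree being supervised classifiers while KMeans and hierarchical clustering are unsupervised.

```python
# Assumption: supervised models need a y-label, clustering models do not.
# The keys are hypothetical identifiers, not the project's actual model names.
MODEL_NEEDS_Y_LABEL = {
    "naive_bayes": True,
    "decision_tree": True,
    "kmeans": False,
    "hierarchical_clustering": False,
}


def y_label_required(selected_models):
    """Return True if at least one selected model cannot run without a y-label."""
    unknown = [m for m in selected_models if m not in MODEL_NEEDS_Y_LABEL]
    if unknown:
        raise ValueError(f"unknown model(s): {unknown}")
    return any(MODEL_NEEDS_Y_LABEL[m] for m in selected_models)
```

With a table like this, the upload step only needs the list of selected models to decide whether to mark the y-label field as mandatory.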

Version 0.2.0

New Features

Implemented KMeans algorithm that takes a CSV dataset and the number of clusters as parameters
Built backend API that calls the KMeans algorithm to get a cluster assessment
User authentication (Register and Login)
Deploy the website (https://www.mleroulette.com/)
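
To make the KMeans item above concrete, here is a minimal standard-library sketch of a k-means routine that takes CSV text and a cluster count. It is not the project's implementation; the function names, the `seed` parameter, and the choice of random initial centroids are all assumptions made for illustration.

```python
import csv
import io
import random


def load_numeric_csv(text, has_header=True):
    """Parse CSV text into a list of rows of floats."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if has_header:
        rows = rows[1:]
    return [[float(cell) for cell in row] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, seed=0, max_iterations=100):
    """Plain k-means; returns (assignments, centroids)."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and the number of points")
    rng = random.Random(seed)
    # Start from n_clusters distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

A caller would pass the uploaded file's text through `load_numeric_csv` and then call `kmeans(points, n_clusters)` with the cluster count chosen in the frontend.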

Bug Fixes

Disabled upload button when the user is not in the first stage of upload.

Version 0.1.0

New Features

Frontend page for uploading dataset (CSV Format) and selecting ML models and parameters.
Frontend page for Login and registration
Error Modal
Backend APIs to receive CSV dataset and parameters

Bug Fixes

Large margin in "Upload: step2"

Installation Guide

Install Git https://github.com/git-guides/install-git
Git is a distributed version control system that tracks changes in any set of files. In this project, we use Git for version control.

git clone https://github.com/JackyXu-Cool/Team-2130-Machine-Learning-Roulette

Install Node.js and npm. Our backend uses Node.js. After Node.js is installed, follow the instructions here to install the backend.
Install the frontend-related packages. Follow the instructions here.
Database integration. The database features we'd like to have are fully set up, but we haven't integrated them into our application yet. Follow the guide here to learn more about how our database works.
IDE installation. We strongly recommend using a lightweight IDE, such as Visual Studio Code, to run our application.

Client

Jay Lofstead, Sandia National Laboratories

team-2130-machine-learning-roulette's Issues

Allow registered users to select a previously uploaded dataset

Registered users should be able to use a previously-uploaded dataset.

As a user, I want to upload my own dataset so I can generate models.

Dataset storage and public availability

In the frontend, add a checkbox: "Do you agree to make this dataset public?"
Add a new page for users to select from the currently publicly available datasets
Store Dataset in AWS S3

Implement Kmeans

Implement Kmeans
Write Unit tests

'/training' supports the Naive Bayes algorithm

When NB is selected, the backend will call the NB algorithm with the provided dataset

Control Random Seed Variable

As a researcher, I want to be able to control the random seed variable so I can easily reproduce my work.
Scenario: A user wants to control the random seed variable for the Machine Learning models
Given that the user wants to reproduce his/her previous work by controlling the random seed variable;
When the user inputs a numerical integer for the random seed value before executing the models;
Then the models will execute using the inputted value as the random seed variable.
Scenario: A researcher wants to verify a peer’s results.
Given the initial research has the random seed variable clearly indicated;
When the peer reviewer inputs the same random seed variable and dataset;
Then they will see the same exact results.
Scenario: A researcher wants to make sure their models are replicable.
Given they know what model they want to use;
When the researcher is entering hyperparameters, they manually set and note down their random seed variable;
Then anyone who has the researcher’s data and random seed variable will be able to exactly recreate the researcher’s models.
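
The reproducibility scenarios above depend on every random choice flowing from the user-supplied seed. A minimal sketch of that idea, assuming a hypothetical `shuffled_split` helper that is not part of the project's code, could look like this:

```python
import random


def shuffled_split(rows, seed, train_fraction=0.7):
    """Shuffle rows with a private, seeded generator and split them.

    Using random.Random(seed) instead of the module-level functions keeps
    the result independent of any other code that touches the global RNG.
    """
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling `shuffled_split(rows, seed=42)` twice on the same data yields the same train and test sets, which is the property a peer reviewer relies on when re-entering a published seed.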

Connect the base frontend code with cloud service

Users should be able to trigger backend ML scripts from frontend

Store training result and parameters into the database

AWS bucket construction

Set up the AWS buckets. Build one public bucket (its name starting with "www") for future website use and another one for internal testing.

As a user, I want to have the training done in the background and notified when completed so I can leave the computer and do some other tasks at the same time.

Website Deploy

Create domains for the website and deploy it through Route 53

Implement Naive Bayes

Implement Naive Bayes
Implement Unit Tests

Documentation (for project tracking assignment)

All the required features for our application are done. However, we did not have enough time to implement all of the stretch goal, which is the database implementation and integration. See our database documentation for more details: https://github.com/JackyXu-Cool/Team-2130-Machine-Learning-Roulette/tree/master/mlr_database

Features we failed to implement:

Select previously uploaded dataset https://app.zenhub.com/workspaces/team-2130-61f8345f61dce90014cd6cc5/issues/jackyxu-cool/team-2130-machine-learning-roulette/11
Store training result https://app.zenhub.com/workspaces/team-2130-61f8345f61dce90014cd6cc5/issues/jackyxu-cool/team-2130-machine-learning-roulette/12
Cloud service infrastructure https://app.zenhub.com/workspaces/team-2130-61f8345f61dce90014cd6cc5/issues/jackyxu-cool/team-2130-machine-learning-roulette/13

Uploading datasets

As a registered user, I want to upload my own dataset so I can generate models.
Scenario: A user has a fresh dataset and doesn’t know what models to generate.
Given the user is on the upload page;
When the user selects which dataset to upload;
Then they will be able to select a wide variety of models to start determining the best model to move forward with.
Scenario: The user knows which model they want to generate.
Given the user has already uploaded their dataset;
When the user is selecting their models to generate;
Then they will be able to fine tune their hyperparameters.
Scenario: The user does not have a dataset that they want to use;
Given the user is on the upload page;
When they don’t know what to upload;
Then they can follow links to find publicly available datasets.
Scenario: The dataset exceeds the maximum size allowed
Given the user has already uploaded their dataset and the size of the dataset has exceeded the limit;
When the user presses the “start training” button;
Then an error will be thrown, notifying the user that the data has exceeded the size limit
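
The size-limit scenario above could be checked before training starts. The sketch below is illustrative only: the 10 MB limit, the `DatasetTooLargeError` class, and the `check_dataset_size` function are hypothetical, since the source does not state the actual limit or how the error is raised.

```python
MAX_DATASET_BYTES = 10 * 1024 * 1024  # hypothetical limit; the real value is not documented


class DatasetTooLargeError(ValueError):
    """Raised when an uploaded dataset exceeds the allowed size."""


def check_dataset_size(payload: bytes, limit: int = MAX_DATASET_BYTES) -> int:
    """Return the payload size in bytes, or raise if it exceeds the limit."""
    size = len(payload)
    if size > limit:
        raise DatasetTooLargeError(
            f"dataset is {size} bytes, which exceeds the {limit}-byte limit"
        )
    return size
```

The handler for the “start training” button could call this first and turn the exception into the error message shown to the user.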

Review results from previous runs

As a user, I want to see old results so I can avoid having to run the algorithm again.
Scenario: An existing user is looking for their results they got several days ago.
Given the user was logged in when they began running their models;
When the user goes to their results history page;
Then they will be able to see their past results.
Scenario: A new user is looking for historical result
Given the user was just a guest user and does not have an account
When the user tries to see their historical result
Then they will be directed to the account sign-up page.
Scenario: The user was not logged in when running their models.
Given the user has closed the webpage since running their models;
When the user goes to the results history page;
Then they will be asked to log in, but their results won’t be available.
Scenario: The user was not logged in but selected to make their results public.
Given the user has closed the webpage since running their models;
When the user goes to the general database;
Then they will be able to find the results.

How to run

Please help with instructions on how to run this. I am pretty new to this. Can I use Visual Studio Code?

Complete the front-end portions for dataset upload, selecting ML models and hyperparameters

Users should be able to click on the "upload" button to upload their datasets (in csv format).
There should be an array of checkboxes that allows users to select their preferred ML models
There should be an array of text input fields that allows the users to modify the values of the hyperparameters

As a user, I want to choose the parameters so I can figure out the best fit case for different analysis.

Allow users to upload Y labels

Backend system setup

Backend system setup to run multiple Machine Learning algorithms

Training notification

As a user, I want to have the training done in the background and notified when completed so I can leave the computer and do some other tasks at the same time.
Scenario: A user wants to leave his/her workplace
Given the user has started the training process by clicking the “start training” button
When the user shuts off the computer
Then the training should still keep running in the background until it is completed.
Scenario: the user wants to run two trainings at the same time
Given the user has started one training session
When the user wants to train another dataset
Then the user could open another tab to train another dataset while the first dataset will still keep training in the background.
Scenario: after the training is done in the background, the user wants to be notified via email.
Given the user has already uploaded the dataset and selected the parameters.
When the user wants to enable notification via emails.
Then the user can click the “enable notification when finished” button, after which a prompt will appear asking the user to type in an email address or use a saved one. After the training is completed, a notification email will be sent to this address.
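
One way to picture "train in the background, then notify" is a worker pool with a completion callback. The sketch below uses a thread pool purely as a stand-in: `start_training`, `train`, and `notify` are hypothetical names, no email is actually sent, and a thread inside one process would not survive the user's computer shutting down, so a real deployment would need a server-side job runner.

```python
from concurrent.futures import ThreadPoolExecutor


def start_training(pool, train, dataset, notify):
    """Run train(dataset) on the pool and call notify(message) when it ends.

    `notify` is a placeholder for whatever delivers the message,
    for example a function that sends an email.
    """
    future = pool.submit(train, dataset)

    def on_done(done):
        error = done.exception()
        if error is None:
            notify(f"Training finished: {done.result()}")
        else:
            notify(f"Training failed: {error}")

    future.add_done_callback(on_done)
    return future
```

Because each call submits an independent job, two trainings can run side by side, which matches the second scenario above.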

Decision Tree Algorithm fails to run

I ran the decision tree algorithm with the X_test.csv and Y_test.csv in /testdata folder, but I got this error

  File "C:\JackyXu\GT\Spring 2022\Junior Design\code\mlr_backend\run.py", line 90, in trainData       
    dtree_accuracy = metrics.calculateAccuracy(prediction, Ytest)
  File "C:\JackyXu\GT\Spring 2022\Junior Design\code\mlr_backend\metrics.py", line 6, in calculateAccuracy
    model_output = [int(row[0]) for row in y.tolist()]
  File "C:\JackyXu\GT\Spring 2022\Junior Design\code\mlr_backend\metrics.py", line 6, in <listcomp>   
    model_output = [int(row[0]) for row in y.tolist()]
TypeError: 'float' object is not subscriptable

Can you take a look at it @ruokun-niu
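
The traceback suggests that `y.tolist()` returned a flat list of floats (a one-dimensional label array), so `row[0]` tried to index into a float. One possible fix, sketched here on plain lists with hypothetical helper names and not confirmed as the change the team made, is to accept both flat and single-column shapes:

```python
def flatten_labels(rows):
    """Turn [1.0, 0.0] or [[1.0], [0.0]] into [1, 0]."""
    flat = []
    for row in rows:
        # A nested row carries the label in its first column; a scalar is the label.
        value = row[0] if isinstance(row, (list, tuple)) else row
        flat.append(int(value))
    return flat


def calculate_accuracy(prediction, y):
    """Fraction of predictions matching the labels, for either input shape."""
    predicted = flatten_labels(prediction)
    expected = flatten_labels(y)
    if len(predicted) != len(expected) or not expected:
        raise ValueError("prediction and y must be non-empty and equally long")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)
```

In the original code the inputs are array objects, so the equivalent call there would pass `prediction.tolist()` and `y.tolist()` into these helpers.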

Prompt registered users to log in/reset password

When a user tries to create an account with an email address that is already registered, the system should prompt the user to log in or reset the password if needed.
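
A minimal sketch of this check, assuming a hypothetical in-memory `users` dictionary in place of the real user store and hypothetical `action` values for the frontend to act on:

```python
def register(users, email, password_hash):
    """Register a new account, or tell the caller to prompt for login/reset.

    `users` maps a normalized email address to a stored password hash.
    """
    key = email.strip().lower()  # treat "A@x.org " and "a@x.org" as the same account
    if key in users:
        return {
            "ok": False,
            "action": "login_or_reset",
            "message": "This email is already registered. Log in or reset your password.",
        }
    users[key] = password_hash
    return {"ok": True, "action": "go_to_upload", "message": "Account created."}
```

The `go_to_upload` action mirrors the 0.4.0 behaviour of redirecting a newly registered user to the upload page.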

Select parameters

As a user, I want to choose the parameters so I can figure out the best fit case for different analysis.

Scenario: User wants to select the parameters for the specific ML models
Given the user has already uploaded the dataset and selected the ML model(s);
When the user selects the related parameters;
Then the ML models will adjust based on the selection when executed.
Scenario: User wants to find the random seed that generates the optimal results for their dataset.
Given the user has selected their algorithm to use and identified a range of seed variables to try.
When the user generates models for every random seed.
Then they will be able to determine and utilize the best seed for their data in future uses.
Scenario: User accidentally selects no parameters, but the ML model they selected requires at least one parameter
Given the user does not select any parameter
When they click “run the model”
Then the model will not run and the user must select at least one parameter to continue.
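
The last scenario amounts to validating parameters before a run. In the sketch below, the `REQUIRED_PARAMETERS` table and its keys are hypothetical; only the fact that KMeans takes a number of clusters comes from the release notes.

```python
# Hypothetical table of parameters each model cannot run without.
REQUIRED_PARAMETERS = {
    "kmeans": ("n_clusters",),
    "naive_bayes": (),
}


def missing_parameters(model, provided):
    """List the required parameters that are absent or left blank."""
    if model not in REQUIRED_PARAMETERS:
        raise ValueError(f"unknown model: {model}")
    return [
        name
        for name in REQUIRED_PARAMETERS[model]
        if provided.get(name) in (None, "")
    ]
```

If the returned list is non-empty, the run is blocked and the names in it tell the frontend which fields to highlight.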

Cloud Service construction and management

Cloud Service construction and management
Deploy Machine Learning scripts to a cloud service provider:

AWS: Linux
Azure: Linux & Windows
