
[UNMAINTAINED] Automated machine learning- just give it a data file! Check out the production-ready version of this project at ClimbsRocks/auto_ml

Home Page: https://github.com/ClimbsRocks/auto_ml

JavaScript 49.30% Python 50.70%
machine-learning data-science javascript machine-learning-library machine-learning-algorithms ml data-scientists javascript-library scikit-learn kaggle

machinejs's Introduction

machineJS

A fully-featured default process for machine learning: all the parts are here, with functional default values in place. Modify to your heart's delight so you can focus on the important parts for your dataset, or run it all the way through with the default values for fully automated machine learning!

auto_ml - machineJS, but better!

I just built out v2 of this project; it gives you analytics info from your models and is production-ready. machineJS is an amazing research project that clearly proved there's a hunger for automated machine learning.

auto_ml tackles this exact same goal, but with more features, cleaner code, and the ability to be copy/pasted into production.

Check it out! https://github.com/ClimbsRocks/auto_ml

What is machineJS?

machineJS provides a fully automated framework for applying machine learning to a dataset.

All you have to do is give it a .csv file, with some basic information about each column in the first row, and it will go off and do all the machine learning for you!

If you've already done this kind of thing before, it's useful as an outline, putting in place a working structure for you to make modifications within, rather than having to build from scratch again every time.

machineJS will tell you:

  • Which algorithms are going to be most effective for this problem
  • Which features are most useful
  • Whether this problem is solvable by machine learning at all (useful if you're not sure you've collected enough data yet)
  • How effective machine learning can be with this problem, to compare against other potential solutions (like just taking a grouped average)

If you haven't done much (or any) machine learning before, it does some fairly advanced stuff for you!

Installation:

As a standalone directory (recommended)

If you want to install this in its own standalone repo, and work on the source code directly, then from the command line, type the following:

  1. git clone https://github.com/ClimbsRocks/machineJS.git
  2. cd machineJS
  3. npm install
  4. pip install -r requirements.txt
  5. git clone https://github.com/scikit-learn/scikit-learn.git
  6. cd scikit-learn
  7. python setup.py build
  8. sudo python setup.py install

From the command line

node machineJS.js path/to/trainData.csv --predict path/to/testData.csv

Format of Data Files:

We use the data-formatter module to automatically format your data, and even perform some basic feature engineering on it. Please refer to data-formatter's docs for information on how to label each column to be ready for machineJS.

How to customize/dive in deeper:

machineJS is designed to be super easy to use without diving into any of the internals. Be a conjurer- just give it data and let it run! That said, it's super powerful once you start customizing it.

It's designed to be relatively easy to modify, and well-documented. The obvious place to start is inside processArgs.js. Here we set nearly all the parameters that are used throughout the project.

The other obvious area many people will be interested in is adding in new models, and different hyperparameter search spaces. This can be found in the pySetup folder. The exact steps are listed in stepsToAddNewClassifier.txt.

What types of problems does this library work on?

machineJS works on both regression and categorical problems, as long as there is a single output column in the training data. This includes multi-category (frequently called multi-class) problems, where the category you are predicting is one of many possible categories. There are no immediate plans to support multiple output columns in the training data. If you have three output columns you're interested in predicting, and they cannot be combined into a single column in the training data, you could run machineJS once for each of those three columns.

This library is well-tested on Macs. I've designed it to work on PCs as well, but I haven't tested that at all yet. If you're a PC user, I'd love some issues or Pull Requests to make this work for PCs!

Note: This library is designed to run across all but one of the cores on the host machine. What this means for you:

  1. Please plug in.
  2. Close all programs and restart right before invoking (this will clear out as much RAM as possible).
  3. Expect some noise from your fan- you're finally putting your computer to use!
  4. Don't expect to be able to do anything intense while this is running. Internet browsing or code editing is fine, but watching a movie may get challenging.
  5. Please don't run any other Python scripts while this is running.

Thanks for inviting us along on your machine learning journey!

machinejs's People

Contributors

climbsrocks, jalehman, kuychaco, sunnmy


machinejs's Issues

training order

Almost all these ideas are slated for release 4.0. stay super focused on 2.0 and 3.0 first.

  1. nn and RF first, as these generalize well
    train the best nn until the rf finishes

only once these two have finished, start the next ones
don't try to start training them all in parallel. just two classifiers at a time.

create the ability for people to add in more over time. so they can start with just a nn and a rf, and then train more later that night.

consider modularizing this out into different repos entirely. have one repo for nns, one repo for RFs, etc. the master parent repo will then just aggregate all the child repos together intelligently.

modularize even further: have one python formatting repo, which all other python repos include.

i really want to make grid search for neural nets in JS available super broadly.

offer multiple training speeds: thorough, medium, fast, and super light. or just make it a 5-point scale. let people choose how much compute power they want to give to this.

make the saved .pkl and .json files super modular- if they get one from a colleague, they should be able to swap that out in the directory, invoke rePredict and reEnsemble, and be good to go.

if possible, it certainly never hurts to keep training the nn for longer :)

figure out how to pass in a kaggle test file

this will force me to deal with a number of unknowns like:

A. How do i make the bestNet publicly available?
B. do i have to make a server publicly available in order to have it continuously listening for prediction data?
C. how do i make sure the prediction data gets formatted the same way as our training data?
D. how do i clean up extraneous files created during this process? (make sure to create a --leaveGarbage flag to not destroy any files)

naming ideas

best-brain
automatic-brain
automatic-neural-network (i like this best so far)
automated-machine-learning

i'm torn between something obvious that will make it easy for people to find, and something with a bit more personality.

writing bestNet to file limitations

if it is the first training round, and the previous bestNet write was less than half a second ago, and the average bestNet writing velocity is above a certain threshold (averaging more than 5 nets in 3 seconds, say), don't write the new one to a file

then, at the end of the first training round, write the bestNet to a file.
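
a rough sketch of that throttling rule (the thresholds and the function name are placeholders, not the actual machineJS implementation):

import time

WRITE_COOLDOWN_SECS = 0.5        # "less than half a second ago"
MAX_WRITES_IN_WINDOW = 5         # more than 5 nets in 3 seconds counts as too fast
VELOCITY_WINDOW_SECS = 3.0

def should_write_best_net(is_first_round, recent_write_times, now=None):
    """Return True if the new bestNet should be written to a file."""
    now = time.time() if now is None else now
    if not is_first_round or not recent_write_times:
        return True
    too_soon = (now - recent_write_times[-1]) < WRITE_COOLDOWN_SECS
    writes_in_window = [t for t in recent_write_times if now - t <= VELOCITY_WINDOW_SECS]
    too_fast = len(writes_in_window) > MAX_WRITES_IN_WINDOW
    return not (too_soon and too_fast)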

refactor away from the use of the globals file

it had the issues i feared it would, but my tired brain decided to push ahead anyways and give it a shot.

i could refactor to have a globals object that gets passed around as an explicit input and output, but it might be cleaner to just create each variable individually to then get passed around as explicit arguments.

let the user pass in more information

it would probably be best to let them pass in more info in the first row. this info would label:

"ID"- identifies the id field
"Categorical Prediction" identifies an output field, and notes that the expected output is categorical
"Numerical Prediction" identifies an output field and notes that the expected output is numerical
"Categorical Field" notes that this field is categorical, even if we might otherwise assume it is a number.

prepare a 1.0 release

as soon as the docs are updated, make a 1.0 release.

2.0 will be once i've got more classifiers trained. this includes training the neural network for a considerably longer period of time.

3.0 will be once i've got more creative ensembling up and running.

4.0 will be generalizing it past just kaggle, and maybe giving an api to use from within javascript. this release might also include more control/advanced options for those who want it. but this has to be laser focused on 2.0 and 3.0 first.

5.0 will be revisiting the stats/ML part of this. frankly, the ensembling and grid search take care of most concerns here, but i'm sure there are ways we can do this more accurately.

training, testing, training

here's a fun one to try out at some way future point:

  1. train the algos on 80% of the dataset
  2. test them on 20%
  3. assemble the ensemble based on how they did on this 20%.
  4. this should, theoretically, test which ones generalize well and which do not. once we have figured this out, go back and train classifiers against the entire input data set, using those exact same parameters we already identified as training well. but keep the same ensembling method we already developed that we know should generalize well. in this way, we've tested for generalization, but trained on the entire dataset.

this seems like it could be fairly controversial. but just test it and see if it decreases accuracy or not. admittedly, that will be somewhat expensive in terms of development time if it doesn't work out, but it could be pretty useful if it does.
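
a minimal scikit-learn sketch of that flow, with placeholder data and a simple accuracy-weighted ensemble standing in for the real ensembler:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)   # placeholder data
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# steps 1-3: train on 80%, score on the held-out 20%, and turn those scores into ensemble weights
weights = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    weights[name] = accuracy_score(y_holdout, clf.predict(X_holdout))

# step 4: keep the ensembling weights, but refit the same configurations on the full dataset
for clf in candidates.values():
    clf.fit(X, y)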

write everything to a file

create a json that is a summary of everything.

ok, it will probably have to be a folder, and in that folder we'll have a .pkl of each of the classifiers, a json representing the ensembling.

give this the ability to just load up the classifiers that can use more training. i want people to turn this on at night right before they go to bed to put in more training iterations on things like the neural net.
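
one possible shape for that folder, sketched with joblib and hypothetical file names:

import json
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

out_dir = Path("trainedEnsemble")          # hypothetical folder name
out_dir.mkdir(exist_ok=True)

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)   # placeholder data
clf = LogisticRegression().fit(X, y)

# one .pkl per classifier, plus a json describing how they get ensembled
joblib.dump(clf, out_dir / "logistic_regression.pkl")
summary = {"classifiers": ["logistic_regression.pkl"], "ensemble": {"method": "average"}}
(out_dir / "ensemble.json").write_text(json.dumps(summary, indent=2))

# later (e.g. overnight), load a classifier back up to keep training or predicting
clf = joblib.load(out_dir / "logistic_regression.pkl")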

feature request: "does my newly engineered feature matter?"

let users quickly figure out if their new feature is highly predictive or not

it would just give directional guidance

it would go off and quickly train a lasso (probably not a rf, since they're unstable), and tell you how important your new feature is.

far from robust, but could be convenient

would likely rely on data-formatter under the hood so they can just write their results to a .csv file.
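
a rough sketch of that directional check in scikit-learn (placeholder data; the position of the new feature column is assumed):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(200, 6), np.random.rand(200)   # placeholder data
new_feature_index = 5            # hypothetical position of the freshly engineered column

X_scaled = StandardScaler().fit_transform(X)   # put coefficients on a comparable scale
lasso = LassoCV(cv=5).fit(X_scaled, y)

coef = abs(lasso.coef_[new_feature_index])
print("new feature coefficient:", coef, "(zero means the lasso dropped it)")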

break everything out into separate repos

  1. automated brain
  2. automated python classifier trainers
  3. automated ensembler
  4. a script to automatically upload the files to kaggle, get the results, and then write those results to the json object sitting in the first row (above the headers) of our predictions file in the predictions folder. then, if we want, we can invoke our automated ensembler module again with this new piece of information about how well it generalizes.

change trainingCallback frequency to be more frequent

right now you can pass in a --maxTrainingIterations value that is less than our trainingCallback frequency. So it will train for 10 iterations regardless, even if you said --maxTrainingIterations 3.

My concern with making this too frequent is that sending messages from parent to child might be resource intensive. see if that might be blocking, and if so, if there's a way around it.

train the best neural net for considerably longer

once we've figured out the params for the best neuralNet, set it off to train by itself for a while. this is similar to what we do with RFs, getting the params from gridSearch, then training a rf with those params but a ton of trees for a long time.
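
for the RF case, that pattern might look roughly like this in scikit-learn (the grid and the tree counts are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = np.random.rand(300, 8), np.random.randint(0, 2, 300)   # placeholder data

# cheap grid search with a small forest to find good hyperparameters...
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [4, 8, None], "max_features": ["sqrt", 0.5]},
    cv=3,
    n_jobs=-1,
).fit(X, y)

# ...then refit the same configuration with a ton of trees for the long run
final_rf = RandomForestClassifier(
    n_estimators=2000, random_state=0, n_jobs=-1, **search.best_params_
).fit(X, y)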

modularization ideas

  1. data-formatter: format a dataset for neural networks
  2. best-brain: grid search for brain.js. Just returns the most optimized neural network available- no predictions or anything. would not include the extra training time, but would include instructions on how to warmStart the brain returned to you so you could control the extra training time yourself.
  3. automated-brain: combining these two to automate the entire process of making predictions against a dataset using a neural network. would add in the making predictions part
  4. ensembler- takes in the prediction files from various other ml algos, and ensembles them together in creative ways
  5. python-data-formatter: gets data ready for machine learning in python's scikit-learn library (summarizes it so the user can easily spot errors, runs the same transformations against the combined training/testing data set so they're binarized/normalized/whateverized in the same way, imputes missing values, etc.). ideally this would be flexible enough to format it for different ml algos (maybe svms need normalization, while random forests don't)
  6. automated-machine-learning: run all the python classifiers, making predictions against the datasets and writing those to predictions files
  7. assembling all this together to make a single master predictions file at the end with all these ml algos

documentation- getting started

note what it takes to set up their dev environment

pip install: scikit learn, quite possibly other things like pandas

make sure python itself is installed. include instructions on how to do that, and super clear instructions that they should not have to if they are on a mac

is there a way of automatically installing pip dependencies?!

allow user to specify type of input field

this will be useful, particularly for things like date and time, which otherwise could look a lot like text.

i'm sure there are modules out there that will take an uncertain bit of text and convert it into a proper date object, no matter what format it was in to start with.

fix up the neural network; right now it's not doing too well

it quite likely has to do with how we are formatting the data coming in.

consider scaling things a bit more. see if maybe sklearn has a module for this? right now i have a feeling that the outliers are vastly compressing the differences between the more normally sized data points.
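
sklearn does have modules for this; for example, RobustScaler is less thrown off by outliers than plain standardization (toy data below, not machineJS's actual pipeline):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [1000.0]])   # one huge outlier

# StandardScaler lets the outlier compress the "normal" points toward each other...
print(StandardScaler().fit_transform(X).ravel())

# ...while RobustScaler centers on the median and scales by the IQR, so the
# ordinary points keep meaningful spread for the neural net
print(RobustScaler().fit_transform(X).ravel())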

handle categorical data while ensembling

i still like the idea of predicting probabilities, and then using those probabilities to make a more informed categorical prediction.

this is definitely easiest with binary data.

i'm not sure how python handles a category where they might be predicting many different labels.

but let's keep the mvp focus and just predict binary data for now.
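
a minimal sketch of that probability-averaging idea for the binary case (placeholder models and data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(300, 5), np.random.randint(0, 2, 300)   # placeholder data

models = [RandomForestClassifier(random_state=0).fit(X, y),
          LogisticRegression().fit(X, y)]

# average the predicted probability of the positive class, then threshold once
# at the end to get the final categorical prediction
avg_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
final_prediction = (avg_proba >= 0.5).astype(int)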

FUTURE: allow the user to load up a bestNet file

immediate use case: train it overnight for who knows how long, load it up in the morning to makeKagglePredictions.

future use case: let users train in one place and then pass around the results to another machine.

whenever possible, split classifiers, don't combine

for example, with random forests, we could use criterion='gini' or criterion='entropy'. when we had these combined, and had gridsearch choose which of the two was best, it doubled the size of the space that gridsearch had to process through, and gave us a classifier that placed 250th on the kaggle Give Credit competition.

when i broke those out to each be their own separate classifier (holding all the other hyperparameters to test the same, but now having two separate grid searches, one for entropy and one for gini), the training time was probably equivalent (we have cut the training space in half for each classifier, but doubled the number of classifiers), but we got way better results.

turns out that gini generalizes super well here, despite not scoring as well as entropy on gridsearch. gini by itself placed 133, entropy continued to score around 238 (slight improvement even, it seems), and the ensembler placed in between (164).

so we close-to-doubled our placing simply by breaking out classifiers, and i would not expect this to have any kind of a time penalty.
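
in scikit-learn terms, the split version looks roughly like this (placeholder data and grid, not the exact search spaces used here):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = np.random.rand(300, 8), np.random.randint(0, 2, 300)   # placeholder data
shared_grid = {"max_depth": [4, 8, None], "min_samples_leaf": [1, 5]}

# one independent grid search per criterion, instead of folding the criterion
# into a single doubled-up search space
searches = {}
for criterion in ("gini", "entropy"):
    rf = RandomForestClassifier(criterion=criterion, n_estimators=100, random_state=0)
    searches[criterion] = GridSearchCV(rf, shared_grid, cv=3, n_jobs=-1).fit(X, y)

# both tuned forests are kept as separate classifiers and ensembled downstream
gini_rf, entropy_rf = searches["gini"].best_estimator_, searches["entropy"].best_estimator_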

redo the data formatting pipeline for neural nets

i have a feeling that's part of what's driving such poor behavior for our nets. and right now it's rather inflexible, and likely more complicated than it has to be.

we should consider using python for this too. they have some good modules on feature scaling.

and it should definitely follow the pattern i'm about to implement in python of appending the test data to the end of the training data file for formatting purposes, so we know it is all handled the exact same way.
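
a pandas sketch of that append-format-split pattern (file and column names are hypothetical):

import pandas as pd

train = pd.read_csv("trainData.csv")
test = pd.read_csv("testData.csv")

# format the combined data once, so train and test are handled the exact same way
combined = pd.concat([train, test], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=["color"])   # example: binarize a hypothetical categorical column

train_formatted = combined.loc["train"]
test_formatted = combined.loc["test"]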

update documentation!

simplify it down

give a pretty thorough explanation of what it does- not how it does it
make the interface super, super simple.

break advanced options out into their own file. label that file clearly with the fact that everything here is still under super active development. but make the public api stable.

include a note about caffeine: http://lightheadsw.com/caffeine/

have the user pass in a flag for categorical or regression, instead of output

that will tell us what kind of output to look for.

make it work for both categorical and regression data.

right now we assume a single column of regression output. categorical could be difficult. continue to focus on a single column for now.

categorical could be difficult because it has so many columns potentially. write the classifier's name to each column.

have a single huge mapping obj that we write to json that maps from the net name, to the type, to the prediction file it created, to the file where it was persisted to disk, etc.
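
a hypothetical shape for that mapping object (all names and paths below are made up for illustration):

import json

model_map = {
    "rf_gini": {
        "type": "categorical",
        "predictionFile": "predictions/rf_gini.csv",
        "persistedModel": "pySetup/trainedModels/rf_gini.pkl",
    },
    "neural_net": {
        "type": "regression",
        "predictionFile": "predictions/neural_net.csv",
        "persistedModel": "pySetup/trainedModels/neural_net.json",
    },
}

with open("modelMap.json", "w") as f:
    json.dump(model_map, f, indent=2)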
