
[UNMAINTAINED] Automated machine learning- just give it a data file! Check out the production-ready version of this project at ClimbsRocks/auto_ml

Home Page: https://github.com/ClimbsRocks/auto_ml

JavaScript 49.30% Python 50.70%
machine-learning data-science javascript machine-learning-library machine-learning-algorithms ml data-scientists javascript-library scikit-learn kaggle

machinejs's Introduction

machineJS

A fully-featured default process for machine learning: all the parts are here, with functional default values in place. Modify to your heart's delight so you can focus on the important parts for your dataset, or run it all the way through with the default values for fully automated machine learning!

auto_ml - machineJS, but better!

I just built out v2 of this project; it gives you analytics info from your models and is production-ready. machineJS is an amazing research project that clearly proved there's a hunger for automated machine learning.

auto_ml tackles this exact same goal, but with more features, cleaner code, and the ability to be copy/pasted into production.

Check it out! https://github.com/ClimbsRocks/auto_ml

What is machineJS?

machineJS provides a fully automated framework for applying machine learning to a dataset.

All you have to do is give it a .csv file, with some basic information about each column in the first row, and it will go off and do all the machine learning for you!

If you've already done this kind of thing before, it's useful as an outline, putting in place a working structure for you to make modifications within, rather than having to build from scratch again every time.

machineJS will tell you:

  • Which algorithms are going to be most effective for this problem
  • Which features are most useful
  • Whether this problem is solvable by machine learning at all (useful if you're not sure you've collected enough data yet)
  • How effective machine learning can be with this problem, to compare against other potential solutions (like just taking a grouped average)

If you haven't done much (or any) machine learning before, it does some fairly advanced stuff for you!

Installation:

As a standalone directory (recommended)

If you want to install this in its own standalone repo, and work on the source code directly, then from the command line, type the following:

  1. git clone https://github.com/ClimbsRocks/machineJS.git
  2. cd machineJS
  3. npm install
  4. pip install -r requirements.txt
  5. git clone https://github.com/scikit-learn/scikit-learn.git
  6. cd scikit-learn
  7. python setup.py build
  8. sudo python setup.py install

From the command line

node machineJS.js path/to/trainData.csv --predict path/to/testData.csv

Format of Data Files:

We use the data-formatter module to automatically format your data, and even perform some basic feature engineering on it. Please refer to data-formatter's docs for information on how to label each column to be ready for machineJS.

How to customize/dive in deeper:

machineJS is designed to be super easy to use without diving into any of the internals. Be a conjurer- just give it data and let it run! That said, it's super powerful once you start customizing it.

It's designed to be relatively easy to modify, and well-documented. The obvious place to start is inside processArgs.js. Here we set nearly all the parameters that are used throughout the project.

The other obvious area many people will be interested in is adding in new models, and different hyperparameter search spaces. This can be found in the pySetup folder. The exact steps are listed in stepsToAddNewClassifier.txt.

What types of problems does this library work on?

machineJS works on both regression and categorical problems, as long as there is a single output column in the training data. This includes multi-category (frequently called multi-class) problems, where the category you are predicting is one of many possible categories. There are no immediate plans to support multiple output columns in the training data. If you have three output columns you're interested in predicting, and they cannot be combined into a single column in the training data, you could run machineJS once for each of those three columns.

This library is well-tested on Macs. I've designed it to work on PCs as well, but I haven't tested that at all yet. If you're a PC user, I'd love some issues or Pull Requests to make this work for PCs!

Note: This library is designed to run across all but one of the cores on the host machine. What this means for you:

  1. Please plug in.
  2. Close all programs and restart right before invoking (this will clear out as much RAM as possible).
  3. Expect some noise from your fan- you're finally putting your computer to use!
  4. Don't expect to be able to do anything intense while this is running. Internet browsing or code editing is fine, but watching a movie may get challenging.
  5. Please don't run any other Python scripts while this is running.

Thanks for inviting us along on your machine learning journey!

machinejs's People

Contributors

climbsrocks, jalehman, kuychaco, sunnmy


machinejs's Issues

training order

Almost all these ideas are slated for release 4.0. stay super focused on 2.0 and 3.0 first.

  1. nn and RF first, as these generalize well
    train the best nn until the rf finishes

only once these two have finished, start the next ones
don't try to start training them all in parallel. just two classifiers at a time.

create the ability for people to add in more over time. so they can start with just a nn and a rf, and then train more later that night.

consider modularizing this out into different repos entirely. have one repo for nns, one repo for RFs, etc. the master parent repo will then just aggregate all the child repos together intelligently.

modularize even further: have one python formatting repo, which all other python repos include.

i really want to make grid search for neural nets in JS available super broadly.

offer multiple training speeds: thorough, medium, fast, and super light. or just make it a 5-point scale. let people choose how much compute power they want to give to this.

make the saved .pkl and .json files super modular- if they get one from a colleague, they should be able to swap that out in the directory, invoke rePredict and reEnsemble, and be good to go.

if possible, it certainly never hurts to keep training the nn for longer :)

figure out how to pass in a kaggle test file

this will force me to deal with a number of unknowns like:

A. How do i make the bestNet publicly available?
B. do i have to make a server publicly available in order to have it continuously listening for prediction data?
C. how do i make sure the prediction data gets formatted the same way as our training data?
D. how do i clean up extraneous files created during this process? (make sure to create a --leaveGarbage flag to not destroy any files)

naming ideas

best-brain
automatic-brain
automatic-neural-network (i like this best so far)
automated-machine-learning

i'm torn between something obvious that will make it easy for people to find, and something with a bit more personality.

writing bestNet to file limitations

if it is the first training round, and the previous bestNet write was less than half a second ago, and the average bestNet writing velocity is above a certain threshold (averaging more than 5 nets in 3 seconds, say), don't write the new one to a file

then, at the end of the first training round, write the bestNet to a file.
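
a rough sketch of that throttling rule (the thresholds and the function name are placeholders, not the actual machineJS implementation):

import time

WRITE_COOLDOWN_SECS = 0.5        # "less than half a second ago"
MAX_WRITES_IN_WINDOW = 5         # more than 5 nets in 3 seconds counts as too fast
VELOCITY_WINDOW_SECS = 3.0

def should_write_best_net(is_first_round, recent_write_times, now=None):
    """Return True if the new bestNet should be written to a file."""
    now = time.time() if now is None else now
    if not is_first_round or not recent_write_times:
        return True
    too_soon = (now - recent_write_times[-1]) < WRITE_COOLDOWN_SECS
    writes_in_window = [t for t in recent_write_times if now - t <= VELOCITY_WINDOW_SECS]
    too_fast = len(writes_in_window) > MAX_WRITES_IN_WINDOW
    return not (too_soon and too_fast)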

refactor away from the use of the globals file

it had the issues i feared it would, but my tired brain decided to push ahead anyways and give it a shot.

i could refactor to have a globals object that gets passed around as an explicit input and output, but it might be cleaner to just create each variable individually to then get passed around as explicit arguments.

let the user pass in more information

it would probably be best to let them pass in more info in the first row. this info would label:

"ID"- identifies the id field
"Categorical Prediction" identifies an output field, and notes that the expected output is categorical
"Numerical Prediction" identifies an output field and notes that the expected output is numerical
"Categorical Field" notes that this field is categorical, even if we might otherwise assume it is a number.

prepare a 1.0 release

as soon as the docs are updated, make a 1.0 release.

2.0 will be once i've got more classifiers trained. this includes training the neural network for a considerably longer period of time.

3.0 will be once i've got more creative ensembling up and running.

4.0 will be generalizing it past just kaggle, and maybe giving an api to use from within javascript. this release might also include more control/advanced options for those who want it. but this has to be laser focused on 2.0 and 3.0 first.

5.0 will be revisiting the stats/ML part of this. frankly, the ensembling and grid search take care of most concerns here, but i'm sure there are ways we can do this more accurately.

training, testing, training

here's a fun one to try out at some way future point:

  1. train the algos on 80% of the dataset
  2. test them on 20%
  3. assemble the ensemble based on how they did on this 20%.
  4. this should, theoretically, test which ones generalize well and which do not. once we have figured this out, go back and train classifiers against the entire input data set, using those exact same parameters we already identified as training well. but keep the same ensembling method we already developed that we know should generalize well. in this way, we've tested for generalization, but trained on the entire dataset.

this seems like it could be fairly controversial. but just test it and see if it decreases accuracy or not. admittedly, that will be somewhat expensive in terms of development time if it doesn't work out, but it could be pretty useful if it does.
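
a minimal scikit-learn sketch of that flow, with placeholder data and a simple accuracy-weighted ensemble standing in for the real ensembler:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)   # placeholder data
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# steps 1-3: train on 80%, score on the held-out 20%, and turn those scores into ensemble weights
weights = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    weights[name] = accuracy_score(y_holdout, clf.predict(X_holdout))

# step 4: keep the ensembling weights, but refit the same configurations on the full dataset
for clf in candidates.values():
    clf.fit(X, y)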

write everything to a file

create a json that is a summary of everything.

ok, it will probably have to be a folder, and in that folder we'll have a .pkl of each of the classifiers, a json representing the ensembling.

give this the ability to just load up the classifiers that can use more training. i want people to turn this on at night right before they go to bed to put in more training iterations on things like the neural net.
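
one possible shape for that folder, sketched with joblib and hypothetical file names:

import json
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

out_dir = Path("trainedEnsemble")          # hypothetical folder name
out_dir.mkdir(exist_ok=True)

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)   # placeholder data
clf = LogisticRegression().fit(X, y)

# one .pkl per classifier, plus a json describing how they get ensembled
joblib.dump(clf, out_dir / "logistic_regression.pkl")
summary = {"classifiers": ["logistic_regression.pkl"], "ensemble": {"method": "average"}}
(out_dir / "ensemble.json").write_text(json.dumps(summary, indent=2))

# later (e.g. overnight), load a classifier back up to keep training or predicting
clf = joblib.load(out_dir / "logistic_regression.pkl")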

feature request: "does my newly engineered feature matter?"

let users quickly figure out if their new feature is highly predictive or not

it would just give directional guidance

it would go off and quickly train a lasso (probably not a rf, since they're unstable), and tell you how important your new feature is.

far from robust, but could be convenient

would likely rely on data-formatter under the hood so they can just write their results to a .csv file.
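
a rough sketch of that directional check in scikit-learn (placeholder data; the position of the new feature column is assumed):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(200, 6), np.random.rand(200)   # placeholder data
new_feature_index = 5            # hypothetical position of the freshly engineered column

X_scaled = StandardScaler().fit_transform(X)   # put coefficients on a comparable scale
lasso = LassoCV(cv=5).fit(X_scaled, y)

coef = abs(lasso.coef_[new_feature_index])
print("new feature coefficient:", coef, "(zero means the lasso dropped it)")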

break everything out into separate repos

  1. automated brain
  2. automated python classifier trainers
  3. automated ensembler
  4. a script to automatically upload the files to kaggle, get the results, and then write those results to the json object sitting in the first row (above the headers) of our predictions file in the predictions folder. then, if we want, we can invoke our automated ensembler module again with this new piece of information about how well it generalizes.

change trainingCallback frequency to be more frequent

right now you can pass in a --maxTrainingIterations value that is less than our trainingCallback frequency. So it will train for 10 iterations regardless, even if you said --maxTrainingIterations 3.

My concern with making this too frequent is that sending messages from parent to child might be resource intensive. see if that might be blocking, and if so, if there's a way around it.

train the best neural net for considerably longer

once we've figured out the params for the best neuralNet, set it off to train by itself for a while. this is similar to what we do with RFs, getting the params from gridSearch, then training a rf with those params but a ton of trees for a long time.
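
for the RF case, that pattern might look roughly like this in scikit-learn (the grid and the tree counts are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = np.random.rand(300, 8), np.random.randint(0, 2, 300)   # placeholder data

# cheap grid search with a small forest to find good hyperparameters...
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [4, 8, None], "max_features": ["sqrt", 0.5]},
    cv=3,
    n_jobs=-1,
).fit(X, y)

# ...then refit the same configuration with a ton of trees for the long run
final_rf = RandomForestClassifier(
    n_estimators=2000, random_state=0, n_jobs=-1, **search.best_params_
).fit(X, y)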

modularization ideas

  1. data-formatter: format a dataset for neural networks
  2. best-brain: grid search for brain.js. Just returns the most optimized neural network available- no predictions or anything. would not include the extra training time, but would include instructions on how to warmStart the brain returned to you so you could control the extra training time yourself.
  3. automated-brain: combining these two to automate the entire process of making predictions against a dataset using a neural network. would add in the making predictions part
  4. ensembler- takes in the prediction files from various other ml algos, and ensembles them together in creative ways
  5. python-data-formatter: gets data ready for machine learning in python's scikit-learn library (summarizes it so the user can easily spot errors, runs the same transformations against the combined training/testing data set so they're binarized/normalized/whateverized in the same way, imputes missing values, etc.). ideally this would be flexible enough to format it for different ml algos (maybe svms need normalization, while random forests don't)
  6. automated-machine-learning: run all the python classifiers, making predictions against the datasets and writing those to predictions files
  7. assembling all this together to make a single master predictions file at the end with all these ml algos

documentation- getting started

note what it takes to set up their dev environment

pip install: scikit learn, quite possibly other things like pandas

make sure python itself is installed. include instructions on how to do that, and super clear instructions that they should not have to if they are on a mac

is there a way of automatically installing pip dependencies?!

allow user to specify type of input field

this will be useful, particularly for things like date and time, which otherwise could look a lot like text.

i'm sure there are modules out there that will take an uncertain bit of text and convert it into a proper date object, no matter what format it was in to start with.

fix up the neural network; right now it's not doing too well

it quite likely has to do with how we are formatting the data coming in.

consider scaling things a bit more. see if maybe sklearn has a module for this? right now i have a feeling that the outliers are vastly compressing the differences between the more normally sized data points.
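
sklearn does have modules for this; for example, RobustScaler is less thrown off by outliers than plain standardization (toy data below, not machineJS's actual pipeline):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [1000.0]])   # one huge outlier

# StandardScaler lets the outlier compress the "normal" points toward each other...
print(StandardScaler().fit_transform(X).ravel())

# ...while RobustScaler centers on the median and scales by the IQR, so the
# ordinary points keep meaningful spread for the neural net
print(RobustScaler().fit_transform(X).ravel())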

handle categorical data while ensembling

i still like the idea of predicting probabilities, and then using those probabilities to make a more informed categorical prediction.

this is definitely easiest with binary data.

i'm not sure how python handles a category where they might be predicting many different labels.

but let's keep the mvp focus and just predict binary data for now.
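
a minimal sketch of that probability-averaging idea for the binary case (placeholder models and data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(300, 5), np.random.randint(0, 2, 300)   # placeholder data

models = [RandomForestClassifier(random_state=0).fit(X, y),
          LogisticRegression().fit(X, y)]

# average the predicted probability of the positive class, then threshold once
# at the end to get the final categorical prediction
avg_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
final_prediction = (avg_proba >= 0.5).astype(int)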

FUTURE: allow the user to load up a bestNet file

immediate use case: train it overnight for who knows how long, load it up in the morning to makeKagglePredictions.

future use case: let users train in one place and then pass around the results to another machine.

whenever possible, split classifiers, don't combine

for example, with random forests, we could use criterion='gini' or criterion='entropy'. when we had these combined, and had gridsearch choose which of the two was best, it doubled the size of the space that gridsearch had to process through, and gave us a classifier that placed 250th on the kaggle Give Credit competition.

when i broke those out to each be their own separate classifier (holding all the other hyperparameters to test the same, but now having two separate grid searches, one for entropy and one for gini), the training time was probably equivalent (we have cut the training space in half for each classifier, but doubled the number of classifiers), but we got way better results.

turns out that gini generalizes super well here, despite not scoring as well as entropy on gridsearch. gini by itself placed 133, entropy continued to score around 238 (slight improvement even, it seems), and the ensembler placed in between (164).

so we close-to-doubled our placing simply by breaking out classifiers, and i would not expect this to have any kind of a time penalty.
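
in scikit-learn terms, the split version looks roughly like this (placeholder data and grid, not the exact search spaces used here):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = np.random.rand(300, 8), np.random.randint(0, 2, 300)   # placeholder data
shared_grid = {"max_depth": [4, 8, None], "min_samples_leaf": [1, 5]}

# one independent grid search per criterion, instead of folding the criterion
# into a single doubled-up search space
searches = {}
for criterion in ("gini", "entropy"):
    rf = RandomForestClassifier(criterion=criterion, n_estimators=100, random_state=0)
    searches[criterion] = GridSearchCV(rf, shared_grid, cv=3, n_jobs=-1).fit(X, y)

# both tuned forests are kept as separate classifiers and ensembled downstream
gini_rf, entropy_rf = searches["gini"].best_estimator_, searches["entropy"].best_estimator_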

redo the data formatting pipeline for neural nets

i have a feeling that's part of what's driving such poor behavior for our nets. and right now it's rather inflexible, and likely more complicated than it has to be.

we should consider using python for this too. they have some good modules on feature scaling.

and it should definitely follow the pattern i'm about to implement in python of appending the test data to the end of the training data file for formatting purposes, so we know it is all handled the exact same way.
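
a pandas sketch of that append-format-split pattern (file and column names are hypothetical):

import pandas as pd

train = pd.read_csv("trainData.csv")
test = pd.read_csv("testData.csv")

# format the combined data once, so train and test are handled the exact same way
combined = pd.concat([train, test], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=["color"])   # example: binarize a hypothetical categorical column

train_formatted = combined.loc["train"]
test_formatted = combined.loc["test"]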

update documentation!

simplify it down

give a pretty thorough explanation of what it does- not how it does it
make the interface super, super simple.

break advanced options out into their own file. label that file clearly with the fact that everything here is still under super active development. but make the public api stable.

include a note about caffeine: http://lightheadsw.com/caffeine/

have the user pass in a flag for categorical or regression, instead of output

that will tell us what kind of output to look for.

make it work for both categorical and regression data.

right now we assume a single column of regression output. categorical could be difficult. continue to focus on a single column for now.

categorical could be difficult because it has so many columns potentially. write the classifier's name to each column.

have a single huge mapping obj that we write to json that maps from the net name, to the type, to the prediction file it created, to the file where it was persisted to disk, etc.
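
a hypothetical shape for that mapping object (all names and paths below are made up for illustration):

import json

model_map = {
    "rf_gini": {
        "type": "categorical",
        "predictionFile": "predictions/rf_gini.csv",
        "persistedModel": "pySetup/trainedModels/rf_gini.pkl",
    },
    "neural_net": {
        "type": "regression",
        "predictionFile": "predictions/neural_net.csv",
        "persistedModel": "pySetup/trainedModels/neural_net.json",
    },
}

with open("modelMap.json", "w") as f:
    json.dump(model_map, f, indent=2)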
