hdi-project / atm Goto Github PK

View Code? Open in Web Editor NEW

523.0 523.0 140.0 8.78 MB

Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).

Home Page: https://hdi-project.github.io/ATM/

License: MIT License

Python 96.91% Makefile 3.09%

automl data-science distributed-computing hyperparameter-optimization machine-learning

atm's People

Contributors

Stargazers

Watchers

Forkers

sthitaprajnas milliondreams o7s8r6 luotongml faviovazquez shaneopatrick jazzman37 cclauss stochasticresearch johanfrisk abhinavm24 rogerfitz redtrades komorihi alfredo-cuesta takabayashi wojohowitz00 praneeth-gummalla tom-hydrogen lauragustafson adripurkayastha cici-tan jaedukseo kimiliu1992 xflee dream-seeker devopsmi caidongyun leaderyangzi kerou nidecai luciany shivasj sczhaoqi batermj csala paulrich1234 iampatgrady zhouyonglong bigrlab cyinseu ituco allensmile fendaq zssasa wuqixiaobai veronicahs jobliz wangqianwen0418 essobi nginyc kioco hangjie720 nanaakwasiabayieboateng nkamsteve chengguobiao tony32769 limberc rencx rogersf dlts85 mansourfall feitianyiren wxrui zenghanfu we1l1n shyampandey2895 pvk-developer mikewlange kajjjak iprobe-lab drroad beevabeeva aledala yishuihanhan pivotsecurity liwei-cn averroes jdeguia ajoeajoe singh8477 autumnn hwunlams zhanpengjie reference-project maximlf forkdump septumcapital zwcdp adin-alihodzic kangking2019 kventures chrinide eugenesavenko hdony nurulc jigao19 phymucs maduhu oustandingman

atm's Issues

Update fabfile.py to fix AWS compatibility

The fabfile is currently out of sync with the rest of the codebase, making it impossible to automatically launch an ATM cluster on AWS. This needs to be fixed.

Metrics should be saved as JSON (not pickled)

Currently, the metrics dict, which is generated by Model.train_test(), is saved as a pickled python object. This is unnecessary, because the object is entirely composed of python dicts, lists, and numeric values. Saving metrics in json files instead would make it easier to eyeball the results and analyze them with other software.

This is just a matter of changing the save_metric() function in atm/utilities.py to use json.dump() instead of pickle.dump().

Implement proper logging

Currently, there are print() statements scattered around the project, and worker.py has a simple custom _log function which prints information to stdout and a log file simultaneously. We should aim to get rid of print statements altogether and replace them with calls to python's logging module, so that output to log files and stdout is handled in a more robust way. This will make it more practical for users to run ATM in the background or to call parts of it from other programs.

If there is a third-party logging library that would do the job better, I'm open to using that as well.

update the field names in the modelhub database

These are simple changes like error_msg --> error_message, model_path --> model_location, to more meaningful changes like tunables --> conditional_hyperparameters.

methods config should be configurable

Line 7 of methods.py has the hardcoded value:
CONFIG_PATH = 'methods'

This should be moved to a configuration parameter by run_config.yaml

Metrics file printing wrong filename to log

Line 235 should be changed from:
_log("Saving metrics in: %s" % local_model_path)

_log("Saving metrics in: %s" % local_metric_path)

Bug in concatenating dataframes

Hello,
I believe there is a bug in line 207 of atm/model.py The call to concat should be as follows:

all_data = pd.concat([train_data, test_data])

See here: https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.concat.html?highlight=concat#pandas.concat

I would submit a pull request but I have other changes and don't want to complicate things.

Install failing

Seems that the install is conflicting with anaconda. Anyone found a work around?

ahmets-MBP-782:atm ahmet$ virtualenv venv Using base prefix '/Users/ahmet/anaconda' New python executable in /Users/ahmet/Documents/GitHub/atm/venv/bin/python dyld: Library not loaded: @rpath/libpython3.6m.dylib Referenced from: /Users/ahmet/Documents/GitHub/atm/venv/bin/python Reason: image not found ERROR: The executable /Users/ahmet/Documents/GitHub/atm/venv/bin/python is not functioning ERROR: It thinks sys.prefix is '/Users/ahmet/Documents/GitHub/atm' (should be '/Users/ahmet/Documents/GitHub/atm/venv') ERROR: virtualenv is not compatible with this system or executable

ATM should be compatible with Python 2 and 3

Thanks to @cclauss in #14 for bringing this up. We definitely want to future-proof ATM and make it as widely available as possible.

This is mostly a matter of sitting down and doing it. Most of it should be easy with http://python-future.org/automatic_conversion.html. I think the biggest decision to make is whether to use unicode_literals or not.

I am for it. Since the project is new and volatile, it shouldn't matter too much whether we have to change the existing API, and I don't think there will be any major changes anyway. unicode_literals will result in cleaner code, and will make it easier to reason about strings in the future.

I've started doing a test-run of futurize in BTB, since it's a much smaller project. Once that's done, I'll start going through file-by-file in ATM and doing the same. Feel free to jump in and contribute!

Support for regressions

Hello, If I understood correctly currently ATM solves only classification tasks. I wonder if there is a plan for adding support for regression problems. Thanks

Separating repeated processing from classifier models

In between different runs of the ATM, the outputs of all the steps of the pipeline are "static," except for the input and output to the classifier that is chosen by BTB. What I mean by this is, for example, suppose PCA is in the pipeline, then every time ATM/BTB chooses a new model to run, it will recompute the PCA for the same dataset. Unless I'm misunderstanding the flow of data, this seems inefficient. Although the current pipeline is pretty simple (scaling/PCA), there could be more computationally intensive elements to the pipeline that people may want to add.

We can separate the pipeline into two pipelines, one that is "static" and the outputs stored somewhere to disk such that it can be recalled between runs, and a "dynamic" which is essentially the classifier, and any blocks which change based on the ATM/BTB model being run.

If you think this is a good idea, how do we want to go about architecting this from a software perspective? One approach is to compute the static pipeline before the test_classifier method is run and save that to the data directory where the train/test dataset is being saved.

GaussianProcessClassifier errors with "N-th leading minor is not positive definite"

Appears to only happen when kernel == 'exp_sine_squared'. Does not happen every time. More investigation needed.

Error testing classifier: datarun=<ID = 24, dataset ID = 10, strategy = gp__bestk, budget = classifier (100), status: running>
Traceback (most recent call last):
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
    model, performance = self.test_classifier(hyperpartition.method, params)
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
    test_path=test_path)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
    cv_scores = self.cross_validate(X_train, y_train)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
    n_folds=self.N_FOLDS)
  File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
    pipeline.fit(X[train_index], y[train_index])
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 610, in fit
    self.base_estimator_.fit(X, y)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/multiclass.py", line 216, in fit
    for i, column in enumerate(columns))
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 208, in fit
    self.kernel_.bounds)]
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 426, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 193, in fmin_l_bfgs_b
    **opts)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 328, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 278, in func_and_grad
    f = fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 200, in obj_func
    theta, eval_gradient=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 344, in log_marginal_likelihood
    self._posterior_mode(K, return_temporaries=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 397, in _posterior_mode
    L = cholesky(B, lower=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 81, in cholesky
    check_finite=check_finite)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 30, in _cholesky
    raise LinAlgError("%d-th leading minor not positive definite" % info)
LinAlgError: 31-th leading minor not positive definite

Encode foreign-key database relationships with the SQLAlchemy ORM

Right now, foreign-key relationships in the ModelHub database are not reflected with SQLAlchemy's ORM. Doing so like this will make it easier to reference attributes of mapped objects. For example, the following code:

def foo(classifier_id):
    classifier = db.get_classifier(classifier_id)
    datarun = db.get_datarun(classifier.datarun_id)
    dataset = db.get_dataset(datarun.dataset_id)
    ...

could be simplified to this:

def foo(classifier_id):
    dataset = db.get_classifier(classifier_id).dataset
    ...

Overall, it will make things cleaner and easier to maintain.

is it possible to add more methods like RNN or CNN from keras?

Hi all,

Just wonder is it possible if I want to use methods from keras to build CNN? Since there are all basic classifiers from sklearn in the methods for now. And does it accept 3d input?

Add in-memory database option

In general, we would like to better support the use of ATM as a normal library. It should be just as convenient to use pieces of ATM in another python project as it is to use the system from the command line.

To that end, we should add a database option that is purely in-memory -- and does not leave a file system footprint unless requested. For example, we could use pandas DataFrames in the back-end instead of tables in a SQL database while maintaining the same interface for the Database class.

Change btb evaluation to include multiple trials

Change evaluate_btb.py to include multiple trials of tuner. This allows for a better evaluation as to whether a specific tuner actually leads to a performance increase. The results will be compared to the best so far in terms of mean over the trials, and standard deviation.

Change btb_test to evaluate

Change name of btb_test.py to evaluate_btb.py. Here we are evaluating the performance of different tuners, opposed to testing their functionality.

Add command line help to atm/enter_data.py

Running python atm/enter_data.py -h (or --help) should print out available commands, as is widely used convention.

Adding custom blocks to pipeline

It would be nice to be able to add additional blocks to the ML pipeline, both static and dynamic (see #70 ).

Gaussian Process classifier "nu" parameter should be a float

The nu hyperparameter (defined in methods/gaussian_process.json) is currently a string categorical variable; this causes the following error whenever the matern kernel is selected:

Traceback (most recent call last):
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
    model, performance = self.test_classifier(hyperpartition.method, params)
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
    test_path=test_path)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
    cv_scores = self.cross_validate(X_train, y_train)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
    n_folds=self.N_FOLDS)
  File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
    pipeline.fit(X[train_index], y[train_index])
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 610, in fit
    self.base_estimator_.fit(X, y)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 208, in fit
    self.kernel_.bounds)]
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 426, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 193, in fmin_l_bfgs_b
    **opts)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 328, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 278, in func_and_grad
    f = fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 200, in obj_func
Chose parameters for method knn:
    theta, eval_gradient=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 337, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1337, in __call__
    tmp = (math.sqrt(2 * self.nu) * K)
TypeError: a float is required

nu is passed to the constructor for a Matern kernel, which is then passed to the GaussianProcessClassifier constructor. According to the docs, nu should be a float that defaults to 1.5. I'm not sure whether the current configuration was ever correct, but it's not correct as of sklearn 0.18.

Someone should figure out what the proper range of values for nu is, and update the json to reflect that.

Include Matplotlib in 'requirements.txt'

It would be nice that matplotlib module was included in requirements.txt too.
Sometimes it is useful in debugging tasks.

Bugs in Logging

Hello,
I believe that the logging configurations are not being passed through, resulting in models and metrics being persisted in the default directories rather than the ones specified in log.yaml.

I think the following needs to be added to atm/config.py
log_path = log_path or kwargs.get('log_config') (on line 524)

Additionally, the caller of the load_config function overwrites the log file configuration for what goes to STDOUT and the logs, which means that even if the user specifies the log-level to be info for stdout, they still need to add the command line argument --verbose-metrics

Dockerisation and parallelisations

Dockerisation of this project would be great. Makes it easy to install and deploy :) It would also be great to incorporate Celery in so that the jobs can be dispatched to different worker instances.

All pull requests' circle-ci tests will fail

All pull requests will fail the circle-ci tests due to lint errors on the original codebase ... see #75

How do you interpret the results across and within models?

HI all, the log file will produce a line like below showing the best classifier and then you can find out the filename for the learner (provided you made a log of the run), but how do you read the model file? The manual doesn't seem to cover this either.

Examining results across all models (e.g., by a plot) would also be helpful.

nohup python atm/worker.py 2>&1 | tee Output.txt & # log the run

...
Saving model in: models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model
Saving metrics in: models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model
Saved classifier 21.
...
Best so far (learner 21): 0.716 +- 0.035

naming of the hyperparameters for the methods

To match our explanation of the hyperparameters and their types in the ATM paper, we could:

parameters→ hyperparameters
root_parameters→ root_hyperparameters

There is a bit more of categorization required in terms of hyperparameters. Will follow up on this thread with more.

Wrong keywords into ML models

Hello, I am trying to test run your classifiers on our data, and am getting some errors when the system tries various classifiers. The relevant portions of the error messages are pasted below:

Chose` parameters for method dt:
	n_jobs = -1
	min_samples_leaf = 1
	n_estimators = 100
	criterion = entropy
	max_features = 0.950735797858
	max_depth = 6
TypeError: __init__() got an unexpected keyword argument 'n_jobs'

Chose parameters for method dt:
	C = 0.00359684119303
	tol = 0.000357435603328
	fit_intercept = True
	penalty = l2
	_scale = True
	dual = False
	class_weight = auto
TypeError: __init__() got an unexpected keyword argument 'C'

Chose parameters for method logreg:
	n_jobs = -1
	min_samples_leaf = 1
	n_estimators = 100
	criterion = gini
	max_features = 0.218919710352
	max_depth = 7
TypeError: __init__() got an unexpected keyword argument 'min_samples_leaf'

Based on these errors, it seems that the hyperparameters input for the scikit-learn's DecisionTree model is being mixed up with the hyperparameters input for scikit-learn's LogisticRegression model. For example, LogisticRegression does not have a "min_samples_leaf" hyperparameter. Similarly, DecisionTreeClassifier does not have C or n_jobs as hyperparameters. Digging around, the methods/decision_tree.json and methods/logistic_regression.json files seem correct .. so I'm not sure why this is getting mixed up.

I get similar issues when running against the example provided in the readme. Here is a copy/paste of the entire error message

Selector: <class 'btb.selection.uniform.Uniform'>
Tuner: <class 'btb.tuning.uniform.Uniform'>
Choosing hyperparameters...
Chose parameters for method knn:
	C = 0.000128015603097
	tol = 0.000148636727508
	fit_intercept = True
	penalty = l2
	_scale = True
	dual = True
	class_weight = auto
Creating classifier...
Testing classifier...
Error testing classifier: datarun=<ID = 5, dataset ID = 5, strategy = uniform__uniform, budget = classifier (100), status: running>
Traceback (most recent call last):
  File "atm/worker.py", line 440, in run_classifier
    model, performance = self.test_classifier(classifier_id, params)
  File "atm/worker.py", line 374, in test_classifier
    performance = wrapper.start()
  File "/home/kkarra/atm/atm/wrapper.py", line 97, in start
    self.make_pipeline()
  File "/home/kkarra/atm/atm/wrapper.py", line 383, in make_pipeline
    classifier = self.class_(**classifier_params)
  File "/home/kkarra/atm/venv/local/lib/python2.7/site-packages/sklearn/neighbors/classification.py", line 126, in __init__
    metric_params=metric_params, n_jobs=n_jobs, **kwargs)
TypeError: _init_params() got an unexpected keyword argument 'C'

Here, it seems that the KNN model is getting the wrong keywords. I'm not sure why model's are not being optimized with appropriate keywords. I'm wondering if I should dig further to ensure that the selected model chooses the correct keywords, or if this an identified bug already in the course of porting from old environment to new?

MySQL setup not working

Hi all, any ideas re: the error message below?

Collecting mysql-python==1.2.5 (from -r requirements.txt (line 9))
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 112kB 9.6MB/s 
    Complete output from command python setup.py egg_info:
    sh: 1: mysql_config: not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wg4Zpa/mysql-python/setup.py", line 17, in <module>
        metadata, options = get_config()
      File "setup_posix.py", line 43, in get_config
        libs = mysql_config("libs_r")
      File "setup_posix.py", line 25, in mysql_config
        raise EnvironmentError("%s not found" % (mysql_config.path,))
    EnvironmentError: mysql_config not found
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wg4Zpa/mysql-python/

Scientific notation in method json definitions is defined wrong

Pretty much all exponential hyperparameter ranges are defined as something like

"range": [10e-5, 10e5]

This is wrong, and it's my fault for misunderstanding how scientific notation works in Python. When I translated all the old enumeration classes to the new json, I made the mistake of turning 10**3 into 10e3 across the board. See e.g. https://github.com/HDI-Project/ATM/blob/50b592dd6a151a75470fb4120c1781ca7249d43f/atm/enumeration/classification/svm.py for how it was before. This is a quick fix that I'll push in a few minutes.

We need more docs!

The documentation hasn't been updated for a few months, and does not reflect the state of the project. I'm removing the docs/ folder from the master repo for now, and pushing it into the branch 'bcyphers/docs' (#80) to avoid confusing newcomers to the project. The folder will be added back to master once it is closer to being ready.

Avoid creating redundant datasets

If enter_data() is called with the same train_path twice in a row and the data itself hasn't changed, a new Dataset does not need to be created.

We should add a column which stores some kind of hash of the actual data. When a Dataset would be created, if the metadata and data hash are exactly the same as an existing Dataset, nothing should be added to the ModelHub database and the existing Dataset should be returned instead.

Passing arguments for computation of ROC curves

When the performance of a classifier is evaluated, the chain of function calls is as follows:

worker.py::self.test_classifer --> model.train_test --> model.test_final_model --> metrics.py::test_pipeline --> metrics.py::get_metrics

test_pipeline has a kwargs argument to allow for include_curves to be set, however, no kwargs are passed through the functions above the chain, so include_curves is always defaulted to False in the code. How do we want to address this? Should there be a command line argument (or configuration in run_config.yaml) which allows the user to decide if the roc_curves should be computed, and if so, passed through the system?

Installation Should Specify Python Version

Added a python=2.7 line in #32 but not sure how you want to style the information.
Python3 doesn't work because of the prints

Make matplotlib a conditional import

Tests fail because utilities.py is imported and matplotlib is in turn imported. Doesn't make any sense to have plotting as a hard dependency, but you can make it a conditional import in utilities.py

try:
    import matplotlib.pyplot as plt
except ImportError:
    plt = None

# then, in graph_series
if plt is None:
    raise ImportError("Unable to import matplotlib")

Nolearn DBN is out of date

nolearn.dbn, the library that ATM uses for its Deep Belief Network (DBN) classifier, is no longer supported. From the home page:

The nolearn.dbn module is no longer supported. Take a look at nolearn.lasagne for a more modern neural net toolkit.

We should upgrade to lasagne ASAP.

Possibly unrelated, but sometimes the DBN classifier will fail with an error like this:

Traceback (most recent call last):
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
    model, performance = self.test_classifier(hyperpartition.method, params)
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
    test_path=test_path)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
    cv_scores = self.cross_validate(X_train, y_train)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
    n_folds=self.N_FOLDS)
  File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
    pipeline.fit(X[train_index], y[train_index])
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/nolearn/dbn.py", line 407, in fit
    self.use_dropout,
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 202, in fineTune
    err, outMB = step(inpMB, targMB, self.learnRates, self.momentum, self.L2Costs, useDropout)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 303, in stepNesterov
    errSignals, outputActs, error = self.fpropBprop(inputBatch, targetBatch, useDropout)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 262, in fpropBprop
    outputErrSignal = -self.outputActFunct.dErrordNetInput(targetBatch, self.state[-1], outputActs)
AttributeError: 'Tanh' object has no attribute 'dErrordNetInput'

Hopefully the upgrade will kill two birds with one stone.

REST API

A REST API would be useful to access ATM from any other language and device through the web, and also to illustrate how the code is structured and might be extended.

From the project's readme I get that the internal API's are going to change and that it consequently might be a bit early to develop a REST API, but I wanted to see what was possible to do with the current code. The API currently serves:

Various GET endpoints for reading data from the four entities that are currently present in the database
1 GET endpoint to run the worker.py script inside the virtualenv as a subprocess and retrieve it's stdout and stderr
1 POST endpoint to send a .csv file with the HTTP request, save the file to the atm/data directory and run enter_data on it.

No modifications were made outside of the rest_api_server.py file except to the requirements.txt file, adding flask and simplejson to the project dependencies.

TODO's / caveats:

No AWS integration
api.py currently does not check if the uploaded filename is already present, so a CSV upload can rewrite a previously sent file with the same name. This needs fixing, but I thought that I should ask first if the atm/data directory is the right one to put new files in, and if storing UUID's is ok with the project's design before using them and doing a pull request.

Example api.py usage:

After following the readme's installation instructions and running python scripts/rest_api_server.py on a separate shell under the virtualenv:

curl localhost:5000/enter_data -F file=@/path/file.csv

It should return:

{"success": true}

To see the created dataset:

curl localhost:5000/datasets/1 | jq

{
  "class_column": "class",
  "description": null,
  "size_kb": 6,
  "test_path": null,
  "k_classes": 3,
  "majority": 0.333333333,
  "d_features": 6,
  "train_path": "FULLPATH/file.csv",
  "id": 1,
  "n_examples": 150,
  "name": "file"
}

To see the created datarun:

curl localhost:5000/dataruns/1 | jq

{
  "status": "pending",
  "start_time": null,
  "description": "uniform__uniform",
  "r_minimum": 2,
  "metric": "f1",
  "budget": 100,
  "selector": "uniform",
  "priority": 1,
  "score_target": "cv_judgment_metric",
  "deadline": null,
  "budget_type": "classifier",
  "id": 1,
  "tuner": "uniform",
  "dataset_id": 1,
  "gridding": 0,
  "k_window": 3,
  "end_time": null
}

To run the worker.py script once:

curl localhost:5000/simple_worker | jq

after a while

{
  "stderr": "",
  "stdout": "huge stdout string with worker.py's output"
}

atm/enter_data.py is failing: ImportError: No module named boto.s3.connection

Got an issue when executing enter_data.py

atm) ahmets-MBP-782:atm ahmet$ python atm/enter_data.py Traceback (most recent call last): File "atm/enter_data.py", line 7, in <module> from boto.s3.connection import S3Connection, Key as S3Key ImportError: No module named boto.s3.connection

atm importing issue

When I run atm/enter_data.py, it pops up this error:
ImportError: No module named atm.config

Add minimal unit tests

Right now there are just a couple of actual tests in the test/ folder, and they leave huge portions of the code untouched.

Eventually, we will need a suite of unit tests and comprehensive integration tests that reasonably convince us that a new change won't break anything. We'll also want to integrate with CircleCI and github so that we can evaluate pull requests from the web (but that will come later).

I'll try to update this issue with our progress as time goes on, but a good start would be unit tests for:

each hyperpartition for each classification method
each database.py create/query/update function
hyperpartition enumeration
hyperpartition selection/parameter tuning
metric computation in metrics.py
data loading and encoding with a variety of quirky data types
serialization/deserialization of models/metrics objects

Make model and metric file names more friendly

There is no way we need each file name to be 90 characters of gibberish. We should, at the very least, shorten the hash, and probably try to make them more human-readable.

Make paths relative to project root

Currently, lots of code for loading config or data depends on the initiating script running from the project root. We can fix this by defining everything relative to the location of a python file instead of relative to ./. This will make everything more predictable and less brittle.

Error happened when using model.predict(X) function

I trained a few models, and then picked the best one from model directory, was trying to predict new data, then error came up.

Traceback (most recent call last):
  File "test_best_model.py", line 21, in <module>
    preds = best_model.predict(X)
  File "/medical_data/Datasets/ATM/atm/model.py", line 209, in predict
    X, _ = self.encoder.transform(data)
  File "/medical_data/Datasets/ATM/atm/encoder.py", line 101, in transform
    features = data[self.feature_columns]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I use pickle to load the model:

with open(classfier_p, 'rb') as f:
        best_model= pickle.load(f)

Allow custom evaluation metrics

Right now, it's only possible to configure a datarun to use methods and metrics that are included with the library. It should be possible to pass JSON files for custom machine-learning methods and Python files/functions for custom metrics. This can be implemented in much the same way that custom tuners/selectors for BTB are handled.

Create `explorer.py` to help explore results of Dataruns

Should be a file that has helper methods for loading and testing previously-generated models. This should include some graphing and data visualization utilities.

Not sure about the best way to make the interface -- it could be a command-line tool like enter_data and worker, or it could be a Python REPL-style interface directly to the functions.

Remove Methods table from database

At this point, I think the "methods" table in the database is vestigial, and no longer serves a purpose. We should get rid of it to reduce clutter.

Judgement Metric not passed through

Line 341 of atm/worker.py does not pass the score target, so regardless of what the user chooses in run_config.yaml, the code determines the best classifier based on the mu_sigma computation. It seems there is some inconsistency in nomenclature between the mu_sigma and the cv options.

Was the original intent to use mu_sigma to correspond to the highest lower error bound, cv to correspond to the average cv score, and test to correspond to average test score? mu_sigma is not in the keys of the database, so its not a valid run_config.yaml value. If we take cv to mean the same thing as mu_sigma, then one approach to deal with this may be (starting at Line 341 of atm/worker.py):

        if('cv' in self.datarun.score_target):
            score_target_in = 'mu_sigma'
        else:
            score_target_in = self.datarun.score_target
        old_best = self.db.get_best_classifier(datarun_id=self.datarun.id,
                                               score_target=score_target_in)

        cur_cv_val       = model.cv_judgment_metric
        cur_cv_err       = model.cv_judgment_metric_stdev
        cur_test_val     = model.test_judgment_metric
        
        if old_best is not None:
            old_cv_val   = old_best.cv_judgment_metric
            old_cv_err   = 2*old_best.cv_judgment_metric_stdev
            old_test_val = old_best.test_judgment_metric
        if('cv' in self.datarun.score_target):
            _log('Judgment metric (%s): %.3f +- %.3f' %
                 (self.datarun.metric, cur_cv_val, cur_cv_err))
            if old_best is not None:
                if (cur_cv_val - cur_cv_err) > (old_cv_val - old_cv_err):
                    _log('New best score! Previous best (classifier %s): %.3f +- %.3f' %
                         (old_best.id, old_cv_val, old_cv_err))
                else:
                    _log('Best so far (classifier %s): %.3f +- %.3f' %
                         (old_best.id, old_cv_val, old_cv_err))
        else:
            _log('Judgment metric (%s): %.3f' %
                 (self.datarun.metric, cur_test_val))
            if old_best is not None:
                if (cur_test_val) > (old_test_val):
                    _log('New best score! Previous best (classifier %s): %.3f' %
                         (old_best.id, old_test_val))
                else:
                    _log('Best so far (classifier %s): %.3f' %
                         (old_best.id, old_test_val))

Add PyYAML dependency to requirements.txt

Not installed by default in new virtual environments such as conda python 2.7

Add database command to remove dataruns

Right now, if you create a datarun with a typo or just decide you don't want to run it, there's no simple way to remove it from the database. We should add it as a subcommand to enter_data.py. Maybe:

python enter_data.py remove --datarun 1

likewise,

python enter_data.py remove --dataset 1

Allow testing in other environments

I'm guessing that the tester in their environment has populated the files config/test/btb/aws.yaml etc. and data/car_1.csv etc.

As it currently stands, it is not possible for someone who just clones the package to run tests and assure themselves the software is installed correctly/works correctly.

Consider overriding the .gitignore to commit the needed yaml and csv files for the test/ directory only? I imagine that the AWS tests would be skipped by default. Also, at the least, there could be unit tests for some of the software that are not dependent on data and configs.

(Somewhat unrelated to this issue, I would have thought that test/method_test.py does unit tests, and was going to recommend the more conventional python -m pytest invocation. But it seems that that file is very similar to test/end_to_end_test.py, doing end-to-end tests as well. Fairly confused by that.)

Got error while running 'python atm/enter_data.py'

Hi all,
As a beginer of ATM,
I have below error while running 'python atm/enter_data.py',
could you tell me how can I deal with this error?

Thank you!

=======================================
(myPy27) ubuntu@ip-172-31-17-79:~/workspace/git/atm$ python atm/enter_data.py
Traceback (most recent call last):
File "atm/enter_data.py", line 8, in
from .config import *
ValueError: Attempted relative import in non-package