hdi-project / atm
Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).
Home Page: https://hdi-project.github.io/ATM/
License: MIT License
The fabfile is currently out of sync with the rest of the codebase, making it impossible to automatically launch an ATM cluster on AWS. This needs to be fixed.
Currently, the metrics dict, which is generated by Model.train_test(), is saved as a pickled python object. This is unnecessary, because the object is entirely composed of python dicts, lists, and numeric values. Saving metrics in json files instead would make it easier to eyeball the results and analyze them with other software.
This is just a matter of changing the save_metric() function in atm/utilities.py to use json.dump() instead of pickle.dump().
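For reference, a minimal sketch of the change, assuming save_metric() takes the metrics dict and a destination path (the real signature may differ):

import json

def save_metric(metric_path, metrics):
    # metrics is composed entirely of dicts, lists, and numeric values,
    # so it serializes cleanly; indent=2 makes the files easy to eyeball
    with open(metric_path, 'w') as f:
        json.dump(metrics, f, indent=2)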
Currently, there are print() statements scattered around the project, and worker.py has a simple custom _log function which prints information to stdout and a log file simultaneously. We should aim to get rid of print statements altogether and replace them with calls to python's logging module, so that output to log files and stdout is handled in a more robust way. This will make it more practical for users to run ATM in the background or to call parts of it from other programs.
If there is a third-party logging library that would do the job better, I'm open to using that as well.
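For reference, a minimal sketch of what the stdlib setup could look like (the logger name and handler choices here are assumptions):

import logging

logger = logging.getLogger('atm.worker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())          # console (stderr by default)
logger.addHandler(logging.FileHandler('atm.log'))   # log file

logger.info('worker started')  # replaces print() / _log()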
These range from simple renames like error_msg --> error_message and model_path --> model_location to more meaningful changes like tunables --> conditional_hyperparameters.
Line 7 of methods.py has the hardcoded value:
CONFIG_PATH = 'methods'
This should be moved to a configuration parameter set via run_config.yaml.
Line 235 should be changed from:
_log("Saving metrics in: %s" % local_model_path)
to
_log("Saving metrics in: %s" % local_metric_path)
Hello,
I believe there is a bug in line 207 of atm/model.py. The call to concat should be as follows:
all_data = pd.concat([train_data, test_data])
I would submit a pull request but I have other changes and don't want to complicate things.
Seems that the install is conflicting with anaconda. Has anyone found a workaround?
ahmets-MBP-782:atm ahmet$ virtualenv venv
Using base prefix '/Users/ahmet/anaconda'
New python executable in /Users/ahmet/Documents/GitHub/atm/venv/bin/python
dyld: Library not loaded: @rpath/libpython3.6m.dylib
  Referenced from: /Users/ahmet/Documents/GitHub/atm/venv/bin/python
  Reason: image not found
ERROR: The executable /Users/ahmet/Documents/GitHub/atm/venv/bin/python is not functioning
ERROR: It thinks sys.prefix is '/Users/ahmet/Documents/GitHub/atm' (should be '/Users/ahmet/Documents/GitHub/atm/venv')
ERROR: virtualenv is not compatible with this system or executable
Thanks to @cclauss in #14 for bringing this up. We definitely want to future-proof ATM and make it as widely available as possible.
This is mostly a matter of sitting down and doing it. Most of it should be easy with http://python-future.org/automatic_conversion.html. I think the biggest decision to make is whether to use unicode_literals or not.
I am for it. Since the project is new and volatile, it shouldn't matter too much whether we have to change the existing API, and I don't think there will be any major changes anyway. unicode_literals will result in cleaner code, and will make it easier to reason about strings in the future.
I've started doing a test-run of futurize in BTB, since it's a much smaller project. Once that's done, I'll start going through ATM file-by-file and doing the same. Feel free to jump in and contribute!
Hello, if I understood correctly, ATM currently solves only classification tasks. I wonder if there is a plan to add support for regression problems. Thanks
In between different runs of the ATM, the outputs of all the steps of the pipeline are "static," except for the input and output to the classifier that is chosen by BTB. What I mean by this is, for example, suppose PCA is in the pipeline, then every time ATM/BTB chooses a new model to run, it will recompute the PCA for the same dataset. Unless I'm misunderstanding the flow of data, this seems inefficient. Although the current pipeline is pretty simple (scaling/PCA), there could be more computationally intensive elements to the pipeline that people may want to add.
We can separate the pipeline into two pipelines, one that is "static" and the outputs stored somewhere to disk such that it can be recalled between runs, and a "dynamic" which is essentially the classifier, and any blocks which change based on the ATM/BTB model being run.
If you think this is a good idea, how do we want to go about architecting this from a software perspective? One approach is to compute the static pipeline before the test_classifier method is run and save the result to the data directory where the train/test dataset is being saved.
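A rough sketch of what that could look like, with hypothetical names (using PCA as the example static step):

import os
import pickle

from sklearn.decomposition import PCA

def get_static_features(X, cache_path):
    # if the static pipeline output was already computed for this
    # dataset, load it from disk instead of recomputing it
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    X_static = PCA(n_components=10).fit_transform(X)
    with open(cache_path, 'wb') as f:
        pickle.dump(X_static, f)
    return X_static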
Appears to only happen when kernel == 'exp_sine_squared'. Does not happen every time. More investigation needed.
Error testing classifier: datarun=<ID = 24, dataset ID = 10, strategy = gp__bestk, budget = classifier (100), status: running>
Traceback (most recent call last):
File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
model, performance = self.test_classifier(hyperpartition.method, params)
File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
test_path=test_path)
File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
cv_scores = self.cross_validate(X_train, y_train)
File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
n_folds=self.N_FOLDS)
File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
pipeline.fit(X[train_index], y[train_index])
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
self._final_estimator.fit(Xt, y, **fit_params)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 610, in fit
self.base_estimator_.fit(X, y)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/multiclass.py", line 216, in fit
for i, column in enumerate(columns))
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
self.results = batch()
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/multiclass.py", line 80, in _fit_binary
estimator.fit(X, y)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 208, in fit
self.kernel_.bounds)]
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 426, in _constrained_optimization
fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 193, in fmin_l_bfgs_b
**opts)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 328, in _minimize_lbfgsb
f, g = func_and_grad(x)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 278, in func_and_grad
f = fun(x, *args)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
return function(*(wrapper_args + args))
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
fg = self.fun(x, *args)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 200, in obj_func
theta, eval_gradient=True)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 344, in log_marginal_likelihood
self._posterior_mode(K, return_temporaries=True)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 397, in _posterior_mode
L = cholesky(B, lower=True)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 81, in cholesky
check_finite=check_finite)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 30, in _cholesky
raise LinAlgError("%d-th leading minor not positive definite" % info)
LinAlgError: 31-th leading minor not positive definite
Right now, foreign-key relationships in the ModelHub database are not reflected with SQLAlchemy's ORM. Mapping them with relationship() will make it easier to reference attributes of mapped objects. For example, the following code:
def foo(classifier_id):
    classifier = db.get_classifier(classifier_id)
    datarun = db.get_datarun(classifier.datarun_id)
    dataset = db.get_dataset(datarun.dataset_id)
    ...
could be simplified to this:
def foo(classifier_id):
    dataset = db.get_classifier(classifier_id).dataset
    ...
Overall, it will make things cleaner and easier to maintain.
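For illustration, a stripped-down sketch of the mappings (columns abbreviated; the real models live in atm/database.py):

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Dataset(Base):
    __tablename__ = 'datasets'
    id = Column(Integer, primary_key=True)

class Datarun(Base):
    __tablename__ = 'dataruns'
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey('datasets.id'))
    dataset = relationship('Dataset', backref='dataruns')

class Classifier(Base):
    __tablename__ = 'classifiers'
    id = Column(Integer, primary_key=True)
    datarun_id = Column(Integer, ForeignKey('dataruns.id'))
    datarun = relationship('Datarun', backref='classifiers')

# db.get_classifier(classifier_id).datarun.dataset then resolves
# through the relationships without extra queries in user code.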
Hi all,
Just wondering, is it possible to use methods from keras to build a CNN? The methods for now only include basic classifiers from sklearn. And does it accept 3d input?
In general, we would like to better support the use of ATM as a normal library. It should be just as convenient to use pieces of ATM in another python project as it is to use the system from the command line.
To that end, we should add a database option that is purely in-memory and does not leave a filesystem footprint unless requested. For example, we could use pandas DataFrames in the back-end instead of tables in a SQL database, while maintaining the same interface for the Database class.
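Note that as a stopgap, SQLAlchemy can already run against an in-memory SQLite database, which keeps the existing Database interface with no filesystem footprint:

from sqlalchemy import create_engine

# 'sqlite://' with no path creates a purely in-memory database that
# disappears when the process exits
engine = create_engine('sqlite://')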
Change evaluate_btb.py to include multiple trials of each tuner. This allows for a better evaluation of whether a specific tuner actually leads to a performance increase. The results will be compared to the best so far in terms of the mean over the trials and the standard deviation.
Change the name of btb_test.py to evaluate_btb.py. Here we are evaluating the performance of different tuners, as opposed to testing their functionality.
Running python atm/enter_data.py -h (or --help) should print out the available commands, as is the widely-used convention.
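A minimal sketch with argparse, which generates the -h/--help output automatically (the argument shown is hypothetical):

import argparse

parser = argparse.ArgumentParser(
    description='Load a dataset into the ModelHub and create a datarun.')
parser.add_argument('--train-path', help='path to the training data CSV')
args = parser.parse_args()  # -h and --help print usage and exit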
It would be nice to be able to add additional blocks to the ML pipeline, both static and dynamic (see #70 ).
The nu hyperparameter (defined in methods/gaussian_process.json) is currently a string categorical variable; this causes the following error whenever the matern kernel is selected:
Traceback (most recent call last):
File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
model, performance = self.test_classifier(hyperpartition.method, params)
File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
test_path=test_path)
File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
cv_scores = self.cross_validate(X_train, y_train)
File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
n_folds=self.N_FOLDS)
File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
pipeline.fit(X[train_index], y[train_index])
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
self._final_estimator.fit(Xt, y, **fit_params)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 610, in fit
self.base_estimator_.fit(X, y)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 208, in fit
self.kernel_.bounds)]
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 426, in _constrained_optimization
fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 193, in fmin_l_bfgs_b
**opts)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 328, in _minimize_lbfgsb
f, g = func_and_grad(x)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 278, in func_and_grad
f = fun(x, *args)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
return function(*(wrapper_args + args))
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
fg = self.fun(x, *args)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 200, in obj_func
theta, eval_gradient=True)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 337, in log_marginal_likelihood
K, K_gradient = kernel(self.X_train_, eval_gradient=True)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1337, in __call__
tmp = (math.sqrt(2 * self.nu) * K)
TypeError: a float is required
nu is passed to the constructor for a Matern kernel, which is then passed to the GaussianProcessClassifier constructor. According to the docs, nu should be a float that defaults to 1.5. I'm not sure whether the current configuration was ever correct, but it's not correct as of sklearn 0.18.
Someone should figure out what the proper range of values for nu is, and update the json to reflect that.
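For what it's worth, sklearn's docs note that only nu = 0.5, 1.5, and 2.5 have closed-form kernel expressions (other values are considerably more expensive), so a float categorical over those values seems like a reasonable fix. A sketch, with the schema keys guessed from the other method JSONs:

"nu": {
    "type": "float_cat",
    "values": [0.5, 1.5, 2.5]
}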
It would be nice if the matplotlib module were included in requirements.txt too. It is sometimes useful for debugging tasks.
Hello,
I believe that the logging configurations are not being passed through, resulting in models and metrics being persisted in the default directories rather than the ones specified in log.yaml.
I think the following needs to be added to atm/config.py (on line 524):
log_path = log_path or kwargs.get('log_config')
Additionally, the caller of the load_config function overwrites the log file configuration for what goes to STDOUT and the logs, which means that even if the user specifies the log level to be info for stdout, they still need to add the command line argument --verbose-metrics.
Dockerisation of this project would be great. Makes it easy to install and deploy :) It would also be great to incorporate Celery in so that the jobs can be dispatched to different worker instances.
All pull requests will fail the circle-ci tests due to lint errors on the original codebase ... see #75
Hi all, the log file will produce a line like the one below showing the best classifier, and then you can find out the filename for the learner (provided you kept a log of the run), but how do you read the model file? The manual doesn't seem to cover this either.
Examining results across all models (e.g., by a plot) would also be helpful.
nohup python atm/worker.py 2>&1 | tee Output.txt & # log the run
...
Saving model in: models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model
Saving metrics in: models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model
Saved classifier 21.
...
Best so far (learner 21): 0.716 +- 0.035
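The .model file should just be a pickled atm Model object (see atm/model.py), so something along these lines ought to work; the CSV path here is hypothetical:

import pickle

import pandas as pd

with open('models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model', 'rb') as f:
    model = pickle.load(f)

X = pd.read_csv('new_data.csv')  # same columns as the training data
predictions = model.predict(X)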
To match our explanation of the hyperparameters and their types in the ATM paper, we could:
There is a bit more categorization required in terms of hyperparameters. Will follow up on this thread with more.
Hello, I am trying to test run your classifiers on our data, and am getting some errors when the system tries various classifiers. The relevant portions of the error messages are pasted below:
Chose parameters for method dt:
n_jobs = -1
min_samples_leaf = 1
n_estimators = 100
criterion = entropy
max_features = 0.950735797858
max_depth = 6
TypeError: __init__() got an unexpected keyword argument 'n_jobs'
Chose parameters for method dt:
C = 0.00359684119303
tol = 0.000357435603328
fit_intercept = True
penalty = l2
_scale = True
dual = False
class_weight = auto
TypeError: __init__() got an unexpected keyword argument 'C'
Chose parameters for method logreg:
n_jobs = -1
min_samples_leaf = 1
n_estimators = 100
criterion = gini
max_features = 0.218919710352
max_depth = 7
TypeError: __init__() got an unexpected keyword argument 'min_samples_leaf'
Based on these errors, it seems that the hyperparameters for scikit-learn's DecisionTree model are being mixed up with the hyperparameters for scikit-learn's LogisticRegression model. For example, LogisticRegression does not have a "min_samples_leaf" hyperparameter. Similarly, DecisionTreeClassifier does not have C or n_jobs as hyperparameters. Digging around, the methods/decision_tree.json and methods/logistic_regression.json files seem correct, so I'm not sure why this is getting mixed up.
I get similar issues when running against the example provided in the readme. Here is a copy/paste of the entire error message:
Selector: <class 'btb.selection.uniform.Uniform'>
Tuner: <class 'btb.tuning.uniform.Uniform'>
Choosing hyperparameters...
Chose parameters for method knn:
C = 0.000128015603097
tol = 0.000148636727508
fit_intercept = True
penalty = l2
_scale = True
dual = True
class_weight = auto
Creating classifier...
Testing classifier...
Error testing classifier: datarun=<ID = 5, dataset ID = 5, strategy = uniform__uniform, budget = classifier (100), status: running>
Traceback (most recent call last):
File "atm/worker.py", line 440, in run_classifier
model, performance = self.test_classifier(classifier_id, params)
File "atm/worker.py", line 374, in test_classifier
performance = wrapper.start()
File "/home/kkarra/atm/atm/wrapper.py", line 97, in start
self.make_pipeline()
File "/home/kkarra/atm/atm/wrapper.py", line 383, in make_pipeline
classifier = self.class_(**classifier_params)
File "/home/kkarra/atm/venv/local/lib/python2.7/site-packages/sklearn/neighbors/classification.py", line 126, in __init__
metric_params=metric_params, n_jobs=n_jobs, **kwargs)
TypeError: _init_params() got an unexpected keyword argument 'C'
Here, it seems that the KNN model is getting the wrong keywords. I'm not sure why models are not being given appropriate keywords. I'm wondering if I should dig further to ensure that the selected model receives the correct keywords, or if this is an already-identified bug from porting from the old environment to the new one?
Hi all, any ideas re: the error message below?
Collecting mysql-python==1.2.5 (from -r requirements.txt (line 9))
Downloading MySQL-python-1.2.5.zip (108kB)
100% |████████████████████████████████| 112kB 9.6MB/s
Complete output from command python setup.py egg_info:
sh: 1: mysql_config: not found
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-wg4Zpa/mysql-python/setup.py", line 17, in <module>
metadata, options = get_config()
File "setup_posix.py", line 43, in get_config
libs = mysql_config("libs_r")
File "setup_posix.py", line 25, in mysql_config
raise EnvironmentError("%s not found" % (mysql_config.path,))
EnvironmentError: mysql_config not found
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wg4Zpa/mysql-python/
Pretty much all exponential hyperparameter ranges are defined as something like
"range": [10e-5, 10e5]
This is wrong, and it's my fault for misunderstanding how scientific notation works in Python. When I translated all the old enumeration classes to the new json, I made the mistake of turning 10**3 into 10e3 across the board. See e.g. https://github.com/HDI-Project/ATM/blob/50b592dd6a151a75470fb4120c1781ca7249d43f/atm/enumeration/classification/svm.py for how it was before. This is a quick fix that I'll push in a few minutes.
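To spell out the mix-up:

>>> 10 ** 3   # what the old enumeration classes meant
1000
>>> 10e3      # what the json says: 10 * 10**3
10000.0
>>> 1e3       # the correct scientific-notation equivalent of 10**3
1000.0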
The documentation hasn't been updated for a few months, and does not reflect the state of the project. I'm removing the docs/ folder from the master repo for now, and pushing it into the branch 'bcyphers/docs' (#80) to avoid confusing newcomers to the project. The folder will be added back to master once it is closer to being ready.
If enter_data() is called with the same train_path twice in a row and the data itself hasn't changed, a new Dataset does not need to be created.
We should add a column which stores some kind of hash of the actual data. When a Dataset would be created, if the metadata and data hash are exactly the same as an existing Dataset, nothing should be added to the ModelHub database and the existing Dataset should be returned instead.
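A possible sketch of the hash (function name hypothetical):

import hashlib

def hash_data(train_path):
    # hash the raw bytes of the training file; identical data yields an
    # identical digest, so re-entering the same file can be detected
    with open(train_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()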
When the performance of a classifier is evaluated, the chain of function calls is as follows:
worker.py::self.test_classifier --> model.train_test --> model.test_final_model --> metrics.py::test_pipeline --> metrics.py::get_metrics
test_pipeline has a kwargs argument to allow include_curves to be set; however, no kwargs are passed through the functions above it in the chain, so include_curves always defaults to False in the code. How do we want to address this? Should there be a command line argument (or a configuration option in run_config.yaml) which lets the user decide whether the roc_curves should be computed and, if so, is passed through the system?
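The pass-through itself is mechanical; a simplified sketch (names and signatures abbreviated from the real chain):

def test_pipeline(pipeline, X, y, include_curves=False):
    # ...compute metrics; ROC curves only when include_curves is True
    return {'curves_computed': include_curves}

def test_final_model(pipeline, X, y, **kwargs):
    return test_pipeline(pipeline, X, y, **kwargs)

def train_test(pipeline, X, y, **kwargs):
    return test_final_model(pipeline, X, y, **kwargs)

# worker.test_classifier would then call, e.g.:
# train_test(pipeline, X, y, include_curves=True)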
Added a python=2.7 line in #32, but I'm not sure how you want to style the information.
Python 3 doesn't work because of the print statements.
Tests fail because utilities.py is imported, which in turn imports matplotlib. It doesn't make any sense to have plotting as a hard dependency, but you can make it a conditional import in utilities.py:
try:
    import matplotlib.pyplot as plt
except ImportError:
    plt = None

# then, in graph_series
if plt is None:
    raise ImportError("Unable to import matplotlib")
nolearn.dbn, the library that ATM uses for its Deep Belief Network (DBN) classifier, is no longer supported. From the home page:
The nolearn.dbn module is no longer supported. Take a look at nolearn.lasagne for a more modern neural net toolkit.
We should upgrade to lasagne ASAP.
Possibly unrelated, but sometimes the DBN classifier will fail with an error like this:
Traceback (most recent call last):
File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
model, performance = self.test_classifier(hyperpartition.method, params)
File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
test_path=test_path)
File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
cv_scores = self.cross_validate(X_train, y_train)
File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
n_folds=self.N_FOLDS)
File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
pipeline.fit(X[train_index], y[train_index])
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
self._final_estimator.fit(Xt, y, **fit_params)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/nolearn/dbn.py", line 407, in fit
self.use_dropout,
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 202, in fineTune
err, outMB = step(inpMB, targMB, self.learnRates, self.momentum, self.L2Costs, useDropout)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 303, in stepNesterov
errSignals, outputActs, error = self.fpropBprop(inputBatch, targetBatch, useDropout)
File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 262, in fpropBprop
outputErrSignal = -self.outputActFunct.dErrordNetInput(targetBatch, self.state[-1], outputActs)
AttributeError: 'Tanh' object has no attribute 'dErrordNetInput'
Hopefully the upgrade will kill two birds with one stone.
A REST API would be useful to access ATM from any other language and device through the web, and also to illustrate how the code is structured and might be extended.
From the project's readme I get that the internal APIs are going to change and that it consequently might be a bit early to develop a REST API, but I wanted to see what was possible with the current code. The API currently serves:
No modifications were made outside of the rest_api_server.py file, except to requirements.txt to add flask and simplejson to the project dependencies.
TODO's / caveats:
Example api.py usage:
After following the readme's installation instructions and running python scripts/rest_api_server.py on a separate shell under the virtualenv:
curl localhost:5000/enter_data -F file=@/path/file.csv
It should return:
{"success": true}
To see the created dataset:
curl localhost:5000/datasets/1 | jq
{
"class_column": "class",
"description": null,
"size_kb": 6,
"test_path": null,
"k_classes": 3,
"majority": 0.333333333,
"d_features": 6,
"train_path": "FULLPATH/file.csv",
"id": 1,
"n_examples": 150,
"name": "file"
}
To see the created datarun:
curl localhost:5000/dataruns/1 | jq
{
"status": "pending",
"start_time": null,
"description": "uniform__uniform",
"r_minimum": 2,
"metric": "f1",
"budget": 100,
"selector": "uniform",
"priority": 1,
"score_target": "cv_judgment_metric",
"deadline": null,
"budget_type": "classifier",
"id": 1,
"tuner": "uniform",
"dataset_id": 1,
"gridding": 0,
"k_window": 3,
"end_time": null
}
To run the worker.py script once:
curl localhost:5000/simple_worker | jq
after a while
{
"stderr": "",
"stdout": "huge stdout string with worker.py's output"
}
Got an issue when executing enter_data.py
(atm) ahmets-MBP-782:atm ahmet$ python atm/enter_data.py
Traceback (most recent call last):
  File "atm/enter_data.py", line 7, in <module>
    from boto.s3.connection import S3Connection, Key as S3Key
ImportError: No module named boto.s3.connection
When I run atm/enter_data.py, it pops up this error:
ImportError: No module named atm.config
Right now there are just a couple of actual tests in the test/ folder, and they leave huge portions of the code untouched.
Eventually, we will need a suite of unit tests and comprehensive integration tests that reasonably convince us that a new change won't break anything. We'll also want to integrate with CircleCI and github so that we can evaluate pull requests from the web (but that will come later).
I'll try to update this issue with our progress as time goes on, but a good start would be unit tests for:
database.py's create/query/update functions
metrics.py
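For example, a hypothetical pytest-style unit test for the database create/query functions (the Database constructor and method signatures here are assumptions):

import pytest

from atm.database import Database

@pytest.fixture
def db():
    # an in-memory SQLite database keeps the test self-contained
    return Database(dialect='sqlite', database=':memory:')

def test_create_and_get_dataset(db):
    dataset = db.create_dataset(name='iris', train_path='data/iris.csv')
    assert db.get_dataset(dataset.id).name == 'iris'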
There is no way we need each file name to be 90 characters of gibberish. We should, at the very least, shorten the hash, and probably try to make them more human-readable.
Currently, lots of code for loading config or data depends on the initiating script running from the project root. We can fix this by defining everything relative to the location of a python file instead of relative to ./. This will make everything more predictable and less brittle.
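The usual idiom looks like this:

import os

# resolve paths relative to this python file rather than the current
# working directory, so code works no matter where it is launched from
PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
CONFIG_PATH = os.path.join(PROJECT_ROOT, 'methods')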
I trained a few models, then picked the best one from the model directory and tried to predict on new data, and this error came up:
Traceback (most recent call last):
File "test_best_model.py", line 21, in <module>
preds = best_model.predict(X)
File "/medical_data/Datasets/ATM/atm/model.py", line 209, in predict
X, _ = self.encoder.transform(data)
File "/medical_data/Datasets/ATM/atm/encoder.py", line 101, in transform
features = data[self.feature_columns]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I use pickle to load the model:
with open(classfier_p, 'rb') as f:
    best_model = pickle.load(f)
Right now, it's only possible to configure a datarun to use methods and metrics that are included with the library. It should be possible to pass JSON files for custom machine-learning methods and Python files/functions for custom metrics. This can be implemented in much the same way that custom tuners/selectors for BTB are handled.
Should be a file that has helper methods for loading and testing previously-generated models. This should include some graphing and data visualization utilities.
Not sure about the best way to make the interface -- it could be a command-line tool like enter_data and worker, or it could be a Python REPL-style interface directly to the functions.
At this point, I think the "methods" table in the database is vestigial, and no longer serves a purpose. We should get rid of it to reduce clutter.
Line 341 of atm/worker.py does not pass the score target, so regardless of what the user chooses in run_config.yaml, the code determines the best classifier based on the mu_sigma computation. It seems there is some inconsistency in nomenclature between the mu_sigma and cv options.
Was the original intent for mu_sigma to correspond to the highest lower error bound, cv to correspond to the average cv score, and test to correspond to the average test score? mu_sigma is not among the keys in the database, so it's not a valid run_config.yaml value. If we take cv to mean the same thing as mu_sigma, then one approach to deal with this may be (starting at line 341 of atm/worker.py):
if 'cv' in self.datarun.score_target:
    score_target_in = 'mu_sigma'
else:
    score_target_in = self.datarun.score_target

old_best = self.db.get_best_classifier(datarun_id=self.datarun.id,
                                       score_target=score_target_in)

cur_cv_val = model.cv_judgment_metric
cur_cv_err = model.cv_judgment_metric_stdev
cur_test_val = model.test_judgment_metric

if old_best is not None:
    old_cv_val = old_best.cv_judgment_metric
    old_cv_err = 2 * old_best.cv_judgment_metric_stdev
    old_test_val = old_best.test_judgment_metric

if 'cv' in self.datarun.score_target:
    _log('Judgment metric (%s): %.3f +- %.3f' %
         (self.datarun.metric, cur_cv_val, cur_cv_err))
    if old_best is not None:
        if (cur_cv_val - cur_cv_err) > (old_cv_val - old_cv_err):
            _log('New best score! Previous best (classifier %s): %.3f +- %.3f' %
                 (old_best.id, old_cv_val, old_cv_err))
        else:
            _log('Best so far (classifier %s): %.3f +- %.3f' %
                 (old_best.id, old_cv_val, old_cv_err))
else:
    _log('Judgment metric (%s): %.3f' %
         (self.datarun.metric, cur_test_val))
    if old_best is not None:
        if cur_test_val > old_test_val:
            _log('New best score! Previous best (classifier %s): %.3f' %
                 (old_best.id, old_test_val))
        else:
            _log('Best so far (classifier %s): %.3f' %
                 (old_best.id, old_test_val))
Not installed by default in new virtual environments such as conda python 2.7
Right now, if you create a datarun with a typo, or if you just decide you don't want to run it, there's no simple way to remove it from the database. We should add this as a subcommand to enter_data.py. Maybe:
python enter_data.py remove --datarun 1
likewise,
python enter_data.py remove --dataset 1
I'm guessing that the tester has populated the files config/test/btb/aws.yaml etc. and data/car_1.csv etc. in their environment.
As it currently stands, it is not possible for someone who just clones the package to run the tests and assure themselves that the software is installed correctly and works correctly.
Consider overriding the .gitignore to commit the needed yaml and csv files for the test/ directory only? I imagine that the AWS tests would be skipped by default. At the least, there could also be unit tests for some of the software that are not dependent on data and configs.
(Somewhat unrelated to this issue: I would have thought that test/method_test.py does unit tests, and was going to recommend the more conventional python -m pytest invocation. But it seems that that file is very similar to test/end_to_end_test.py, doing end-to-end tests as well. Fairly confused by that.)
Hi all,
As a beginner with ATM, I got the error below while running 'python atm/enter_data.py'.
Could you tell me how I can deal with this error?
Thank you!
=======================================
(myPy27) ubuntu@ip-172-31-17-79:~/workspace/git/atm$ python atm/enter_data.py
Traceback (most recent call last):
File "atm/enter_data.py", line 8, in
from .config import *
ValueError: Attempted relative import in non-package