yandex / rep
Machine Learning toolbox for Humans
Home Page: http://yandex.github.io/rep/
License: Other
It would be nice to also support sklearn's TransformerMixin, as well as clustering.
There is only a FoldingClassifier; it would be nice to have a folding regressor too.
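To make the folding regressor request concrete, here is a minimal sketch. It assumes that "folding" means every sample is predicted by a model that was not trained on it. `MeanRegressor` and `folding_predict` are hypothetical names used only for illustration and are not part of REP.

```python
import random


class MeanRegressor(object):
    """Toy regressor (hypothetical) that always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def folding_predict(make_regressor, X, y, n_folds=2, random_state=0):
    """Predict every sample with a model that was not trained on it."""
    rng = random.Random(random_state)
    # Assign samples to folds evenly, then shuffle the assignment.
    fold_of = [i % n_folds for i in range(len(X))]
    rng.shuffle(fold_of)
    predictions = [None] * len(X)
    for fold in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        test = [i for i, f in enumerate(fold_of) if f == fold]
        model = make_regressor().fit([X[i] for i in train], [y[i] for i in train])
        for i, value in zip(test, model.predict([X[i] for i in test])):
            predictions[i] = value
    return predictions, fold_of


X = list(range(6))
y = [0, 0, 0, 10, 10, 10]
predictions, fold_of = folding_predict(MeanRegressor, X, y, n_folds=3)
```

A real implementation would also need to keep the fold assignment so that new data can be predicted consistently; that part is left out here.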
staged_predict_proba
predict_all
(True, False) to report learning curve (when mask will be applied, before or after prediction operation)
XGBoost supports only float64, int64 and bool types for data.
151 dtypes = data.dtypes
152 if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
--> 153 raise ValueError('DataFrame.dtypes must be int, float or bool')
154
155 if feature_names is None:
ValueError: DataFrame.dtypes must be int, float or bool
New version of XGBoost should be used.
After python setup.py install I run cd tests ; nosetests .
It runs for a long time and ends with errors:
..Info in <TCanvas::Print>: png file /tmp/tmpBg1dar.png has been created
Error in <TFile::TFile>: file toy_datasets/toyMC_bck_mass.root does not exist
E..E.
======================================================================
ERROR: tests.z_test_notebook.test_notebooks_in_folder('/root/rep/howto/00-intro-ROOT.ipynb',)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/root/rep/rep/test/test_notebooks.py", line 43, in check_single_notebook
raise RuntimeError(description)
RuntimeError: Cell failed: 'T.Draw("min_DOCA")
c1'
Traceback:
---------------------------------------------------------------------------
ReferenceError Traceback (most recent call last)
<ipython-input-5-aa6c7320180d> in <module>()
----> 1 T.Draw("min_DOCA")
2 c1
ReferenceError: attempt to access a null-pointer
What am I missing?
Hi!
Is there a way to install REP manually in a Windows environment?
When installing dependencies I get an error when installing gnureadline:
Error: this module is not meant to work on Windows (try pyreadline instead)
Is there a way to use pyreadline for Windows users?
It would be nice to add a wrapper for Caffe, which is a popular library for deep neural networks.
This isn't really an issue but I just wanted to let you know that root_numpy now provides functions that can feed NumPy arrays directly to TMVA Factories or Readers:
http://rootpy.github.io/root_numpy/reference/index.html#module-root_numpy.tmva
So there is no longer any need to convert the arrays to TTrees in a temporary ROOT file.
The test_xgboost file is not running on Windows 10:
File "c:\Sander\my_code\rep-master\tests\test_xgboost.py", line 4, in
from rep.estimators import XGBoostClassifier, XGBoostRegressor
ImportError: cannot import name XGBoostClassifier
This happens when the rep installation is OK but the xgboost install fails:
Microsoft Windows Version 10.0.10586 2015 Microsoft Corporation. All rights reserved.
c:\Sander>pip install rep --no-dependencies
Collecting rep
Downloading rep-0.6.5.tar.gz (72kB)
100% |################################| 81kB 511kB/s
Building wheels for collected packages: rep
Running setup.py bdist_wheel for rep ... done
Stored in directory: C:\Users\Sander\AppData\Local\pip\Cache\wheels\db\ee\06\ac6e3f3ec208edaee29654f0b55ffaf2719a51de799c396b91
Successfully built rep
Installing collected packages: rep
Successfully installed rep-0.6.5
You are using pip version 8.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
c:\Sander>pip install xgboost==0.4a30
Collecting xgboost==0.4a30
Downloading xgboost-0.4a30.tar.gz (753kB)
100% |################################| 757kB 553kB/s
No files/directories in c:\users\sander\appdata\local\temp\pip-build-exobfm\xgboost\pip-egg-info (from PKG-INFO)
You are using pip version 8.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
c:\Sander>
estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=15), features=features)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer)
grid_finder.fit(data, labels)
results in
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
in ()
1 estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=15), features=features)
2 grid_finder = GridOptimalSearchCV(estimator, generator, scorer)
----> 3 grid_finder.fit(data, labels)
/Users/axelr/ipython/rep/rep/metaml/gridsearch.py in fit(self, X, y, sample_weight)
530 state_indices, state_dict = self.params_generator.generate_next_point()
531 status, value = apply_scorer(self.scorer, state_dict, self.base_estimator, X, y, sample_weight)
--> 532 assert status == 'success', 'Error during grid search ' + str(value)
533 self.params_generator.add_result(state_indices, value)
534 self.evaluations_done += 1
AssertionError: Error during grid search 'RocAuc' object has no attribute 'classes_'
/root/miniconda/envs/rep_py2/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in _maybe_from_pandas(data, feature_names, feature_types)
151 dtypes = data.dtypes
152 if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
--> 153 raise ValueError('DataFrame.dtypes must be int, float or bool')
154
155 if feature_names is None:
ValueError: DataFrame.dtypes must be int, float or bool
Currently grid_search always maximizes the quality metric, but for some metrics (e.g. LogLoss) lower is better, so maximizing is wrong.
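One possible workaround, sketched here under the assumption that the search only ever maximizes, is to negate the loss before handing it to the optimizer. The helper names `log_loss` and `as_score` are illustrative and are not part of REP.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary log loss; lower is better."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def as_score(loss):
    """Wrap a loss so that a maximizing search effectively minimizes it."""
    def score(labels, probabilities):
        return -loss(labels, probabilities)
    return score


neg_log_loss = as_score(log_loss)

labels = [0, 1, 1, 0]
good = [0.1, 0.9, 0.8, 0.2]  # confident and correct predictions
bad = [0.6, 0.4, 0.5, 0.7]   # mostly wrong predictions
best = max([good, bad], key=lambda p: neg_log_loss(labels, p))
```

With the negated metric, the maximizing search picks the candidate with the lowest LogLoss.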
Hi!
Using REP with Docker is really nice and easy. Is it possible to have the same for Python 3 or do I have to build the image myself?
Current grid-search example is very complicated.
The following minimal example by @mschlupp fails
from rep.estimators import NeurolabClassifier
import pandas as pd
clf = NeurolabClassifier(net_type='learning-vector')
ds = pd.DataFrame()
ds['feature1']=[0,1,2,3,4,5]
ds['feature2']=[5,7,2,4,7,9]
ds['y']=[0,0,0,1,1,1]
clf.fit(ds[['feature1','feature2']],ds['y'])
since transf makes no sense for LVQ.
This one is used somehow (?) for building the rep-image for Python 3.
This parameter is not supported now: install_repbase.sh only expects the major version parameter.
It seems the latest Docker release deprecates boot2docker: http://docs.docker.com/installation/mac/
"This release of Docker deprecates the Boot2Docker command line in favor of Docker Machine"
How do I install REP with the latest Docker release?
After installing REP from here, I ran into the following problem when fitting TMVAClassifier: an IOError is raised after these lines:
baseline = TMVAClassifier(method='kBDT', features=variables, BoostType='Grad', NTrees=40, Shrinkage=0.01, MaxDepth=7, UseNvars=6, nCuts=-1)
baseline.fit(train, train['signal'])
The stack trace follows:
IOError Traceback (most recent call last)
in ()
3 UseNvars=6, nCuts=-1)
4 # baseline = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
----> 5 baseline.fit(train, train['signal'])
/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in fit(self, X, y, sample_weight)
288 self.factory_options = '{}:AnalysisType=Multiclass'.format(self.factory_options)
289
--> 290 return self._fit(X, y, sample_weight=sample_weight)
291
292 def predict_proba(self, X):
/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in _fit(self, X, y, sample_weight, model_type)
104 add_info = _AdditionalInformation(directory, model_type=model_type)
105 try:
--> 106 self._run_tmva_training(add_info, X, y, sample_weight)
107 finally:
108 self._remove_tmp_directory(directory)
/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in _run_tmva_training(self, info, X, y, sample_weight)
134 xml_filename = os.path.join(info.directory, 'weights',
135 '{job}{name}.weights.xml'.format(job=info.tmva_job, name=self._method_name))
--> 136 with open(xml_filename, 'r') as xml_file:
137 self.formula_xml = xml_file.read()
138
IOError: [Errno 2] No such file or directory: '/home/artem/Documents/IPython Notebooks/CERN + Yandex/Original Baseline/flavours-of-physics-start/tmp0Fhtqe/weights/TMVAEstimation_REP_Estimator.weights.xml'
As far as I can tell, the weights/ folder was created outside the temporary folder instead of inside it, which causes the error above.
ROOT 5.34, Python 2.7, GCC 4.8, Ubuntu 14.04 LTS (x64). All requirements for REP were installed successfully (from requirements.txt).
i.e. this works:
X = XGBoostClassifier()
X.fit(D.data['X_train'], D.data['Y_train'])
and this doesn't:
factory = ClassifiersFactory()
factory.add_classifier('ada', AdaBoostClassifier(n_estimators=100))
factory['xgb'] = XGBoostClassifier()
factory.fit(D.data['X_train'], D.data['Y_train'])
complaining:
/usr/local/lib/python2.7/dist-packages/rep/metaml/factory.pyc in fit(self, X, y, sample_weight, parallel_profile, features)
48 :return: self
49 """
---> 50 assert isinstance(X, pandas.DataFrame), 'The passed '
51 if features is not None:
52 for name, estimator in self.items():
AssertionError: The passed
compute_optimum
All of the neural net wrappers (pybrain, neurolab, theanets) have problems with reproducibility.
For now there are some dirty hacks that change the global seed, which is awful. (And they don't work for PyBrain, so results there are totally irreproducible.)
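For contrast with the global-seed hack, here is a minimal sketch of the usual alternative: the estimator owns a private generator built from a random_state parameter. It only works if the underlying library lets you pass in the generator or the initial weights, which is an assumption here. `TinyRandomInit` is a hypothetical stand-in, not a REP class.

```python
import random


class TinyRandomInit(object):
    """Hypothetical estimator that draws its initial weights from its own RNG."""

    def __init__(self, n_weights=3, random_state=None):
        self.n_weights = n_weights
        self.random_state = random_state

    def fit(self):
        # Private generator: the global random state is neither read nor modified.
        rng = random.Random(self.random_state)
        self.weights_ = [rng.uniform(-1, 1) for _ in range(self.n_weights)]
        return self


a = TinyRandomInit(random_state=42).fit()
random.random()  # unrelated use of the global RNG in between
b = TinyRandomInit(random_state=42).fit()
```

Both fits produce the same weights regardless of what happened to the global generator in between.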
The ROOT package on conda was updated, and the installation scripts no longer find thisroot.sh.
First failed job: https://travis-ci.org/yandex/rep/builds/113799526
I would really like to use your package, it looks like an awesome simplification for trying different ML algorithms and frameworks.
What's currently stopping me from using it are the == requirements for rather old package versions, especially pandas 0.14.
Why do you need those == requirements?
Could you check if anything gets broken with newer versions and change the requirement to >=?
An introduction to the structure of the module is needed.
Add a wrapper for factorization machines.
Currently, grid search can only use an IPython cluster for parallel computations.
For uniformity with other parts of the library, it sounds reasonable to introduce support for threads here.
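As a rough illustration of thread support, the sketch below evaluates all grid points in a thread pool from the standard library. The `evaluate` function and the parameter grid are hypothetical stand-ins for training and scoring a model; this is not how REP's grid search is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical scorer: stands in for training and scoring one model."""
    return -(params['depth'] - 3) ** 2 - (params['rate'] - 0.1) ** 2


grid = {'depth': [1, 3, 5], 'rate': [0.05, 0.1, 0.2]}
names = sorted(grid)
# Expand the grid into a list of parameter dictionaries.
points = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, points))  # results come back in input order

best_score, best_params = max(zip(scores, points), key=lambda pair: pair[0])
```

Threads only give a real speedup when the training code releases the GIL (as many compiled libraries do); for pure-Python work they mostly help with uniformity of the interface.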
It would be good to have a single installation script (together with the environment) that is used for Docker image preparation, continuous integration and fabric-like integration.
Currently there is no special point in having ROOT notebooks, since they are not connected to REP directly and only use plotting.
Two notebooks about ROOT are definitely overkill.
When I start the develop version under JupyterHub, it fails and reports the following in the log:
Starting under Jupyterhub
fatal: destination path '/notebooks' already exists and is not an empty directory.
Traceback (most recent call last):
File "/root/miniconda/envs/jupyterhub_py3/bin/jupyterhub-singleuser", line 21, in <module>
import notebook
ImportError: No module named 'notebook'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda/envs/jupyterhub_py3/bin/jupyterhub-singleuser", line 23, in <module>
raise ImportError("JupyterHub single-user server requires notebook >= 4.0")
ImportError: JupyterHub single-user server requires notebook >= 4.0
When training many models in threads and passing the columns argument, the kernel restarts.
After two hours of debugging we found out this is due to numexpr, which is unable to work normally in threads. Minimal failing example:
from rep.metaml.utils import map_on_cluster
import pandas
import numpy
from rep.utils import get_columns_in_df

columns = ['Feature_0', 'Feature_1']
size = 10000
x = pandas.DataFrame(numpy.random.random([size, 2]), columns=columns)

def f(x, columns):
    x = get_columns_in_df(x.copy(), columns)
    return x

n = 50
result = map_on_cluster('threads-4', f, [x] * n, [columns[:1]] * n)
print('ok')
At least this means we should minimize usage of numexpr (if not exclude it completely).
Given: set of machines connected together and to the Internet
Needed: instructions/scripts to set up runnable ipython cluster on those machines using REP containers
Getting Python to work on *nix systems is not a big deal, while for Windows users Docker may serve as a good option compared to conda/pythonXY/etc.
For me it would be enough to simply have the arguments of docker run (I can hardly decode them from the scripts).
Find out under which conditions TMVA processes keep running.
Example of a failed job: https://travis-ci.org/yandex/rep/jobs/112224597
ATM Travis spends ~45 minutes, the major part of which is spent on testing notebooks.
Minor remark: in scikit-learn (since 0.16, I think) train_test_split should work like yours.
If you have any other api annoyances that you need to work around, let me know ;)
ps: awesome stuff!
Build an additional test checking that GaussianProcess optimization works correctly.
For batch optimization, some improvement is probably possible by also minimizing the overall variance in addition to looking for the best model. (Need to check whether a well-tested implementation exists.)
Sklearn's Gaussian processes don't support variance of measurements, which could probably improve the search.
Hi.
I had an old REP Docker version in ~/rep_container, which was started with the run.sh script on port 8080. I updated REP and it broke: sudo $REPDIR/run.sh worked, but I couldn't connect to localhost:8080 (connection refused). I decided to update Docker and REP according to the new instructions: https://github.com/yandex/rep/wiki/Install-REP-with-Docker-(Linux).
netstat -anl | grep 8888 gave an empty result.
git checkout https://github.com/yandex/rep.git didn't work (pathspec did not match any file(s) known to git), so I used git clone instead.
sudo make run was successful and installed the container.
sudo make run gave the following:
docker run -ti --rm -p 8888:8888 --name rep yandex/rep:0.6.4
Error response from daemon: Conflict. The name "rep" is already in use by container 3af0884aeedb. You have to remove (or rename) that container to be able to reuse that name.
make: *** [run] Error 1
6. I ran sudo docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
yandex/rep 0.6.4 18a48bc5a3b6 8 hours ago 2.635 GB
anaderi/rep latest 63c3db2850b6 4 months ago 1.649 GB
91c95931e552 7 months ago 910 B
7. I tried sudo docker start rep. It worked and I opened REP on localhost:8888, but its working folder changed. Now it is /var/lib/docker/volumes/dbcc7ff99538007d9c6b244fb6b8f03bdcfd564f6076b36d79fa3330d2041107/_data/. This is inconvenient, because it requires superuser rights to access and is not conveniently located.
Question: Is this how the new system works, or did I do something wrong? If the latter, how do I fix it and run the REP container in a convenient folder?
The XGBoost wrapper was significantly improved.
The new wrapper is almost sklearn-compatible, but copy and clone still fail, and the important field classes_ is not set.
So there are two ways:
When some feature has zero importance it is not listed
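A small sketch of the expected behaviour, assuming the importances come back as a mapping from feature name to value: features missing from the report get an explicit 0.0. The feature names and the helper `complete_importances` are made up for illustration.

```python
def complete_importances(features, reported):
    """Return an importance for every feature, using 0.0 for the ones not reported."""
    return {name: reported.get(name, 0.0) for name in features}


features = ['mass', 'pt', 'eta']      # hypothetical feature names
reported = {'mass': 12.0, 'pt': 3.0}  # 'eta' was never used, so it is missing
importances = complete_importances(features, reported)
```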
Proposal: add a cache classifier for research that requires heavy computations.
clf = CacheClassifier(name='stage_1', base_estimator=XGBoostClassifier(...))
clf.fit(X, y, sample_weight)
clf.predict(...)
All the methods are proxied to the initial classifier (XGBoostClassifier in this case).
A copy of the trained classifier is saved at .rep_cache/stage_1.pkl, together with a hash of the dataset.
The next time the notebook is executed, if the classifier parameters and the dataset hash are the same, the fit method only loads the already-trained estimator.
There are many possible caveats; the first to think about is handling clone and pickle, which are not trivial.
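A minimal sketch of the idea, under several assumptions: the wrapped estimator exposes an sklearn-style get_params(), the data can be pickled, and pickled bytes are stable enough to hash within one environment. `CachedEstimator` and `MeanThreshold` are hypothetical names. To keep the example simple, the cache stores the fitted attributes rather than the estimator object.

```python
import hashlib
import os
import pickle
import tempfile


class CachedEstimator(object):
    """Sketch of a cache wrapper: fit() is skipped when params and data hash match."""

    def __init__(self, name, base_estimator, cache_dir='.rep_cache'):
        self.name = name
        self.base_estimator = base_estimator
        self.cache_dir = cache_dir

    def _key(self, X, y):
        # Hash the estimator parameters together with the training data.
        params = sorted(self.base_estimator.get_params().items())
        return hashlib.sha256(pickle.dumps((params, X, y))).hexdigest()

    def fit(self, X, y):
        path = os.path.join(self.cache_dir, self.name + '.pkl')
        key = self._key(X, y)
        if os.path.exists(path):
            with open(path, 'rb') as f:
                cached_key, state = pickle.load(f)
            if cached_key == key:
                vars(self.base_estimator).update(state)  # cache hit: restore fitted state
                return self
        self.base_estimator.fit(X, y)
        if not os.path.isdir(self.cache_dir):
            os.makedirs(self.cache_dir)
        with open(path, 'wb') as f:
            pickle.dump((key, dict(vars(self.base_estimator))), f)
        return self

    def __getattr__(self, attribute):
        # Proxy everything else (predict, ...) to the wrapped estimator.
        if attribute == 'base_estimator' or attribute.startswith('__'):
            raise AttributeError(attribute)  # avoids recursion during copy/pickle
        return getattr(self.base_estimator, attribute)


class MeanThreshold(object):
    """Toy stand-in for a real classifier; counts how often it is trained."""
    fit_calls = 0

    def __init__(self, shift=0.0):
        self.shift = shift

    def get_params(self):
        return {'shift': self.shift}

    def fit(self, X, y):
        MeanThreshold.fit_calls += 1
        self.threshold_ = sum(X) / len(X) + self.shift
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


cache_dir = tempfile.mkdtemp()  # temporary directory instead of .rep_cache for the demo
X, y = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
first = CachedEstimator('stage_1', MeanThreshold(), cache_dir).fit(X, y)
second = CachedEstimator('stage_1', MeanThreshold(), cache_dir).fit(X, y)  # loaded, not retrained
```

The guard in `__getattr__` is one example of the clone/pickle caveats: without it, copying or unpickling the wrapper would recurse while looking up base_estimator.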
It is useful if the Jupyter frontend disconnects during fit/predict execution.
The following snippet might be handy for such cases:
import datetime
import time

class Stopwatch(object):
    """Context manager that records how long the enclosed block took."""
    def __enter__(self):
        self.t0 = datetime.datetime.now()
        return self
    def __exit__(self, type, value, traceback):
        self.t1 = datetime.datetime.now()
    def __repr__(self):
        return "delta: (%s)" % (self.t1 - self.t0)

with Stopwatch() as sfit:
    time.sleep(1)  # stands in for fit
with Stopwatch() as spredict:
    time.sleep(1)  # stands in for predict
print("fit: %s spredict: %s" % (sfit, spredict))