yandex / rep

Machine Learning toolbox for Humans

Home Page: http://yandex.github.io/rep/

License: Other

Python 11.04% Shell 0.16% Makefile 0.08% Jupyter Notebook 88.72%

rep's Issues

TransformerMixin

It would be nice to support sklearn's TransformerMixin, and clustering as well.

FoldingRegressor

There is only a FoldingClassifier; it would be nice to have a FoldingRegressor too.

FoldingClassifier doesn't work correctly with mask in report.learning_curve

Add a message about the dataset length (whether or not it equals the training length) during staged_predict_proba
Add a predict_all (True, False) option to report.learning_curve that controls whether the mask is applied before or after the prediction operation

DTypes problem in current xgboost version

XGBoost supports only the float64, int64 and bool dtypes for data:

    151     dtypes = data.dtypes
    152     if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
--> 153         raise ValueError('DataFrame.dtypes must be int, float or bool')
    154
    155     if feature_names is None:

ValueError: DataFrame.dtypes must be int, float or bool

A newer version of XGBoost should be used.

Manual Install on Windows

Hi!
Is there a way to install REP manually in a Windows environment?
When installing the dependencies, I get an error from gnureadline:

Error: this module is not meant to work on Windows (try pyreadline instead)

Is there a way to use pyreadline for Windows users?

get rid of docker.cid file while running docker container

Create wrapper for caffe

It would be nice to add a wrapper for caffe, which is a popular library for deep neural networks.

Rewrite documentation for metrics

Use the new TMVA-related functions in root_numpy

This isn't really an issue but I just wanted to let you know that root_numpy now provides functions that can feed NumPy arrays directly to TMVA Factories or Readers:

http://rootpy.github.io/root_numpy/reference/index.html#module-root_numpy.tmva

So there is no longer any need to convert the arrays to TTrees in a temporary ROOT file.

test_xgboost file is not running on windows 10

The test_xgboost file does not run on Windows 10:
File "c:\Sander\my_code\rep-master\tests\test_xgboost.py", line 4, in
from rep.estimators import XGBoostClassifier, XGBoostRegressor

ImportError: cannot import name XGBoostClassifier

This happens when the rep installation succeeds but the xgboost installation fails:
Microsoft Windows Version 10.0.10586 2015 Microsoft Corporation. All rights reserved.

c:\Sander>pip install rep --no-dependencies
Collecting rep
Downloading rep-0.6.5.tar.gz (72kB)
100% |################################| 81kB 511kB/s
Building wheels for collected packages: rep
Running setup.py bdist_wheel for rep ... done
Stored in directory: C:\Users\Sander\AppData\Local\pip\Cache\wheels\db\ee\06\ac6e3f3ec208edaee29654f0b55ffaf2719a51de799c396b91
Successfully built rep
Installing collected packages: rep
Successfully installed rep-0.6.5
You are using pip version 8.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

c:\Sander>pip install xgboost==0.4a30
Collecting xgboost==0.4a30
Downloading xgboost-0.4a30.tar.gz (753kB)
100% |################################| 757kB 553kB/s
No files/directories in c:\users\sander\appdata\local\temp\pip-build-exobfm\xgboost\pip-egg-info (from PKG-INFO)
You are using pip version 8.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

c:\Sander>

A better introductory notebook or video

Grid_search doesn't use metric prefitting

estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=15), features=features)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer)
grid_finder.fit(data, labels)

results in

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
 in ()
      1 estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=15), features=features)
      2 grid_finder = GridOptimalSearchCV(estimator, generator, scorer)
----> 3 grid_finder.fit(data, labels)

/Users/axelr/ipython/rep/rep/metaml/gridsearch.py in fit(self, X, y, sample_weight)
    530                 state_indices, state_dict = self.params_generator.generate_next_point()
    531                 status, value = apply_scorer(self.scorer, state_dict, self.base_estimator, X, y, sample_weight)
--> 532                 assert status == 'success', 'Error during grid search ' + str(value)
    533                 self.params_generator.add_result(state_indices, value)
    534                 self.evaluations_done += 1

AssertionError: Error during grid search 'RocAuc' object has no attribute 'classes_'

float32 not supported by xgboost

/root/miniconda/envs/rep_py2/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in _maybe_from_pandas(data, feature_names, feature_types)
151 dtypes = data.dtypes
152 if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
--> 153 raise ValueError('DataFrame.dtypes must be int, float or bool')
154
155 if feature_names is None:

ValueError: DataFrame.dtypes must be int, float or bool
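
Until a newer XGBoost is in use, one possible workaround is to cast the offending columns before fitting. The sketch below assumes the data is a pandas DataFrame with numeric columns; the helper name `to_supported_dtypes` is hypothetical and is not part of REP or XGBoost.

```python
import numpy as np
import pandas as pd

# dtypes accepted by the check shown in the traceback
SUPPORTED = ('int64', 'float64', 'bool')


def to_supported_dtypes(data):
    """Return a copy of `data` with unsupported numeric columns cast to float64.

    Hypothetical helper for illustration; not part of REP or XGBoost.
    """
    casts = {column: 'float64'
             for column, dtype in data.dtypes.items()
             if dtype.name not in SUPPORTED}
    return data.astype(casts)


frame = pd.DataFrame({
    'a': np.array([0.5, 1.5], dtype='float32'),   # rejected by the check
    'b': np.array([1, 2], dtype='int64'),         # already supported
})
converted = to_supported_dtypes(frame)
```

Only the columns that fail the check are cast, so supported integer and boolean columns keep their dtype, and the original frame is left untouched.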

Add 'minimize' parameter to OptimalGridSearch

Currently grid_search always maximizes the quality metric, but for some metrics (e.g. LogLoss) a lower value is better, so they should be minimized instead.
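
One way such a parameter could behave is to flip the sign of the metric, so that the existing maximizing search needs no other change. This is only a sketch of the idea: `as_maximized` and its `minimize` flag are hypothetical names, and the log loss is written out by hand to keep the example self-contained.

```python
import math


def log_loss(labels, probabilities):
    """Mean binary log loss; lower is better."""
    total = 0.0
    for label, p in zip(labels, probabilities):
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(labels)


def as_maximized(metric, minimize=False):
    """Wrap `metric` so that a maximizing search can also handle losses.

    `minimize` stands for the flag proposed in this issue (hypothetical).
    """
    sign = -1.0 if minimize else 1.0
    return lambda *args, **kwargs: sign * metric(*args, **kwargs)


scorer = as_maximized(log_loss, minimize=True)
labels = [1, 0, 1]
good = scorer(labels, [0.9, 0.1, 0.8])  # confident and correct predictions
bad = scorer(labels, [0.6, 0.4, 0.5])   # less confident predictions
```

With the sign flipped, the better predictions get the larger (less negative) score, which is what a maximizing search expects.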

Docker with python3

Hi!

Using REP with Docker is really nice and easy. Is it possible to have the same for Python 3 or do I have to build the image myself?

Simple grid-search example notebook

The current grid-search example is very complicated.

Problems with LVQ in neurolab

The following minimal example by @mschlupp fails

from rep.estimators import NeurolabClassifier
import pandas as pd
clf = NeurolabClassifier(net_type='learning-vector')
ds = pd.DataFrame()
ds['feature1']=[0,1,2,3,4,5]
ds['feature2']=[5,7,2,4,7,9]
ds['y']=[0,0,0,1,1,1]
clf.fit(ds[['feature1','feature2']],ds['y'])

since transf makes no sense for LVQ.

create notebook that explicitly downloads datasets

Get rid of TRAVIS_PYTHON_VERSION in makefile

This variable is used somehow (?) for building the rep-image for python3.
The parameter is no longer supported: install_repbase.sh only expects the major version parameter.

Mac OS installation with docker

It seems the latest Docker release deprecates boot2docker: http://docs.docker.com/installation/mac/
"This release of Docker deprecates the Boot2Docker command line in favor of Docker Machine"

How do I install REP with the latest Docker release?

publish 0.6.4 to PyPI

Problem with TMVAClassifier

After installing REP from here, I ran into the following problem when fitting TMVAClassifier. An IOError is raised by this code:

baseline = TMVAClassifier(method='kBDT', features=variables, BoostType='Grad', NTrees=40, Shrinkage=0.01, MaxDepth=7, UseNvars=6, nCuts=-1)
baseline.fit(train, train['signal'])

The stack trace follows:

IOError                                   Traceback (most recent call last)
 in ()
      3 UseNvars=6, nCuts=-1)
      4 # baseline = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
----> 5 baseline.fit(train, train['signal'])

/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in fit(self, X, y, sample_weight)
    288 self.factory_options = '{}:AnalysisType=Multiclass'.format(self.factory_options)
    289
--> 290 return self._fit(X, y, sample_weight=sample_weight)
    291
    292 def predict_proba(self, X):

/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in _fit(self, X, y, sample_weight, model_type)
    104 add_info = _AdditionalInformation(directory, model_type=model_type)
    105 try:
--> 106 self._run_tmva_training(add_info, X, y, sample_weight)
    107 finally:
    108 self._remove_tmp_directory(directory)

/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in _run_tmva_training(self, info, X, y, sample_weight)
    134 xml_filename = os.path.join(info.directory, 'weights',
    135 '{job}{name}.weights.xml'.format(job=info.tmva_job, name=self._method_name))
--> 136 with open(xml_filename, 'r') as xml_file:
    137 self.formula_xml = xml_file.read()
    138

IOError: [Errno 2] No such file or directory: '/home/artem/Documents/IPython Notebooks/CERN + Yandex/Original Baseline/flavours-of-physics-start/tmp0Fhtqe/weights/TMVAEstimation_REP_Estimator.weights.xml'

As far as I can tell, the weights/ folder was created outside the temporary folder instead of inside it, which causes the error above.

ROOT 5.34, Python 2.7, GCC 4.8, Ubuntu 14.04 LTS (x64). All requirements for REP were installed successfully (from requirements.txt).

Factory doesn't fit given plain numpy arrays

i.e. this works:

X = XGBoostClassifier()
X.fit(D.data['X_train'], D.data['Y_train'])

and this doesn't:

factory = ClassifiersFactory()
factory.add_classifier('ada', AdaBoostClassifier(n_estimators=100))
factory['xgb'] = XGBoostClassifier()
factory.fit(D.data['X_train'], D.data['Y_train'])

complaining:

/usr/local/lib/python2.7/dist-packages/rep/metaml/factory.pyc in fit(self, X, y, sample_weight, parallel_profile, features)
     48         :return: self
     49         """
---> 50         assert isinstance(X, pandas.DataFrame), 'The passed '
     51         if features is not None:
     52             for name, estimator in self.items():

AssertionError: The passed
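
Until the factory accepts plain arrays, a caller can wrap them in a DataFrame first. The sketch below assumes a 2-D array of features; `as_dataframe` and the generated column names are hypothetical and not part of REP.

```python
import numpy as np
import pandas as pd


def as_dataframe(X, prefix='feature_'):
    """Wrap a 2-D array in a DataFrame with generated column names.

    Hypothetical helper; DataFrames are returned unchanged.
    """
    if isinstance(X, pd.DataFrame):
        return X
    X = np.asarray(X)
    columns = ['{}{}'.format(prefix, i) for i in range(X.shape[1])]
    return pd.DataFrame(X, columns=columns)


X_train = np.random.random((5, 3))
frame = as_dataframe(X_train)
# `frame` is a pandas.DataFrame, so it satisfies the isinstance assertion
# in factory.fit shown in the traceback.
```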

OptimalMetricNdim - better documentation

OptimalMetricNdim doesn't follow the metric conventions. This should be stated explicitly. In addition, it should not derive from the metric mixin, and call should be renamed to something like compute_optimum.
A good explanation (probably with an example) is needed as well.

Reproducibility issue

All of the neural network wrappers (pybrain, neurolab, theanets) have problems with reproducibility.

At the moment there are some dirty hacks that change the global seed, which is awful. They also don't work for PyBrain, so results there are completely irreproducible.
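
The usual alternative to reseeding the global generator is to hand each model its own generator object. The sketch below uses only the standard library and a toy `init_weights` function (a hypothetical stand-in); it does not claim that pybrain, neurolab or theanets expose such a parameter.

```python
import random


def init_weights(n_weights, random_state):
    """Draw initial weights from a generator owned by the caller."""
    return [random_state.uniform(-1.0, 1.0) for _ in range(n_weights)]


first = init_weights(4, random.Random(42))

# Unrelated code touching the global generator in between
random.seed(0)
random.random()

second = init_weights(4, random.Random(42))
```

Because each call draws from its own seeded generator, `first` and `second` are identical no matter what happens to the global state in between.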

Travis builds fail after updating ROOT

The ROOT package on conda was updated, and the installation scripts no longer find thisroot.sh.

First failed job: https://travis-ci.org/yandex/rep/builds/113799526

can we have an option to support other ipython parallel engines by means of ipython-cluster-helper?

https://github.com/roryk/ipython-cluster-helper

inform about version of REP during init/run scripts

Why those harsh `==` requirements?

I would really like to use your package; it looks like an awesome simplification for trying different ML algorithms and frameworks.

What's currently stopping me from using it is the == requirements for rather old package versions,
especially pandas 0.14.

Why do you need those == requirements?

Could you check if anything gets broken with newer versions and change the requirement to >=?

Reorganize documentation for grid_search

An introduction to the structure of the module is needed.

FastFM wrapper

Add a wrapper for factorization machines.

Support of threads-based parallel computing in GridSearch

Currently, grid search can only use an IPython cluster for parallel computations.
For uniformity with other parts of the library, it seems reasonable to introduce support for threads here.
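
As a rough illustration of what thread-based evaluation could look like, the sketch below scores every point of a small grid in a thread pool from the standard library. The function names and the toy quality function are hypothetical; this is not how REP's grid search is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Toy quality function standing in for fitting and scoring a model."""
    return -(params['depth'] - 3) ** 2 - (params['rate'] - 0.1) ** 2


def grid_search_threads(grid, score, n_threads=4):
    """Score every point of `grid` in a thread pool and return the best one."""
    names = sorted(grid)
    points = [dict(zip(names, values))
              for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        scores = list(pool.map(score, points))  # results keep the input order
    best = max(range(len(points)), key=scores.__getitem__)
    return points[best], scores[best]


grid = {'depth': [1, 3, 5], 'rate': [0.01, 0.1, 1.0]}
best_params, best_score = grid_search_threads(grid, evaluate)
```

Threads only give a speed-up when the scoring function spends its time in code that releases the interpreter lock, which is typically the case for estimators implemented in native code.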

Single script used for deployment

It would be good to have a single installation script (together with the environment) that is used for Docker image preparation, continuous integration and fabric-like deployment.

Rewrite or remove ROOT notebooks

Currently there is little point in having ROOT notebooks, since they are not directly connected to REP and only use its plotting.
Two notebooks about ROOT are definitely overkill.

develop branch doesn't fly under jupyterhub/everware

When I start the develop version under jupyterhub, it writes the following to the log and fails:

Starting under Jupyterhub
fatal: destination path '/notebooks' already exists and is not an empty directory.
Traceback (most recent call last):
  File "/root/miniconda/envs/jupyterhub_py3/bin/jupyterhub-singleuser", line 21, in <module>
    import notebook
ImportError: No module named 'notebook'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda/envs/jupyterhub_py3/bin/jupyterhub-singleuser", line 23, in <module>
    raise ImportError("JupyterHub single-user server requires notebook >= 4.0")
ImportError: JupyterHub single-user server requires notebook >= 4.0

Numexpr and multithread failures

Training many models in threads while passing the columns argument leads to a kernel restart.

After two hours of debugging we found out that this is due to numexpr, which is unable to work properly in threads. Minimal failing example:

import numpy
import pandas
from rep.metaml.utils import map_on_cluster
from rep.utils import get_columns_in_df

columns = ['Feature_0', 'Feature_1']
size = 10000
x = pandas.DataFrame(numpy.random.random([size, 2]), columns=columns)

def f(x, columns):
    # select the requested columns from a private copy of the frame
    x = get_columns_in_df(x.copy(), columns)
    return x

# run 50 identical tasks on 4 threads
n = 50
result = map_on_cluster('threads-4', f, [x] * n, [columns[:1]] * n)
print('ok')

At the very least, this means we should minimize the use of numexpr (if not exclude it completely).

instructions on how to setup ipython cluster using REP containers

Given: set of machines connected together and to the Internet
Needed: instructions/scripts to set up runnable ipython cluster on those machines using REP containers

makefile/fabfile for setting up cluster
update start script to spawn ipcontroller
Dockerfile/image for REP updated with ipyparallel==4.1.0
targets for starting master
targets for starting slaves
test on Cracow
instructions how to use this stuff
what is the best place for instructions/dockerfiles/makefiles/fabfiles/tests? same repo? same branch? or? (@arogozhnikov?)
tests

Windows instructions for docker

Getting python to work on *nix systems is not a big deal, while for Windows users docker may be a good option compared to conda/pythonXY/etc.

For me it would be enough to simply have the arguments of docker run (I can hardly decode them from the scripts).

TMVA processes are not stopped

Find out under which conditions TMVA processes keep running.

test_optimal_metrics_2dim fails spontaneously

example of failed job: https://travis-ci.org/yandex/rep/jobs/112224597

Reduce time needed to execute notebooks

At the moment Travis spends ~45 minutes, the major part of which goes to testing notebooks.

train_test_split in sklearn preserves columns now

minor remark: in scikit-learn (since 0.16 I think) train_test_split should work as yours.
If you have any other api annoyances that you need to work around, let me know ;)

ps: awesome stuff!

Test GP-based optimization

Build an additional test that GaussianProcess optimization works fine.

For batch optimization, some improvement is probably possible: a strategy that also minimizes the overall variance in addition to looking for the best model. (We need to check whether a well-tested implementation exists.)

Sklearn's Gaussian processes don't support the variance of measurements, which could probably improve the search.

Plots

Plot.ly introduced restrictions that are too strict (50 plots/day), which makes the library almost useless without a paid account.
Given bokeh, mpld3 and the %matplotlib notebook options, I'm sure we can drop support for plot.ly.
TMVA plots require additional testing: at least in some cases they 'hang', showing the same picture for different plots.

New REP docker version running in /var/lib/docker/volumes/ instead of ~/rep_container

Hi.

I had an old REP Docker version in ~/rep_container, which was started with the run.sh script on port 8080. I updated REP and it broke: sudo $REPDIR/run.sh worked, but I couldn't connect to localhost:8080 (connection refused). I decided to update Docker and REP according to the new instructions: https://github.com/yandex/rep/wiki/Install-REP-with-Docker-(Linux).

1. I installed Docker according to the instructions.
2. netstat -anl | grep 8888 gave an empty result.
3. git checkout https://github.com/yandex/rep.git didn't work (pathspec did not match any file(s) known to git), so I used git clone instead.
4. The first run of sudo make run was successful and installed the container.
5. I rebooted, and the second sudo make run gave the following:

docker run -ti --rm -p 8888:8888 --name rep yandex/rep:0.6.4
Error response from daemon: Conflict. The name "rep" is already in use by container 3af0884aeedb. You have to remove (or rename) that container to be able to reuse that name.
make: *** [run] Error 1

6. I ran sudo docker images:

REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
yandex/rep 0.6.4 18a48bc5a3b6 8 hours ago 2.635 GB
anaderi/rep latest 63c3db2850b6 4 months ago 1.649 GB
91c95931e552 7 months ago 910 B
7. I tried sudo docker start rep. It worked and I opened REP on localhost:8888, but its working folder has changed. Now it is /var/lib/docker/volumes/dbcc7ff99538007d9c6b244fb6b8f03bdcfd564f6076b36d79fa3330d2041107/_data/. This is quite inconvenient, because it requires superuser rights to access and is not conveniently located at all.

Question: is this the new behavior, or did I do something wrong? If the latter, how do I fix it and run the REP container in a convenient folder?

Update XGBoost wrapper

XGBoost's own wrapper has been significantly improved.

The new wrapper is almost sklearn-compatible, but copy and clone still fail, and the important field classes_ is not set.

So there are two options:

1. We leave things as they are, since our wrapper works fine.
2. I fix these bugs in the original wrapper and we delete our special wrapper for XGBoost. That means less code to support, but no guarantee that the original wrapper will be properly maintained.

XGBoost feature importance error

When a feature has zero importance, it is not listed among the feature importances.
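
A wrapper can work around this by filling in the missing entries itself. The sketch below assumes the importances arrive as a dict keyed by feature name; the helper and the feature names are made up for illustration.

```python
def fill_missing_importances(features, importances):
    """Return an importance for every feature, using 0.0 for unlisted ones."""
    return {name: importances.get(name, 0.0) for name in features}


features = ['pt', 'eta', 'phi']
reported = {'pt': 12.0, 'phi': 3.0}   # 'eta' is missing from the report
complete = fill_missing_importances(features, reported)
```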

Cache classifier?

Proposal: add a caching classifier for research that requires heavy computations.

Interface

clf = CacheClassifier(name='stage_1', base_estimator=XGBoostClassifier(...))
clf.fit(X, y, sample_weight)
clf.predict(...)

All methods are proxied to the underlying classifier (XGBoostClassifier in this case).

A copy of the trained classifier is saved at .rep_cache/stage_1.pkl, together with a hash of the dataset.

The next time the notebook is executed, if the classifier parameters and the dataset hash are the same, the fit method only loads the already-trained estimator.

There are many possible caveats. The first that come to mind are handling clone and pickle, which are not trivial.
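
The core of the proposal (skip training when both the parameters and the data are unchanged) can be sketched with the standard library alone. Everything here is an assumption made for illustration: `fit_with_cache`, the toy `MeanEstimator`, hashing the data through pickle, and storing only the fitted attributes rather than the whole estimator, which sidesteps the clone and pickle caveats instead of solving them.

```python
import hashlib
import os
import pickle
import tempfile


class MeanEstimator(object):
    """Toy estimator used only to demonstrate caching."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def get_params(self):
        return {'scale': self.scale}

    def fit(self, X, y):
        self.mean_ = self.scale * sum(y) / len(y)
        return self


def fit_with_cache(name, estimator, X, y, cache_dir):
    """Fit `estimator`, or restore its fitted state if params and data match.

    Returns True on a cache hit and False when training actually ran.
    """
    payload = pickle.dumps((sorted(estimator.get_params().items()), X, y))
    key = hashlib.sha256(payload).hexdigest()
    path = os.path.join(cache_dir, name + '.pkl')
    if os.path.exists(path):
        with open(path, 'rb') as cache_file:
            stored_key, state = pickle.load(cache_file)
        if stored_key == key:
            estimator.__dict__.update(state)  # reuse the stored fit
            return True
    estimator.fit(X, y)
    with open(path, 'wb') as cache_file:
        pickle.dump((key, dict(vars(estimator))), cache_file)
    return False


cache_dir = tempfile.mkdtemp()
X, y = [[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0]

first = MeanEstimator()
hit_first = fit_with_cache('stage_1', first, X, y, cache_dir)      # trains
second = MeanEstimator()
hit_second = fit_with_cache('stage_1', second, X, y, cache_dir)    # loads
changed = MeanEstimator(scale=2.0)
hit_changed = fit_with_cache('stage_1', changed, X, y, cache_dir)  # retrains
```

A real implementation would need a more robust dataset hash (for example over the underlying array bytes) and would have to proxy the remaining estimator methods, as described in the proposal.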

do we need to measure fit/predict time without %time?

This would be useful if the Jupyter frontend disconnects during fit/predict execution.

The following snippet might be handy for such cases:

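from __future__ import print_function

import datetime
import time


class Stopwatch(object):
    """Context manager that records the wall-clock time of its block."""

    def __enter__(self):
        self.t0 = datetime.datetime.now()
        return self

    def __exit__(self, type, value, traceback):
        self.t1 = datetime.datetime.now()

    def __repr__(self):
        return "delta: (%s)" % (self.t1 - self.t0)


# time.sleep stands in for the fit and predict calls
with Stopwatch() as sfit:
    time.sleep(1)
with Stopwatch() as spredict:
    time.sleep(1)

print("fit:", sfit, "spredict:", spredict)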