rep's Issues

TransformerMixin

It would be nice to support sklearn's TransformerMixin, and clustering estimators as well.

FoldingRegressor

There is only a FoldingClassifier; it would be nice to have a FoldingRegressor too.
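
One way such a regressor could work is to mirror the usual k-folding scheme: train one estimator per fold on the remaining folds, then predict each training sample with the estimator that never saw it. The sketch below only illustrates that idea under this assumption and is not REP's API; FoldingRegressorSketch, MeanRegressor and predict_training are hypothetical names.

```python
import random


class MeanRegressor(object):
    """Toy stand-in for a real regressor: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


class FoldingRegressorSketch(object):
    """Hypothetical k-folding wrapper; not part of REP."""

    def __init__(self, make_estimator, n_folds=2, random_state=0):
        self.make_estimator = make_estimator
        self.n_folds = n_folds
        self.random_state = random_state

    def fit(self, X, y):
        # Assign every sample to one of n_folds folds in shuffled order.
        self.fold_of_ = [i % self.n_folds for i in range(len(X))]
        random.Random(self.random_state).shuffle(self.fold_of_)
        self.estimators_ = []
        for fold in range(self.n_folds):
            # Estimator number `fold` is trained on all samples outside that fold.
            train = [i for i, f in enumerate(self.fold_of_) if f != fold]
            estimator = self.make_estimator()
            estimator.fit([X[i] for i in train], [y[i] for i in train])
            self.estimators_.append(estimator)
        return self

    def predict_training(self, X):
        """Predict the samples given to fit, each with the estimator that never saw it."""
        return [self.estimators_[fold].predict([row])[0]
                for row, fold in zip(X, self.fold_of_)]


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 10.0, 10.0]
folder = FoldingRegressorSketch(MeanRegressor, n_folds=2).fit(X, y)
print(folder.predict_training(X))
```

Passing a factory (here the MeanRegressor class itself) instead of an instance sidesteps the question of how to clone estimators, which a real implementation would have to answer.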

DTypes problem in current xgboost version

XGBoost accepts only float64, int64 and bool dtypes for DataFrame columns:

    151     dtypes = data.dtypes
    152     if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
--> 153         raise ValueError('DataFrame.dtypes must be int, float or bool')
    154 
    155     if feature_names is None:

ValueError: DataFrame.dtypes must be int, float or bool

A newer version of XGBoost should be used.
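
Until a newer XGBoost is in use, one possible workaround is to upcast the columns before passing the DataFrame on. The sketch below assumes the only obstacle is the dtype-name check quoted in the traceback; to_accepted_dtypes and ACCEPTED_DTYPES are hypothetical names, not part of REP or XGBoost.

```python
import numpy as np
import pandas as pd

# dtype names that pass the check quoted in the traceback above
ACCEPTED_DTYPES = ('int64', 'float64', 'bool')


def to_accepted_dtypes(df):
    """Return a copy of df with integer columns as int64 and float columns as float64."""
    result = df.copy()
    for name in df.columns:
        dtype = df[name].dtype
        if dtype.name in ACCEPTED_DTYPES:
            continue
        if pd.api.types.is_integer_dtype(dtype):
            # Note: unsigned values above the int64 range would not survive this cast.
            result[name] = df[name].astype('int64')
        elif pd.api.types.is_float_dtype(dtype):
            result[name] = df[name].astype('float64')
        else:
            raise TypeError('column %r has unsupported dtype %s' % (name, dtype))
    return result


frame = pd.DataFrame({
    'x': np.array([0.5, 1.5], dtype='float32'),
    'n': np.array([1, 2], dtype='int32'),
    'flag': [True, False],
})
converted = to_accepted_dtypes(frame)
print(converted.dtypes)
```

The cast doubles the memory used by float32 columns, so this is a stopgap rather than a replacement for upgrading.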

test failed

After python setup.py install I ran cd tests ; nosetests .
It ran for a long time and ended with errors:

..Info in <TCanvas::Print>: png file /tmp/tmpBg1dar.png has been created
Error in <TFile::TFile>: file toy_datasets/toyMC_bck_mass.root does not exist
E..E.
======================================================================
ERROR: tests.z_test_notebook.test_notebooks_in_folder('/root/rep/howto/00-intro-ROOT.ipynb',)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/root/rep/rep/test/test_notebooks.py", line 43, in check_single_notebook
    raise RuntimeError(description)
RuntimeError: Cell failed: 'T.Draw("min_DOCA")
c1'

 Traceback:
---------------------------------------------------------------------------
ReferenceError                            Traceback (most recent call last)
<ipython-input-5-aa6c7320180d> in <module>()
----> 1 T.Draw("min_DOCA")
      2 c1

ReferenceError: attempt to access a null-pointer

What am I missing?

Manual Install on Windows

Hi!
Is there a way to install REP manually in a Windows environment?
When installing dependencies, I get an error while installing gnureadline:

Error: this module is not meant to work on Windows (try pyreadline instead)

Is there a way to use pyreadline for Windows users?
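
One conceivable approach is to pick the readline package per platform when assembling the install requirements. The sketch below is only an assumption about how that choice could be made; readline_requirement is a hypothetical helper, and whether REP actually works with pyreadline is not confirmed here.

```python
import sys


def readline_requirement(platform=None):
    """Pick a readline package name for the given sys.platform value."""
    platform = sys.platform if platform is None else platform
    # The gnureadline error message itself suggests pyreadline on Windows.
    return 'pyreadline' if platform.startswith('win') else 'gnureadline'


print(readline_requirement())
```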

test_xgboost file is not running on Windows 10

The test_xgboost file does not run on Windows 10:
File "c:\Sander\my_code\rep-master\tests\test_xgboost.py", line 4, in
from rep.estimators import XGBoostClassifier, XGBoostRegressor

ImportError: cannot import name XGBoostClassifier

The rep installation succeeds, but the xgboost install fails:
Microsoft Windows Version 10.0.10586 2015 Microsoft Corporation. All rights reserved.

c:\Sander>pip install rep --no-dependencies
Collecting rep
Downloading rep-0.6.5.tar.gz (72kB)
100% |################################| 81kB 511kB/s
Building wheels for collected packages: rep
Running setup.py bdist_wheel for rep ... done
Stored in directory: C:\Users\Sander\AppData\Local\pip\Cache\wheels\db\ee\06\ac6e3f3ec208edaee29654f0b55ffaf2719a51de799c396b91
Successfully built rep
Installing collected packages: rep
Successfully installed rep-0.6.5
You are using pip version 8.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

c:\Sander>pip install xgboost==0.4a30
Collecting xgboost==0.4a30
Downloading xgboost-0.4a30.tar.gz (753kB)
100% |################################| 757kB 553kB/s
No files/directories in c:\users\sander\appdata\local\temp\pip-build-exobfm\xgboost\pip-egg-info (from PKG-INFO)
You are using pip version 8.1.0, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

c:\Sander>

Grid_search doesn't use metric prefitting

estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=15), features=features)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer)
grid_finder.fit(data, labels)

results in

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
 in ()
      1 estimator = SklearnClassifier(GradientBoostingClassifier(n_estimators=15), features=features)
      2 grid_finder = GridOptimalSearchCV(estimator, generator, scorer)
----> 3 grid_finder.fit(data, labels)

/Users/axelr/ipython/rep/rep/metaml/gridsearch.py in fit(self, X, y, sample_weight)
    530                 state_indices, state_dict = self.params_generator.generate_next_point()
    531                 status, value = apply_scorer(self.scorer, state_dict, self.base_estimator, X, y, sample_weight)
--> 532                 assert status == 'success', 'Error during grid search ' + str(value)
    533                 self.params_generator.add_result(state_indices, value)
    534                 self.evaluations_done += 1

AssertionError: Error during grid search 'RocAuc' object has no attribute 'classes_'

float32 not supported by xgboost

/root/miniconda/envs/rep_py2/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in _maybe_from_pandas(data, feature_names, feature_types)
    151     dtypes = data.dtypes
    152     if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
--> 153         raise ValueError('DataFrame.dtypes must be int, float or bool')
    154
    155     if feature_names is None:

ValueError: DataFrame.dtypes must be int, float or bool

Docker with python3

Hi!

Using REP with Docker is really nice and easy. Is it possible to have the same for Python 3 or do I have to build the image myself?

Problems with LVQ in neurolab

The following minimal example by @mschlupp fails

from rep.estimators import NeurolabClassifier
import pandas as pd
clf = NeurolabClassifier(net_type='learning-vector')
ds = pd.DataFrame()
ds['feature1']=[0,1,2,3,4,5]
ds['feature2']=[5,7,2,4,7,9]
ds['y']=[0,0,0,1,1,1]
clf.fit(ds[['feature1','feature2']],ds['y'])

since transf makes no sense for LVQ.

Problem with TMVAClassifier

After installing REP from here, I ran into the following problem when fitting TMVAClassifier: an IOError is raised after these lines:

baseline = TMVAClassifier(method='kBDT', features=variables, BoostType='Grad', NTrees=40, Shrinkage=0.01, MaxDepth=7, UseNvars=6, nCuts=-1)

baseline.fit(train, train['signal'])

The stack trace follows:

IOError                                   Traceback (most recent call last)
 in ()
      3 UseNvars=6, nCuts=-1)
      4 # baseline = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
----> 5 baseline.fit(train, train['signal'])

/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in fit(self, X, y, sample_weight)
    288 self.factory_options = '{}:AnalysisType=Multiclass'.format(self.factory_options)
    289
--> 290 return self._fit(X, y, sample_weight=sample_weight)
    291
    292 def predict_proba(self, X):

/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in _fit(self, X, y, sample_weight, model_type)
    104 add_info = _AdditionalInformation(directory, model_type=model_type)
    105 try:
--> 106 self._run_tmva_training(add_info, X, y, sample_weight)
    107 finally:
    108 self._remove_tmp_directory(directory)

/usr/local/lib/python2.7/dist-packages/rep-0.6.3-py2.7.egg/rep/estimators/tmva.pyc in run_tmva_training(self, info, X, y, sample_weight)
    134 xml_filename = os.path.join(info.directory, 'weights',
    135 '{job}{name}.weights.xml'.format(job=info.tmva_job, name=self._method_name))
--> 136 with open(xml_filename, 'r') as xml_file:
    137 self.formula_xml = xml_file.read()
    138

IOError: [Errno 2] No such file or directory: '/home/artem/Documents/IPython Notebooks/CERN + Yandex/Original Baseline/flavours-of-physics-start/tmp0Fhtqe/weights/TMVAEstimation_REP_Estimator.weights.xml'

As far as I can tell, the weights/ folder was created outside the temporary folder rather than inside it, which causes the error above.

ROOT 5.34, Python 2.7, GCC 4.8, Ubuntu 14.04 LTS (x64). All requirements for REP were installed successfully (from requirements.txt).

Factory doesn't fit given plain numpy arrays

i.e. this works:

X = XGBoostClassifier()
X.fit(D.data['X_train'], D.data['Y_train'])

and this doesn't:

factory = ClassifiersFactory()
factory.add_classifier('ada', AdaBoostClassifier(n_estimators=100))
factory['xgb'] = XGBoostClassifier()
factory.fit(D.data['X_train'], D.data['Y_train'])

complaining:

/usr/local/lib/python2.7/dist-packages/rep/metaml/factory.pyc in fit(self, X, y, sample_weight, parallel_profile, features)
     48         :return: self
     49         """
---> 50         assert isinstance(X, pandas.DataFrame), 'The passed '
     51         if features is not None:
     52             for name, estimator in self.items():

AssertionError: The passed 
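
Since the assertion in the traceback requires a pandas.DataFrame, a possible workaround on the caller's side is to wrap the array before fitting. This is a minimal sketch under that assumption; as_dataframe and the generated column names are hypothetical.

```python
import numpy
import pandas


def as_dataframe(X, prefix='Feature_'):
    """Wrap a 2-D array in a DataFrame; DataFrames are returned unchanged."""
    if isinstance(X, pandas.DataFrame):
        return X
    X = numpy.asarray(X)
    if X.ndim != 2:
        raise ValueError('expected a 2-D array, got %d dimension(s)' % X.ndim)
    # Generated names are placeholders; real feature names are better if known.
    columns = ['%s%d' % (prefix, i) for i in range(X.shape[1])]
    return pandas.DataFrame(X, columns=columns)


X_train = as_dataframe(numpy.random.random([5, 2]))
print(list(X_train.columns))
```

With such a helper the failing call would read factory.fit(as_dataframe(D.data['X_train']), D.data['Y_train']), assuming the factory places no further requirements on its input.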

OptimalMetricNdim - better documentation

  1. OptimalMetricNdim doesn't follow the metric conventions. This should be stated explicitly; it should not derive from the metric mixin, and call should be renamed to something like compute_optimum.
  2. Good explanation (probably with example) is needed as well.

Why those harsh `==` requirements?

I would really like to use your package; it looks like an awesome simplification for trying different ML algorithms and frameworks.

What's currently stopping me from using it are the == requirements for rather old package versions,
especially pandas 0.14.

Why do you need those == requirements?

Could you check whether anything breaks with newer versions and change the requirements to >=?

Single script used for deployment

It would be good to have a single installation script (together with the environment), which would be used for Docker image preparation, continuous integration and fabric-like integration.

Rewrite or remove ROOT notebooks

Currently there is little point in having ROOT notebooks, since they are not connected to REP directly and only use plotting.
Two notebooks about ROOT are definitely overkill.

develop branch doesn't fly under jupyterhub/everware

When I start the develop version under jupyterhub, it reports the following in the log and fails:

Starting under Jupyterhub
fatal: destination path '/notebooks' already exists and is not an empty directory.
Traceback (most recent call last):
  File "/root/miniconda/envs/jupyterhub_py3/bin/jupyterhub-singleuser", line 21, in <module>
    import notebook
ImportError: No module named 'notebook'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda/envs/jupyterhub_py3/bin/jupyterhub-singleuser", line 23, in <module>
    raise ImportError("JupyterHub single-user server requires notebook >= 4.0")
ImportError: JupyterHub single-user server requires notebook >= 4.0

Numexpr and multithread failures

When many models are trained in threads and the columns argument is passed, the kernel restarts.

After two hours of debugging we found that this is caused by numexpr, which does not work reliably in threads. Minimal failing example:

from rep.metaml.utils import map_on_cluster
import pandas
import numpy
from rep.utils import get_columns_in_df

columns = ['Feature_0', 'Feature_1']
size = 10000
x = pandas.DataFrame(numpy.random.random([size, 2]), columns=columns)

def f(x, columns):
    x = get_columns_in_df(x.copy(), columns)
    return x

n = 50
result = map_on_cluster('threads-4', f, [x] * n, [columns[:1]] * n)
print('ok')

At the very least, this means we should minimize the use of numexpr (if not drop it completely).

instructions on how to setup ipython cluster using REP containers

Given: set of machines connected together and to the Internet
Needed: instructions/scripts to set up runnable ipython cluster on those machines using REP containers

  • makefile/fabfile for setting up cluster
  • update start script to spawn ipcontroller
  • Dockerfile/image for REP updated with ipyparallel==4.1.0
  • targets for starting master
  • targets for starting slaves
  • test on Cracow
  • instructions how to use this stuff
  • what is the best place for instructions/dockerfiles/makefiles/fabfiles/tests? same repo? same branch? or? (@arogozhnikov?)
  • tests

Windows instructions for docker

Getting Python to work on *nix systems is not a big deal, while for Windows users Docker may be a good option compared to conda/pythonXY/etc.

For me it would be enough to simply have the arguments of docker run (I can hardly decode them from the scripts).

Test GP-based optimization

Build an additional test that checks that GaussianProcess optimization works.

For batch optimization, some improvement is probably possible that also minimizes the overall variance in addition to looking for the optimal model. (Need to check whether a well-tested implementation exists.)

Sklearn's Gaussian processes don't support the variance of measurements, which could probably improve the search.

Plots

  • Plot.ly introduced overly strict restrictions (50 plots/day), which makes the library almost useless without a paid account
  • given the bokeh, mpld3 and %matplotlib notebook options, I'm sure we can drop support for plot.ly
  • TMVA plots require additional testing: at least in some cases they 'hang', showing the same picture for different plots.

New REP docker version running in /var/lib/docker/volumes/ instead of ~/rep_container

Hi.

I had an old REP Docker version in ~/rep_container, which was started with the run.sh script on port 8080. I updated REP and it broke: sudo $REPDIR/run.sh worked, but I couldn't connect to localhost:8080 (connection refused). I decided to update Docker and REP according to the new instructions: https://github.com/yandex/rep/wiki/Install-REP-with-Docker-(Linux).

  1. I installed Docker, according to instructions.
  2. netstat -anl | grep 8888 gave empty result
  3. git checkout https://github.com/yandex/rep.git didn't work (pathspec did not match any file(s) known to git), so I used git clone instead.
  4. First run of sudo make run was successful and installed container.
  5. I rebooted and second sudo make run gave the following

docker run -ti --rm -p 8888:8888 --name rep yandex/rep:0.6.4
Error response from daemon: Conflict. The name "rep" is already in use by container 3af0884aeedb. You have to remove (or rename) that container to be able to reuse that name.
make: *** [run] Error 1

  6. I ran sudo docker images

REPOSITORY    TAG      IMAGE ID       CREATED        VIRTUAL SIZE
yandex/rep    0.6.4    18a48bc5a3b6   8 hours ago    2.635 GB
anaderi/rep   latest   63c3db2850b6   4 months ago   1.649 GB
                       91c95931e552   7 months ago   910 B

  7. I tried sudo docker start rep. It worked and I opened REP on localhost:8888, but its working folder changed. Now it is /var/lib/docker/volumes/dbcc7ff99538007d9c6b244fb6b8f03bdcfd564f6076b36d79fa3330d2041107/_data/. This is quite inconvenient, because it requires superuser rights to access and is not conveniently located at all.

Question: is this the new setup, or did I do something wrong? If the latter, how do I fix it and run the REP container in a convenient folder?

Update XGBoost wrapper

XGBoost wrapper was significantly improved.

The new wrapper is almost sklearn-compatible, but copy and clone still fail, and the important field classes_ is not set.

So there are two ways:

  1. We leave things as they are, since our wrapper works fine.
  2. or I fix these bugs in original wrapper and we delete special wrapper for XGBoost. Less code to support, but no guarantee about proper support of that wrapper.

Cache classifier?

Proposal: add a cache classifier for research that requires heavy computations.

Interface

clf = CacheClassifier(name='stage_1', base_estimator=XGBoostClassifier(...))
clf.fit(X, y, sample_weight)
clf.predict(...)

All the methods are proxied to initial classifier (XGBoostClassifier in this case).

Copy of trained classifier is saved at .rep_cache/stage_1.pkl, together with hash of dataset.

The next time the notebook is executed, if the classifier parameters and the dataset hash are the same, the fit method only loads the already-trained estimator.

There are many possible caveats; the first to think about is handling clone and pickle. Those are not trivial.
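
The load-or-fit logic of the proposal could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposed implementation: CacheClassifierSketch and MajorityClassifier are hypothetical names, the fingerprint assumes parameters and data pickle deterministically, only the fitted attributes are stored (rather than the whole estimator) to keep the example short, and the clone and pickle caveats are not addressed.

```python
import copy
import hashlib
import os
import pickle
import tempfile


class MajorityClassifier(object):
    """Toy stand-in for a heavy estimator: always predicts the most frequent label."""

    def fit(self, X, y, sample_weight=None):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)


class CacheClassifierSketch(object):
    """Hypothetical wrapper: fit() reuses a stored result when parameters and data match."""

    def __init__(self, name, base_estimator, cache_dir='.rep_cache'):
        self.name = name
        self.base_estimator = base_estimator
        self.cache_dir = cache_dir

    def _fingerprint(self, X, y, sample_weight):
        # Hash of the estimator parameters together with the training data.
        payload = pickle.dumps((vars(self.base_estimator), X, y, sample_weight), protocol=2)
        return hashlib.sha1(payload).hexdigest()

    def fit(self, X, y, sample_weight=None):
        key = self._fingerprint(X, y, sample_weight)
        path = os.path.join(self.cache_dir, self.name + '.pkl')
        estimator = copy.deepcopy(self.base_estimator)
        self.loaded_from_cache_ = False
        if os.path.exists(path):
            with open(path, 'rb') as cache_file:
                cached_key, fitted_state = pickle.load(cache_file)
            if cached_key == key:
                # Same parameters and data: restore the fitted attributes.
                vars(estimator).update(fitted_state)
                self.loaded_from_cache_ = True
        if not self.loaded_from_cache_:
            estimator.fit(X, y, sample_weight)
            if not os.path.isdir(self.cache_dir):
                os.makedirs(self.cache_dir)
            with open(path, 'wb') as cache_file:
                pickle.dump((key, vars(estimator)), cache_file, protocol=2)
        self.estimator_ = estimator
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


cache_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
X, y = [[0], [1], [2]], [0, 1, 1]
first = CacheClassifierSketch('stage_1', MajorityClassifier(), cache_dir).fit(X, y)
second = CacheClassifierSketch('stage_1', MajorityClassifier(), cache_dir).fit(X, y)
print(first.loaded_from_cache_, second.loaded_from_cache_)
```

Hashing the full dataset on every fit has its own cost, so a real implementation would need a cheaper or incremental fingerprint for large inputs.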

do we need to measure fit/predict time without %time?

It is useful if the Jupyter frontend disconnects during fit/predict execution.

The following snippet might be handy for such cases:

import datetime
import time


class Stopwatch(object):
    """Context manager that records the wall-clock time spent inside the with block."""

    def __enter__(self):
        self.t0 = datetime.datetime.now()
        return self

    def __exit__(self, type, value, traceback):
        self.t1 = datetime.datetime.now()

    def __repr__(self):
        return "delta: (%s)" % (self.t1 - self.t0)


with Stopwatch() as sfit:
    time.sleep(1)  # stands in for fit
with Stopwatch() as spredict:
    time.sleep(1)  # stands in for predict

print("fit: %s spredict: %s" % (sfit, spredict))
