
pylearn2's People

Contributors

aalmah, bartvm, bbudescu, bouthilx, caglar, capybaralet, carriepl, cdevin, chandiar, daemonmaker, dwf, eamartin, euphoris, fvisin, gdesjardins, goodfeli, hantek, herr-biber, jdumas, jych, lamblin, laurent-dinh, memimo, menerve, nouiz, pascanur, sebastien-j, serhalp, superelectric, vdumoulin

pylearn2's Issues

make a git commit hook for fixing white space

pep8 simply issues complaints about whitespace. These could all be fixed automatically by a commit hook, but currently they must be fixed by hand, which is a waste of everyone's time. This kind of thing may already exist and we may be able to just copy-paste it; I have not checked.
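
A hypothetical sketch of such a pre-commit hook, just to illustrate the idea (not a vetted existing solution):

#!/usr/bin/env python
# Hypothetical .git/hooks/pre-commit sketch: strip trailing whitespace from
# staged .py files and re-stage them before the commit is recorded.
import subprocess

staged = subprocess.check_output(
    ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM'])

for fname in staged.splitlines():
    if not fname.endswith('.py'):
        continue
    with open(fname) as f:
        lines = f.readlines()
    with open(fname, 'w') as f:
        f.writelines(line.rstrip() + '\n' for line in lines)
    subprocess.check_call(['git', 'add', fname])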

Make MNIST loader read Yann LeCun's original files

Our MNIST loader class shouldn't depend on the user downloading some arbitrary pickle files that could potentially go stale when a new release of NumPy comes out. Yann's format is so simple that it can be read in with a single numpy.fromfile call; we should do this rather than what we do now.
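
For reference, a rough sketch of reading the IDX image format this way (the function name and magic-number check are illustrative):

import numpy

def read_mnist_images(path):
    # IDX image format: big-endian int32 header (magic, n_images, rows, cols)
    # followed by the raw uint8 pixel values.
    with open(path, 'rb') as f:
        magic, n_images, n_rows, n_cols = numpy.fromfile(f, dtype='>i4', count=4)
        assert magic == 2051, "not an MNIST image file"
        pixels = numpy.fromfile(f, dtype=numpy.uint8)
    return pixels.reshape(n_images, n_rows, n_cols)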

The only thing to check is that all the transposes and such come out the same as they do now.

We should also store hashes of the original files (and their gzipped equivalents) to make sure the download isn't corrupted. A download_mnist() function in this module could grab the files with urllib.

Change environment variable syntax in preprocess()

Our current ${foo} syntax conflicts with the syntax used by the built-in .format() method of strings in Python 2.6+.

>>> '${PYLEARN2_FOO}, {bar}'.format(bar=5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'PYLEARN2_FOO'

I suggest we move to $(PYLEARN2_FOO) instead.

>>> '$(PYLEARN2_FOO), {bar}'.format(bar=5)
'$(PYLEARN2_FOO), 5'
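
A minimal sketch of what the substitution could look like with the proposed syntax (the helper name and regex are illustrative, not the actual preprocess() implementation):

import os
import re

def expand_vars(s):
    # Replace $(VARNAME) with the value of the corresponding environment variable.
    return re.sub(r'\$\(([A-Za-z_][A-Za-z0-9_]*)\)',
                  lambda m: os.environ[m.group(1)], s)

With this, '$(PYLEARN2_FOO), {bar}'.format(bar=5) and expand_vars() no longer step on each other.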

three-pronged dataset serialization proposal

This proposal aims to address the problem of serializing datasets that are referenced by monitors or other objects. It is often desirable to not serialize e.g. a monitoring dataset when you serialize a model.

The proposal addresses datasets that have had preprocessing applied to them that destroys the original data, and allows either

  • serializing the end state of a preprocessed dataset pipeline (for maximal reproducibility, to save computation time, or when the machine performing the deserialization has limited resources, i.e. the dataset has been preprocessed on a machine with more resources than the one on which the model learning will take place)
  • serializing just a description of how to reconstruct the entire preprocessed data (and apply appropriate preprocessing) either
    • at deserialization time
    • at first access

Deserializing at first access (a.k.a. "lazy-loading") has the following advantages:

  • Speed of access: You don't need to reload and re-preprocess the actual data if all you need is the dataset's view converter (i.e. for visualizing weights)
  • Deserialization of an entire hierarchy of objects won't fail due to missing external disk file dependencies (NPY files, HDF5 files, etc.). This could also be accomplished by simply making a "dataless" instance of the dataset class -- lazy-loading implies that this "dataless" state is the default when deserializing-from-description, and the object is mutated to a "data-ful" state the first time actual data is actually needed.

Implementation

First, apply_preprocessor should keep a record of preprocessor objects applied, and in what order. This will be serialized along with the dataset object.

The proposal further adds two boolean flags:

  • serialize_final_state -- A flag, by default set to False, which determines the behaviour at serialization time. If False, only the description of how to rebuild the dataset (raw source plus the recorded preprocessors) is serialized; if True, the fully preprocessed state is serialized.
  • initialized -- Is False if the data has not been loaded from disk or preprocessed, True if the data is usable.

Methods that access the data, such as iterator(), get_design_matrix(), etc. will call a method like

if not self.initialized:
    self.initialize()

At serialization time, if serialize_final_state is False, __getstate__ removes the necessary objects and sets the initialized flag to False.
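
A rough sketch of how these pieces could fit together (attribute and method names are assumptions, not the final API):

class LazyDataset(object):

    serialize_final_state = False

    def initialize(self):
        # Reload the raw data from disk and re-apply the recorded preprocessors.
        self.X = self._load_raw_data()
        for preprocessor in self.preprocessors_applied:
            preprocessor.apply(self)
        self.initialized = True

    def get_design_matrix(self):
        if not self.initialized:
            self.initialize()
        return self.X

    def __getstate__(self):
        state = self.__dict__.copy()
        if not self.serialize_final_state:
            # Drop the heavy arrays; keep only the recipe for rebuilding them.
            state.pop('X', None)
            state['initialized'] = False
        return state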

Advanced pickle features like __getnewargs__ may make things simpler on the deserialization side.

make expressions for published preprocessing

in datasets.preprocessing we could have functions that give you pipelines for performing the preprocessing for various published papers. So far we support the Coates/Lee/Ng "Analysis of Single Layer Networks" preprocessing, Kai Yu's preprocessing for his paper on LCC with local tangents, and Michael Gutmann's sphere preprocessing for his noise contrastive estimation paper. Bundling these with nice little shortcuts could make it easy for people to reproduce their results / compare other methods on the same kind of data.

More efficient/fault-tolerant single-file serialization

joblib.save is much faster than normal pickle and lets you recover your parameters in the case that your pickles go stale, but has the side effect of creating several files on disk instead of one, which may be undesirable for some.

Looking at a way to remedy this with a custom pickler, it seems possible but a bit trickier. From an email exchange with Ian:

The trouble is that joblib currently relies on the ability to write out
arrays to separate files asynchronously, during the (recursive) pickling
procedure. If you specify compression parameters then it will do this,

After some thought, the simplest thing to do to achieve the same speed and
fault tolerance with a single file would be to have the Pickler object store
a list of arrays, replacing them with dummy objects as it goes, then use the
pickler to pickle this munged representation to a StringIO/BytesIO object.

The wrapping "save" function would then open a file for writing, loop through
this list of arrays and repeatedly numpy.save on the same open file
descriptor. When it was done, it would call cPickle.dump(False, f) or
something to act as a sentinel, indicating that there are no more arrays.

np.load() can handle pickles as well as npy data, so unarchiving would look
like this:

import numpy

# f is a pre-existing open file descriptor
arrays = []
while True:
    arr = numpy.load(f)   # np.load falls back to unpickling for non-.npy data
    if arr is False:      # the pickled False acts as the sentinel
        break
    arrays.append(arr)
my_custom_pickler.load(f, arrays)

Now my_custom_pickler has all the information it needs to unpickle while
subbing in the arrays where appropriate (the dummy objects can also keep
track of the index in the list of arrays for which they are a stand-in).
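
For completeness, here is a rough sketch of the save side described above; using pickle's persistent_id hook is one way to get the "dummy object" substitution (the class name and file layout are assumptions, not something implemented in the library):

import numpy
import pickle
from cStringIO import StringIO

class ArrayCollectingPickler(pickle.Pickler):
    """Collect ndarrays during pickling, writing only placeholders for them."""

    def __init__(self, file, protocol=None):
        pickle.Pickler.__init__(self, file, protocol)
        self.arrays = []

    def persistent_id(self, obj):
        if isinstance(obj, numpy.ndarray):
            self.arrays.append(obj)
            return 'array-%d' % (len(self.arrays) - 1)
        return None

def save(filename, obj):
    buf = StringIO()
    pickler = ArrayCollectingPickler(buf, pickle.HIGHEST_PROTOCOL)
    pickler.dump(obj)                    # fills pickler.arrays as a side effect
    with open(filename, 'wb') as f:
        for arr in pickler.arrays:
            numpy.save(f, arr)           # append each array to the same open file
        pickle.dump(False, f)            # sentinel: no more arrays follow
        f.write(buf.getvalue())          # the array-free pickle stream goes last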

A two-file solution is also a possibility, using only one separate file.

Alas, this is a bigger project than I thought, so I'm going to defer actually
implementing this until after this round of deadlines. For now, if you want
to take advantage of joblib's fast saves/fault tolerance and don't mind
multiple files, you can use the file suffix ".joblib". An eventual
single-file format will probably have its own custom suffix as well, ".pl2"
or something.

GIL-avoiding producer-consumer data loader

Ian's complaint about Python's shitty threading support is certainly a valid one. Multiple CPU-bound threads cannot run in parallel in Python, at least not in CPython (and I don't foresee us moving to IronPython/.NET any time soon). One way around this is to use processes for the consumer/producer; the threading module and the multiprocessing module have identical APIs. Another way is to use threads but make sure the I/O-bound portion runs in code that releases the GIL.

It turns out, after some digging, that if numpy.ALLOW_THREADS evaluates true at runtime (see this section of the NumPy C API docs), then NumPy was compiled such that PyArray_FromFile (which is the C function eventually called by numpy.load) already releases the GIL before calling fread(), and reacquires it right after. This means that the costliest part can be done in a parallel thread, and we can go ahead and implement a thread-based producer-consumer model if we like. If we want more done in parallel, we can reimplement as much of this as possible in a Cython "with nogil" block.
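
A minimal sketch of the thread-based version (Python 2 module names; the prefetching scheme is made up for illustration):

import threading
import numpy
from Queue import Queue   # "queue" in Python 3

def _producer(filenames, q):
    for fname in filenames:
        q.put(numpy.load(fname))   # the fread() inside runs with the GIL released
    q.put(None)                    # sentinel: no more data

def iter_batches(filenames, max_prefetch=2):
    q = Queue(maxsize=max_prefetch)
    t = threading.Thread(target=_producer, args=(filenames, q))
    t.daemon = True
    t.start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch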

Benchmark k-means implementations. Make something competitive with MATLAB

Adam Coates' MATLAB implementation is way faster than ours or SciPy's, even though he doesn't do anything particularly fancy.

I'm in the process of changing ours to just use milk. We should benchmark milk against Adam's code (look at the objective function value, runtime, and memory consumption) and make something competitive with his. Be sure to benchmark his on a machine with several cores.

http://www.stanford.edu/~acoates/papers/kmeans_demo.tgz

Make monitor channel names predictable

Related to #49. Right now, every algorithm is going to name the key for its channel in the monitor something different. @goodfeli? @steven-pigeon?

It might make more sense to store training error and validation error under consistent keys, and attach metadata to the channel object, like the name of the algorithm class, etc.

@chandiar and @caglar are working on this currently.

Figure out how to do automatic refactoring and document it

We get lots of bugs from refactoring. Partly this is an inevitable consequence of using python, but some things should still be more automatable.

Pascal Lamblin and Guillaume think Eclipse might be able to do some automatic refactoring for Python. Also, Yann Dauphin has heard good things about an IDE called PyCharm. I got our team an open source developer license for PyCharm.

If someone would like to try out a few IDEs and figure out how well they work for automating common annoying tasks like renaming a class, it would probably be really helpful to start a page in the docs describing which IDEs are recommended and how to set them up to work with pylearn.

make Preprocessor class

The various classes in datasets.preprocessing have common functionality but no superclass to document it in.
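
Something like the following minimal interface would do; the method name and signature here are an assumption based on how the existing preprocessors are used:

class Preprocessor(object):
    """Abstract base class for the objects in datasets.preprocessing."""

    def apply(self, dataset, can_fit=False):
        """
        Modify `dataset` in place.

        `can_fit` says whether the preprocessor is allowed to estimate its
        own parameters (e.g. a whitening matrix) from this dataset, as
        opposed to reusing parameters fit on training data.
        """
        raise NotImplementedError()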

Find how to use floatX in the dataset

In commit 3086268, Ian did a quick fix to make the dataset work with floatX. But we need to think about how to do it correctly in all cases.

If we have floatX=float32 and we always use a dataset in float32, we should not waste memory by keeping it in float64. But there is also the opposite case: sometimes we want to upcast it to float64 at the start, though not always.
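
One possible direction, sketched with a made-up helper name, is to let the caller choose the dtype while defaulting to Theano's floatX:

import numpy
from theano import config

def as_dataset_dtype(X, dtype=None):
    # Default to floatX so a float32 run never carries a float64 copy around,
    # but allow an explicit dtype for callers that want to upcast.
    dtype = config.floatX if dtype is None else dtype
    return numpy.asarray(X, dtype=dtype)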

Convert all prints to use loggers and appropriate log levels.

Specifically, stick something like

import logging
...
...

logger = logging.getLogger("put.module.name.here")

at the top of the module.

Very verbose output with print statements should then be replaced with logger.debug(); less verbose but useful stuff with logger.info(); stuff that goes wrong that the user should know about even if they don't care about info() level stuff should be a logger.warning(), etc.
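
For example (module name and messages invented for illustration):

import logging

logger = logging.getLogger("pylearn2.train")

logger.debug("recomputing particle positions")    # was a very verbose print
logger.info("epoch complete, nll %f", 1.23)       # was a useful progress print
logger.warning("NaN encountered in gradients")    # was a print the user must always see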

This will allow us to easily and programmatically redirect logging output, change logging verbosity, apply common styling to all log messages (add timestamps, blah).

See the Logging Cookbook and the logging module docs.

Stricter compile time checks

See if pychecker or pylint can do stricter compile time checks, such as following imports, than are currently done by pylearn2.devtools.tests.test_via_pyflakes

AIS hasn't been run for some cases

AIS.fX has no self argument, which suggests that AIS has never been run in a mode that results in fX being called. We should fix fX and make sure whatever code path is meant to call it is working.

make Distribution class

Should go in pylearn2.distributions. Existing distributions in that module should inherit from it.

unify monitoring and callbacks

It's important to call the monitor before calling the callbacks. Factor this logic into one piece of reusable code so that we don't need to worry about each implementer of a training_algorithm screwing it up.
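
A sketch of the kind of shared helper this could become (all names here are hypothetical):

def monitor_then_callbacks(model, monitor, callbacks):
    # Monitoring must run first so the callbacks see up-to-date channel values.
    monitor()
    for callback in callbacks:
        callback(model)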

standardize on ML-specific code conventions

The "coding style" document produced out of the committee stuff did a pretty good job of things (save for the brewing disagreement over module renaming) but in order that we don't end up with think we need to discuss and standardize on things like

  • naming conventions for method/function arguments, learnable parameters, hand-set hyperparameters
  • naming conventions for classes
  • naming conventions for other variables
  • use of acronyms, greek letters, etc.

There's a fair amount of disagreement about a number of these issues, but I wanted to assemble them in one place, along with comments for and against. I've tried to keep this post neutral in tone and not colour it with my own opinion, but please add comments both expressing your view on a certain issue and correcting anything you feel I've unfairly represented.

Method/function arguments

  • Should they be named simple one letter names (X, y), something longer and more descriptive (inputs, labels), something else entirely?
  • Arguments for X, y etc.: makes complicated mathematical expressions easier to read, might correspond with usage in paper describing algorithm
  • Arguments for inputs, labels, etc.: perhaps clearer to novices, complicated expressions could be broken up in terms of temporaries

Learnable parameters

  • Should these be distinguished in some particular way, other than their membership in the _params attribute? e.g., scikits-learn uses a convention of suffixing all learnable parameter object members with a single underscore, e.g. self.weights_. Of course pylearn2 parameters will be shared variables rather than numpy arrays, so perhaps this particular convention should be avoided to prevent confusion, but should we adopt something similar?

Hyperparameters

  • Should they be named by their putative function (sparsity_penalty), by the name assigned to them in a particular paper (eta, epsilon), by some other convention? One complicating factor is that lambda is a reserved word in Python, so you can't name an object member lambda (just like you can't name it 'for' or 'else').

Other variables

  • Should any particular conventions apply?

Class Names

  • Models are typically given their usual name from the literature (DenoisingAutoencoder, ContractiveAutoencoder, BinaryGaussianRBM). Preprocessing varies between verb-phrases (ExtractPatches), noun-phrases (LocalContrastNormalization), etc. Should the object be named as a "doer" of whatever it does (e.g. PatchExtractor, LocalContrastNormalizer), the noun phrase of the accomplished processing (PatchExtraction, LocalContrastNormalization), or what?
  • Typically underscores are left off in favour of camelcase for class names (as denoted in PEP8). Should we adhere to this convention?

Acronyms

  • Certain acronyms are in wide use and perhaps unambiguous enough (e.g. RBM). Others are overloaded and potentially more opaque (e.g. SMD for "Score Matching Denoising", "Stochastic Meta-Descent", or "Stochastic Mirror Descent"). How should we deal with acronyms? Should there be a whitelist (certain acronyms are okay, others aren't)? Or should it be up to the user to dig through docstrings?

remove dependency on TheanoLinear

As per dwf's review of my convolution support pull request, we should either wrap TheanoLinear up in pylearn2 or in Theano. We should probably talk to James about what he'd like to do most.

Import public classes at the subpackage level.

We should be referring to stuff as pylearn2.datasets.DenseDesignMatrix, not pylearn2.datasets.dense_design_matrix.DenseDesignMatrix, by importing in datasets/__init__.py.

This has three effects:

a) It makes clear precisely what the public classes that are meant to be used are.
b) It gives us the freedom to move things around within the files in a given subpackage.
c) It means less to type in a lot of places.
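
Concretely, pylearn2/datasets/__init__.py would gain lines like this one (and similarly for the other public classes):

from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix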

Clear error message about pylearn1

We should detect when people don't have pylearn1 installed and raise an ImportError that explains pylearn2's dependence on pylearn1, and possibly says where to download it. This error message should only be raised when we actually attempt to import pylearn.
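
A sketch of the guarded import (the message wording is just a suggestion):

try:
    import pylearn
except ImportError:
    raise ImportError("pylearn2 depends on pylearn (v1) for this functionality. "
                      "Please install it and make sure it is on your PYTHONPATH.")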

problem in viewing the weights of the GRBM

Hello,
I am trying to check the weights of the GRBM, but it seems there is a bug: self.weights is never assigned.
You can check the output pickle file from training; get_weights() raises an error.
Thanks

Compile time checks

Figure out some way of automating compile time checks so that really simple bugs like incomplete refactoring updates don't require as much of a unit test burden to catch.

This should be possible; after all, vim has syntax highlighting plugins that do it. We just need an interface to that same functionality that lets us scan the whole library and emit a list of errors.

Pascal Lamblin says pylint (sp?) might be able to do this.

One stochastic gradient to rule them all

Unify SGD and UnsupervisedBlahSGD.

I floated StochasticGradient for the name, but UnsupervisedStochasticGradient or UnsupervisedSGD might be more descriptive.

Are supervised and unsupervised tasks similar enough to unify them transparently? I suppose most of that logic can be pushed to the cost function.

Let's concentrate on the unsupervised case for now.

linear docstrings

If you're overriding a method you can suck the superclass docstring into the subclass by using the @functools.wraps(SuperClassName.method_name) decorator.
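
For example (the class and method names are placeholders):

import functools

class Base(object):
    def get_weights(self):
        """Return the model's weight matrix as a numpy array."""
        raise NotImplementedError()

class MyModel(Base):
    @functools.wraps(Base.get_weights)
    def get_weights(self):
        # functools.wraps copies __doc__ (and __name__) from Base.get_weights,
        # so this override inherits the superclass docstring.
        return self.weights.get_value()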

train_example unit test

Make a unit test based on scripts/train_example

It should do basically the same thing as "train.py cifar_grbm_smd.yaml" but with these changes:

  • Don't use the real dataset. The preprocessing is expensive, and cifar10 is a reasonably big download; we don't want to force everyone who uses the library to download cifar10. Randomly generated data might be ok, or maybe we could host a .npy file with a few hundred thousand examples somewhere that the unit test can download if the file isn't found.
  • Set save_freq to 0, so the test doesn't modify the filesystem.

The test should work by checking that the initial and final values in the monitor are pretty close to the values obtained by some original version of the code that we believe to be working. It should also check the length of the monitor to make sure convergence is happening at the same rate.

One question about how to implement this test: do we want it to load the yaml file and then modify the object to use different data? This would require adding an interface for changing the dataset of a Train object. An alternative is to manually update the test to remain similar to the yaml file, but it seems like it would be easy for the test and the yaml to get out of sync this way.
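
Whichever way the dataset question is resolved, the comparison itself could look roughly like this (EXPECTED holds values recorded from a known-good run; the channel and attribute names are assumptions about the Monitor interface):

import numpy

def check_against_reference(monitor, expected, rtol=1e-3):
    for name, ref_values in expected.items():
        channel = monitor.channels[name]
        values = numpy.asarray(channel.val_record)
        # Same number of monitoring steps => convergence at roughly the same rate.
        assert len(values) == len(ref_values)
        # Initial and final values close to the known-good run.
        assert numpy.allclose(values[0], ref_values[0], rtol=rtol)
        assert numpy.allclose(values[-1], ref_values[-1], rtol=rtol)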

modularize SGD training algorithm

Things are a bit tightly coupled in there, and it might be worth thinking about other use cases (i.e. where people want to reuse some of our nitty gritty SGD logic but "be their own driver", etc.). I think we'll see more uptake of pylearn2 if you can choose with a bit more granularity "how much & what for" you want to use the library.

High-level abstractions are good, but high-level with a medium-level option is better.

  • See if we can pull it apart so different aspects can be used in isolation, while maintaining the "simple recipe" for training that is currently possible via yaml experiment descriptions
  • See if we can leverage anything from the previously written SGDOptimizer in optimizers.py before we just trash it

clean up sgd; make term_crit module?

Lots (all?) of the termination criteria in training_algorithms.sgd are not specific to the sgd training algorithm. These should be put in a separate module somewhere so that it's easier to find them by browsing the code tree / to make it more obvious that they could be used with other training algorithms.
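
A sketch of the shared interface such a module could expose (the method name is a guess at what would fit the existing sgd code):

class TerminationCriterion(object):
    """Decides when a training algorithm should stop."""

    def continue_learning(self, model):
        """Return True if training should continue, False to stop."""
        raise NotImplementedError()

class EpochCounter(TerminationCriterion):
    """Stop after a fixed number of calls (i.e. epochs)."""

    def __init__(self, max_epochs):
        self._max_epochs = max_epochs
        self._epochs_done = 0

    def continue_learning(self, model):
        self._epochs_done += 1
        return self._epochs_done <= self._max_epochs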
