
pylearn2's People

Contributors

aalmah, bartvm, bbudescu, bouthilx, caglar, capybaralet, carriepl, cdevin, chandiar, daemonmaker, dwf, eamartin, euphoris, fvisin, gdesjardins, goodfeli, hantek, herr-biber, jdumas, jych, lamblin, laurent-dinh, memimo, menerve, nouiz, pascanur, sebastien-j, serhalp, superelectric, vdumoulin

pylearn2's Issues

make a git commit hook for fixing white space

pep8 simply issues complaints about whitespace. These could all be fixed automatically by a commit hook, but currently they must be fixed by hand, which is a waste of everyone's time. This kind of thing may already exist and we may be able to just copy-paste it; I have not checked.
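
A hypothetical sketch of such a pre-commit hook, just to illustrate the idea (not a vetted existing solution):

#!/usr/bin/env python
# Hypothetical .git/hooks/pre-commit sketch: strip trailing whitespace from
# staged .py files and re-stage them before the commit is recorded.
import subprocess

staged = subprocess.check_output(
    ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM'])

for fname in staged.splitlines():
    if not fname.endswith('.py'):
        continue
    with open(fname) as f:
        lines = f.readlines()
    with open(fname, 'w') as f:
        f.writelines(line.rstrip() + '\n' for line in lines)
    subprocess.check_call(['git', 'add', fname])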

Make MNIST loader read Yann LeCun's original files

Our MNIST loader class shouldn't depend on the user downloading some arbitrary pickle files that could potentially go stale when a new release of NumPy comes out. Yann's format is so simple that it can be read in with a single numpy.fromfile call; we should do this rather than what we do now.
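
For reference, a rough sketch of reading the IDX image format this way (the function name and magic-number check are illustrative):

import numpy

def read_mnist_images(path):
    # IDX image format: big-endian int32 header (magic, n_images, rows, cols)
    # followed by the raw uint8 pixel values.
    with open(path, 'rb') as f:
        magic, n_images, n_rows, n_cols = numpy.fromfile(f, dtype='>i4', count=4)
        assert magic == 2051, "not an MNIST image file"
        pixels = numpy.fromfile(f, dtype=numpy.uint8)
    return pixels.reshape(n_images, n_rows, n_cols)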

The only thing to check is that all the transposes and such come out the same as they do now.

We should also store hashes of the original files (and their gzipped equivalents) to make sure the download isn't corrupted. A download_mnist() function in this module could grab the files with urllib.

Change environment variable syntax in preprocess()

Our current ${foo} syntax conflicts with the syntax used by the built-in .format() method of strings in Python 2.6+.

>>> '${PYLEARN2_FOO}, {bar}'.format(bar=5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'PYLEARN2_FOO'

I suggest we move to $(PYLEARN2_FOO) instead.

>>> '$(PYLEARN2_FOO), {bar}'.format(bar=5)
'$(PYLEARN2_FOO), 5'
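
A minimal sketch of what the substitution could look like with the proposed syntax (the helper name and regex are illustrative, not the actual preprocess() implementation):

import os
import re

def expand_vars(s):
    # Replace $(VARNAME) with the value of the corresponding environment variable.
    return re.sub(r'\$\(([A-Za-z_][A-Za-z0-9_]*)\)',
                  lambda m: os.environ[m.group(1)], s)

With this, '$(PYLEARN2_FOO), {bar}'.format(bar=5) and expand_vars() no longer step on each other.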

three-pronged dataset serialization proposal

This proposal aims to address the problem of serializing datasets that are referenced by monitors or other objects. It is often desirable to not serialize e.g. a monitoring dataset when you serialize a model.

The proposal addresses datasets that have had preprocessing applied to them that destroys the original data, and allows either

  • serializing the end state of a preprocessed dataset pipeline (for maximal reproducibility, to save computation time, or when the machine performing the deserialization has limited resources, i.e. the dataset has been preprocessed on a machine with more resources than the one on which the model learning will take place)
  • serializing just a description of how to reconstruct the entire preprocessed data (and apply appropriate preprocessing) either
    • at deserialization time
    • at first access

Deserializing at first access (a.k.a. "lazy-loading") has the following advantages:

  • Speed of access: You don't need to reload and re-preprocess the actual data if all you need is the dataset's view converter (i.e. for visualizing weights)
  • Deserialization of an entire hierarchy of objects won't fail due to missing external disk file dependencies (NPY files, HDF5 files, etc.). This could also be accomplished by simply making a "dataless" instance of the dataset class -- lazy-loading implies that this "dataless" state is the default when deserializing-from-description, and the object is mutated to a "data-ful" state the first time actual data is actually needed.

Implementation

First, apply_preprocessor should keep a record of preprocessor objects applied, and in what order. This will be serialized along with the dataset object.

The proposal further adds two boolean flags:

  • serialize_final_state -- A flag, by default set to False, which determines the behaviour at serialization time. If False, only the description of how to rebuild the dataset (raw source plus the recorded preprocessors) is serialized; if True, the fully preprocessed state is serialized.
  • initialized -- Is False if the data has not been loaded from disk or preprocessed, True if the data is usable.

Methods that access the data, such as iterator(), get_design_matrix(), etc. will call a method like

if not self.initialized:
    self.initialize()

At serialization time, if serialize_final_state is False, __getstate__ removes the necessary objects and sets the initialized flag to False.
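
A rough sketch of how these pieces could fit together (attribute and method names are assumptions, not the final API):

class LazyDataset(object):

    serialize_final_state = False

    def initialize(self):
        # Reload the raw data from disk and re-apply the recorded preprocessors.
        self.X = self._load_raw_data()
        for preprocessor in self.preprocessors_applied:
            preprocessor.apply(self)
        self.initialized = True

    def get_design_matrix(self):
        if not self.initialized:
            self.initialize()
        return self.X

    def __getstate__(self):
        state = self.__dict__.copy()
        if not self.serialize_final_state:
            # Drop the heavy arrays; keep only the recipe for rebuilding them.
            state.pop('X', None)
            state['initialized'] = False
        return state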

Advanced pickle features like __getnewargs__ may make things simpler on the deserialization side.

make expressions for published preprocessing

in datasets.preprocessing we could have functions that give you pipelines for performing the preprocessing for various published papers. So far we support the Coates/Lee/Ng "Analysis of Single Layer Networks" preprocessing, Kai Yu's preprocessing for his paper on LCC with local tangents, and Michael Gutmann's sphere preprocessing for his noise contrastive estimation paper. Bundling these with nice little shortcuts could make it easy for people to reproduce their results / compare other methods on the same kind of data.

More efficient/fault-tolerant single-file serialization

joblib.save is much faster than normal pickle and lets you recover your parameters in the case that your pickles go stale, but has the side effect of creating several files on disk instead of one, which may be undesirable for some.

Looking at a way to remedy this with a custom pickler, it seems possible but a bit trickier. From an email exchange with Ian:

The trouble is that joblib currently relies on the ability to write out
arrays to separate files asynchronously, during the (recursive) pickling
procedure. If you specify compression parameters then it will do this,

After some thought, the simplest thing to do to achieve the same speed and
fault tolerance with a single file would be to have the Pickler object store
a list of arrays, replacing them with dummy objects as it goes, then use the
pickler to pickle this munged representation to a StringIO/BytesIO object.

The wrapping "save" function would then open a file for writing, loop through
this list of arrays and repeatedly numpy.save on the same open file
descriptor. When it was done, it would call cPickle.dump(False, f) or
something to act as a sentinel, indicating that there are no more arrays.

np.load() can handle pickles as well as npy data, so unarchiving would look
like this:

import numpy

# f is a pre-existing open file descriptor
arrays = []
while True:
    arr = numpy.load(f)   # np.load falls back to unpickling for non-.npy data
    if arr is False:      # the pickled False acts as the sentinel
        break
    arrays.append(arr)
my_custom_pickler.load(f, arrays)

Now my_custom_pickler has all the information it needs to unpickle while
subbing in the arrays where appropriate (the dummy objects can also keep
track of the index in the list of arrays for which they are a stand-in).
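
For completeness, here is a rough sketch of the save side described above; using pickle's persistent_id hook is one way to get the "dummy object" substitution (the class name and file layout are assumptions, not something implemented in the library):

import numpy
import pickle
from cStringIO import StringIO

class ArrayCollectingPickler(pickle.Pickler):
    """Collect ndarrays during pickling, writing only placeholders for them."""

    def __init__(self, file, protocol=None):
        pickle.Pickler.__init__(self, file, protocol)
        self.arrays = []

    def persistent_id(self, obj):
        if isinstance(obj, numpy.ndarray):
            self.arrays.append(obj)
            return 'array-%d' % (len(self.arrays) - 1)
        return None

def save(filename, obj):
    buf = StringIO()
    pickler = ArrayCollectingPickler(buf, pickle.HIGHEST_PROTOCOL)
    pickler.dump(obj)                    # fills pickler.arrays as a side effect
    with open(filename, 'wb') as f:
        for arr in pickler.arrays:
            numpy.save(f, arr)           # append each array to the same open file
        pickle.dump(False, f)            # sentinel: no more arrays follow
        f.write(buf.getvalue())          # the array-free pickle stream goes last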

A two-file solution is also a possibility, using only one separate file.

Alas, this is a bigger project than I thought, so I'm going to defer actually
implementing this until after this round of deadlines. For now, if you want
to take advantage of joblib's fast saves/fault tolerance and don't mind
multiple files, you can use the file suffix ".joblib". An eventual
single-file format will probably have its own custom suffix as well, ".pl2"
or something.

GIL-avoiding producer-consumer data loader

Ian's complaint about Python's shitty threading support is certainly a valid one. Multiple CPU-bound threads cannot run in parallel in Python, at least not in CPython (and I don't foresee us moving to IronPython/.NET any time soon). One way around this is to use processes for the consumer/producer; the threading module and the multiprocessing module have identical APIs. Another way is to use threads but make sure the I/O-bound portion runs in code that releases the GIL.

It turns out, after some digging, that if numpy.ALLOW_THREADS evaluates true at runtime (see this section of the NumPy C API docs), then NumPy was compiled such that PyArray_FromFile (which is the C function eventually called by numpy.load) already releases the GIL before calling fread(), and reacquires it right after. This means that the costliest part can be done in a parallel thread, and we can go ahead and implement a thread-based producer-consumer model if we like. If we want more done in parallel, we can reimplement as much of this as possible in a Cython "with nogil" block.
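
A minimal sketch of the thread-based version (Python 2 module names; the prefetching scheme is made up for illustration):

import threading
import numpy
from Queue import Queue   # "queue" in Python 3

def _producer(filenames, q):
    for fname in filenames:
        q.put(numpy.load(fname))   # the fread() inside runs with the GIL released
    q.put(None)                    # sentinel: no more data

def iter_batches(filenames, max_prefetch=2):
    q = Queue(maxsize=max_prefetch)
    t = threading.Thread(target=_producer, args=(filenames, q))
    t.daemon = True
    t.start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch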

Benchmark k-means implementations. Make something competitive with MATLAB

Adam Coates' MATLAB implementation is way faster than ours or SciPy's, even though he doesn't do anything particularly fancy.

I'm in the process of changing ours to just use milk. We should benchmark milk against Adam's code (look at the objective function value, runtime, and memory consumption) and make something competitive with his. Be sure to benchmark his on a machine with several cores.

http://www.stanford.edu/~acoates/papers/kmeans_demo.tgz

Make monitor channel names predictable

Related to #49. Right now, every algorithm is going to name the key for its channel in the monitor something different. @goodfeli? @steven-pigeon?

It might make more sense to store training error and validation error under consistent keys, and attach metadata to the channel object, like the name of the algorithm class, etc.

@chandiar and @caglar are working on this currently.

Figure out how to do automatic refactoring and document it

We get lots of bugs from refactoring. Partly this is an inevitable consequence of using python, but some things should still be more automatable.

Pascal Lamblin and Guillaume think Eclipse might be able to do some automatic refactoring for Python. Also, Yann Dauphin has heard good things about an IDE called PyCharm. I got our team an open source developer license for PyCharm.

If someone would like to try out a few IDEs and figure out how well they work for automating common annoying tasks like renaming a class, it would probably be really helpful to start a page in the docs describing which IDEs are recommended and how to set them up to work with pylearn.

make Preprocessor class

The various classes in datasets.preprocessing have common functionality but no superclass to document it in.
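
Something like the following minimal interface would do; the method name and signature here are an assumption based on how the existing preprocessors are used:

class Preprocessor(object):
    """Abstract base class for the objects in datasets.preprocessing."""

    def apply(self, dataset, can_fit=False):
        """
        Modify `dataset` in place.

        `can_fit` says whether the preprocessor is allowed to estimate its
        own parameters (e.g. a whitening matrix) from this dataset, as
        opposed to reusing parameters fit on training data.
        """
        raise NotImplementedError()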

Find how to use floatX in the dataset

In commit 3086268, Ian did a quick fix to make the dataset work with floatX. But we need to think about how to do it correctly in all cases.

If we have floatX=float32 and we always use a dataset in float32, we should not waste memory by keeping it in float64. But there is also the opposite case: sometimes we want to upcast it to float64 at the start, though not always.
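
One possible direction, sketched with a made-up helper name, is to let the caller choose the dtype while defaulting to Theano's floatX:

import numpy
from theano import config

def as_dataset_dtype(X, dtype=None):
    # Default to floatX so a float32 run never carries a float64 copy around,
    # but allow an explicit dtype for callers that want to upcast.
    dtype = config.floatX if dtype is None else dtype
    return numpy.asarray(X, dtype=dtype)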

Convert all prints to use loggers and appropriate log levels.

Specifically, stick something like

import logging
...
...

logger = logging.getLogger("put.module.name.here")

at the top of the module.

Very verbose output with print statements should then be replaced with logger.debug(); less verbose but useful stuff with logger.info(); stuff that goes wrong that the user should know about even if they don't care about info() level stuff should be a logger.warning(), etc.
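
For example (module name and messages invented for illustration):

import logging

logger = logging.getLogger("pylearn2.train")

logger.debug("recomputing particle positions")    # was a very verbose print
logger.info("epoch complete, nll %f", 1.23)       # was a useful progress print
logger.warning("NaN encountered in gradients")    # was a print the user must always see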

This will allow us to easily and programmatically redirect logging output, change logging verbosity, apply common styling to all log messages (add timestamps, blah).

See the Logging Cookbook and the logging module docs.

Stricter compile time checks

See if pychecker or pylint can do stricter compile time checks, such as following imports, than are currently done by pylearn2.devtools.tests.test_via_pyflakes

AIS hasn't been run for some cases

AIS.fX has no self argument, which suggests that AIS has never been run in a mode that results in fX being called. We should fix fX and make sure whatever code path is meant to call it is working.

make Distribution class

Should go in pylearn2.distributions. Existing distributions in that module should inherit from it.

unify monitoring and callbacks

It's important to call the monitor before calling the callbacks. Factor this logic into one piece of reusable code so that we don't need to worry about each implementer of a training_algorithm screwing it up.
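
A sketch of the kind of shared helper this could become (all names here are hypothetical):

def monitor_then_callbacks(model, monitor, callbacks):
    # Monitoring must run first so the callbacks see up-to-date channel values.
    monitor()
    for callback in callbacks:
        callback(model)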

standardize on ML-specific code conventions

The "coding style" document produced out of the committee stuff did a pretty good job of things (save for the brewing disagreement over module renaming) but in order that we don't end up with think we need to discuss and standardize on things like

  • naming conventions for method/function arguments, learnable parameters, hand-set hyperparameters
  • naming conventions for classes
  • naming conventions for other variables
  • use of acronyms, greek letters, etc.

There's a fair amount of disagreement about a number of these issues, but I wanted to assemble them in one place, along with comments for and against. I've tried to keep this post neutral in tone and not colour it with my own opinion, but please add comments both expressing your view on a certain issue and correcting anything you feel I've unfairly represented.

Method/function arguments

  • Should they be named simple one letter names (X, y), something longer and more descriptive (inputs, labels), something else entirely?
  • Arguments for X, y etc.: makes complicated mathematical expressions easier to read, might correspond with usage in paper describing algorithm
  • Arguments for inputs, labels, etc.: perhaps clearer to novices, complicated expressions could be broken up in terms of temporaries

Learnable parameters

  • Should these be distinguished in some particular way, other than their membership in the _params attribute? e.g., scikits-learn uses a convention of suffixing all learnable parameter object members with a single underscore, e.g. self.weights_. Of course pylearn2 parameters will be shared variables rather than numpy arrays, so perhaps this particular convention should be avoided to prevent confusion, but should we adopt something similar?

Hyperparameters

  • Should they be named by their putative function (sparsity_penalty), by the name assigned to them in a particular paper (eta, epsilon), by some other convention? One complicating factor is that lambda is a reserved word in Python, so you can't name an object member lambda (just like you can't name it 'for' or 'else').

Other variables

  • Should any particular conventions apply?

Class Names

  • Models are typically given their usual name from the literature (DenoisingAutoencoder, ContractiveAutoencoder, BinaryGaussianRBM). Preprocessing varies between verb-phrases (ExtractPatches), noun-phrases (LocalContrastNormalization), etc. Should the object be named as a "doer" of whatever it does (e.g. PatchExtractor, LocalContrastNormalizer), the noun phrase of the accomplished processing (PatchExtraction, LocalContrastNormalization), or what?
  • Typically underscores are left off in favour of camelcase for class names (as denoted in PEP8). Should we adhere to this convention?

Acronyms

  • Certain acronyms are in wide use and perhaps unambiguous enough (e.g. RBM). Others are overloaded and potentially more opaque (e.g. SMD for "Score Matching Denoising", "Stochastic Meta-Descent", or "Stochastic Mirror Descent"). How should we deal with acronyms? Should there be a whitelist (certain acronyms are okay, others aren't)? Or should it be up to the user to dig through docstrings?

remove dependency on TheanoLinear

As per dwf's review of my convolution support pull request, we should either wrap TheanoLinear up in pylearn2 or in Theano. We should probably talk to James about what he'd like to do most.

Import public classes at the subpackage level.

We should be referring to stuff as pylearn2.datasets.DenseDesignMatrix, not pylearn2.datasets.dense_design_matrix.DenseDesignMatrix, by importing in datasets/__init__.py.

This has three effects:

a) It makes clear precisely what the public classes that are meant to be used are.
b) It gives us the freedom to move things around within the files in a given subpackage.
c) It means less to type in a lot of places.
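
Concretely, pylearn2/datasets/__init__.py would gain lines like this one (and similarly for the other public classes):

from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix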

Clear error message about pylearn1

We should detect when people don't have pylearn1 installed and raise an ImportError that explains pylearn2's dependence on pylearn1, and possibly says where to download it. This error message should only be raised when we actually attempt to import pylearn.
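
A sketch of the guarded import (the message wording is just a suggestion):

try:
    import pylearn
except ImportError:
    raise ImportError("pylearn2 depends on pylearn (v1) for this functionality. "
                      "Please install it and make sure it is on your PYTHONPATH.")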

problem in viewing the weights of the GRBM

Hello,
I am trying to check the weights of the GRBM, but it seems there is a bug: self.weights is never assigned.
You can check the output pickle file from training; get_weights() raises an error.
Thanks

Compile time checks

Figure out some way of automating compile time checks so that really simple bugs like incomplete refactoring updates don't require as much of a unit test burden to catch.

This should be possible; after all, vim has syntax highlighting plugins that do it. We just need an interface to that same functionality that lets us scan the whole library and emit a list of errors.

Pascal Lamblin says pylint (sp?) might be able to do this.

One stochastic gradient to rule them all

Unify SGD and UnsupervisedBlahSGD.

I floated StochasticGradient for the name, but UnsupervisedStochasticGradient or UnsupervisedSGD might be more descriptive.

Are supervised and unsupervised tasks similar enough to unify them transparently? I suppose most of that logic can be pushed to the cost function.

Let's concentrate on the unsupervised case for now.

linear docstrings

If you're overriding a method you can suck the superclass docstring into the subclass by using the @functools.wraps(SuperClassName.method_name) decorator.
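
For example (the class and method names are placeholders):

import functools

class Base(object):
    def get_weights(self):
        """Return the model's weight matrix as a numpy array."""
        raise NotImplementedError()

class MyModel(Base):
    @functools.wraps(Base.get_weights)
    def get_weights(self):
        # functools.wraps copies __doc__ (and __name__) from Base.get_weights,
        # so this override inherits the superclass docstring.
        return self.weights.get_value()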

train_example unit test

Make a unit test based on scripts/train_example

It should do basically the same thing as "train.py cifar_grbm_smd.yaml" but with these changes:

  • Don't use the real dataset. The preprocessing is expensive, and cifar10 is a reasonably big download; we don't want to force everyone who uses the library to download cifar10. Randomly generated data might be ok, or maybe we could host a .npy file with a few hundred thousand examples somewhere that the unit test can download if the file isn't found.
  • Set save_freq to 0, so the test doesn't modify the filesystem.

The test should work by checking that the initial and final values in the monitor are pretty close to the values obtained by some original version of the code that we believe to be working. It should also check the length of the monitor to make sure convergence is happening at the same rate.

One question about how to implement this test: do we want it to load the yaml file and then modify the object to use different data? This would require adding an interface for changing the dataset of a Train object. An alternative is to manually update the test to remain similar to the yaml file, but it seems like it would be easy for the test and the yaml to get out of sync this way.
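
Whichever way the dataset question is resolved, the comparison itself could look roughly like this (EXPECTED holds values recorded from a known-good run; the channel and attribute names are assumptions about the Monitor interface):

import numpy

def check_against_reference(monitor, expected, rtol=1e-3):
    for name, ref_values in expected.items():
        channel = monitor.channels[name]
        values = numpy.asarray(channel.val_record)
        # Same number of monitoring steps => convergence at roughly the same rate.
        assert len(values) == len(ref_values)
        # Initial and final values close to the known-good run.
        assert numpy.allclose(values[0], ref_values[0], rtol=rtol)
        assert numpy.allclose(values[-1], ref_values[-1], rtol=rtol)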

modularize SGD training algorithm

Things are a bit tightly coupled in there, and it might be worth thinking about other use cases (i.e. where people want to reuse some of our nitty gritty SGD logic but "be their own driver", etc.). I think we'll see more uptake of pylearn2 if you can choose with a bit more granularity "how much & what for" you want to use the library.

High-level abstractions are good, but high-level with a medium-level option is better.

  • See if we can pull it apart so different aspects can be used in isolation, while maintaining the "simple recipe" for training that is currently possible via yaml experiment descriptions
  • See if we can leverage anything from the previously written SGDOptimizer in optimizers.py before we just trash it

clean up sgd; make term_crit module?

Lots (all?) of the termination criteria in training_algorithms.sgd are not specific to the sgd training algorithm. These should be put in a separate module somewhere so that it's easier to find them by browsing the code tree / to make it more obvious that they could be used with other training algorithms.
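
A sketch of the shared interface such a module could expose (the method name is a guess at what would fit the existing sgd code):

class TerminationCriterion(object):
    """Decides when a training algorithm should stop."""

    def continue_learning(self, model):
        """Return True if training should continue, False to stop."""
        raise NotImplementedError()

class EpochCounter(TerminationCriterion):
    """Stop after a fixed number of calls (i.e. epochs)."""

    def __init__(self, max_epochs):
        self._max_epochs = max_epochs
        self._epochs_done = 0

    def continue_learning(self, model):
        self._epochs_done += 1
        return self._epochs_done <= self._max_epochs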
