mila-iqia / blocks
A Theano framework for building and training neural networks
License: Other
In the previous version, the documentation of application methods was preserved by means of the wraps decorator. Now we simply lose it.
@bartvm, can you please fix it somehow? The core code is currently yours and you know best how to do it.
For debugging and for tutorials it would be very nice to have a routine like theano.printing.debugprint that can print the brick hierarchy in textual form. It should gracefully handle the DAG case, where a brick is a child of several others. Optionally one could also output an image using a graph visualization library such as graphviz.
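A minimal sketch of what such a routine could look like, assuming bricks expose name and children attributes (the ToyBrick class here is purely illustrative); a brick that appears under several parents is expanded only the first time:

```python
def format_hierarchy(brick, indent=0, seen=None):
    """Render a brick hierarchy as indented lines, handling the DAG case.

    Assumes each brick has `name` and `children`; a brick already shown
    is marked instead of being expanded again.
    """
    if seen is None:
        seen = set()
    if id(brick) in seen:
        return ['  ' * indent + brick.name + ' (shown above)']
    seen.add(id(brick))
    lines = ['  ' * indent + brick.name]
    for child in brick.children:
        lines.extend(format_hierarchy(child, indent + 1, seen))
    return lines


class ToyBrick:
    """Stand-in for a real brick, just for demonstration."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


shared = ToyBrick('linear')
mlp = ToyBrick('mlp', [shared, ToyBrick('softmax', [shared])])
```

Printing '\n'.join(format_hierarchy(mlp)) would then show linear indented under both mlp and softmax, with the duplicate occurrence marked rather than re-expanded.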
Using logging has some advantages, such as being able to redirect output to a file (in a more refined way than just dumping stdout), applying filters, silencing, etc. I came across this when I wanted to silence the output during tests.
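For example, with the standard logging module (the logger name 'blocks' is an assumption; a library would typically log under its package name):

```python
import io
import logging

logger = logging.getLogger('blocks')
stream = io.StringIO()               # stands in for a file or stderr
handler = logging.StreamHandler(stream)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('epoch 1 done')          # recorded

# Silence the whole 'blocks' hierarchy, e.g. during tests:
logger.setLevel(logging.CRITICAL + 1)
logger.info('epoch 2 done')          # suppressed
```

Swapping the StreamHandler for a FileHandler redirects everything to disk, and logging.Filter objects can be attached to drop individual channels.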
A typical _push_initialization_config simply modifies weights_init and biases_init of the children bricks. It seems that it would be nice to separate this behaviour into a mixin brick called, let's say, Initializable.
Whoever implements it should probably also take care of pushing rng and theano_rng down the hierarchy.
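A rough sketch of what such a mixin could look like; the attribute names follow the issue, but everything else (the children list, the recursion) is an assumption about the brick interface:

```python
class Initializable:
    """Hypothetical mixin: push initialization settings down to children.

    Child settings that are already set explicitly are left untouched;
    only unset (None) attributes inherit the parent's value.
    """
    def __init__(self, weights_init=None, biases_init=None, rng=None):
        self.weights_init = weights_init
        self.biases_init = biases_init
        self.rng = rng
        self.children = []

    def _push_initialization_config(self):
        for child in self.children:
            for attr in ('weights_init', 'biases_init', 'rng'):
                if getattr(child, attr, None) is None:
                    setattr(child, attr, getattr(self, attr))
            if hasattr(child, '_push_initialization_config'):
                child._push_initialization_config()
```

With this shape, a brick only opts into the weights_init/biases_init/rng machinery by mixing in Initializable, instead of every brick carrying it.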
This is just a bunch of ideas that I think would be nice for monitoring. I want to start working on some optimization stuff, for which it would be handy if I can monitor a range of different things, many of which Pylearn2 doesn't support right now.
We need to keep in mind that Theano variable monitor channels need to be collected and compiled as a group (so that they can share a large part of the computational graph). Likewise, if we monitor the progress of each batch (useful for very long-running experiments like machine translation) it would be best if we can compile the monitoring value together with the update, so that they too can share the computational graph.
Instead of having check_theano_variable scattered throughout the code, I think we should leave those checks out for now, until we implement the error-checking mechanism we talked about at the beginning.
I haven't thought about the details yet, but I think something like this would be nice:
from itertools import groupby

@apply.check_inputs
def apply_check(self, input_name, input):
    if input_name == 'states':
        assert [key for key, _ in groupby(input.axes)] == ['time', 'batches', 'features']
        assert input.axes.count('batches') == 1
        assert input.ndim >= 3
        assert 'float' in input.dtype
This would be called automatically for all the inputs (and likewise for outputs?) by the Application class somewhere in __call__. It would catch the AssertionError and turn it into something useful like:

The input {} (index {}) provided to the application method {} of brick {} is invalid.
It was the output of application method {} of brick {}.
The assertion error that was raised was:
...
To turn off these checks, please set blocks.config.check_inputs = False
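A sketch of how the wrapping itself might look; the decorator factory, the error class, and the enabled flag are all assumptions about machinery that does not exist yet:

```python
import functools


class InputCheckError(Exception):
    """Raised when a user-supplied input check fails."""


def with_input_checks(check, checks_enabled=True):
    """Wrap an application method so `check` runs on every named input."""
    def decorator(apply_method):
        @functools.wraps(apply_method)
        def wrapper(self, **inputs):
            if checks_enabled:
                for name, value in inputs.items():
                    try:
                        check(self, name, value)
                    except AssertionError as error:
                        raise InputCheckError(
                            'The input {} provided to {} of brick {} is '
                            'invalid: {}'.format(name, apply_method.__name__,
                                                 type(self).__name__, error))
            return apply_method(self, **inputs)
        return wrapper
    return decorator


def integer_check(self, name, value):
    assert isinstance(value, int), 'expected an integer'


class ToyBrick:
    @with_input_checks(integer_check)
    def apply(self, x):
        return x + 1
```

The real version would read the enabled flag from blocks.config.check_inputs and fill in the full error template above, but the try/except shape would be the same.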
See subject. It is using tensor.nnet.softmax, which raises the following error:

ValueError: ('x must be 1-d or 2-d tensor of floats. Got ', TensorType(float64, 3D))

When this is fixed, it could be used in sequence_generators.py.
Seems like we need a new abstraction: a brick with a single input and a single output that supports setting the input and output dimensions, i.e. has writeable properties input_dim and output_dim, or has set_input_dim and set_output_dim methods. I suggest creating an abstract brick Feedforward for it.
Motivation: previously I used MLPs everywhere for intermediate transformations and configured them with mlp.dims[0] = smth. Now we also have LinearMaxout, and later we will have DaisyChain. There should be a shared configuration interface for all these guys.
Opinions?
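A sketch of the proposed interface; the names input_dim/output_dim come from the issue, while the MLPLike class and its dims mapping are hypothetical stand-ins:

```python
class Feedforward:
    """Sketch of an abstract single-input, single-output brick interface."""

    @property
    def input_dim(self):
        raise NotImplementedError

    @input_dim.setter
    def input_dim(self, value):
        raise NotImplementedError

    @property
    def output_dim(self):
        raise NotImplementedError

    @output_dim.setter
    def output_dim(self, value):
        raise NotImplementedError


class MLPLike(Feedforward):
    """An MLP would map input_dim/output_dim onto its dims list."""
    def __init__(self, dims):
        self.dims = dims

    @property
    def input_dim(self):
        return self.dims[0]

    @input_dim.setter
    def input_dim(self, value):
        self.dims[0] = value

    @property
    def output_dim(self):
        return self.dims[-1]

    @output_dim.setter
    def output_dim(self, value):
        self.dims[-1] = value
```

Generic configuration code could then write brick.input_dim = n without knowing whether it is dealing with an MLP, a LinearMaxout, or a DaisyChain.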
Now that we have an initial version of datasets, the next challenge: Serialization and preprocessing. This is a pretty difficult problem to tackle, so it's worth some thought.
Serialization is difficult because we don't want to pickle the data each time we save the model (e.g. there is no point in writing the same MNIST to disk time and time again). Not saving the data and loading it from disk when deserializing is the obvious solution, but it comes with pitfalls: if the data isn't where it was originally, deserialization will fail and our saved model is now useless.
DWF gave this some thought for Pylearn2, so let me quote from his outline of what would be a solution:
Automatically loading from disk on unpickling causes unpickling failure if the data has moved or if the path doesn’t exist on that machine. Don’t do one as a consequence of the other automatically.
Data loading should be overridable.
The lowest common denominator for every kind of dataset is to provide a directory path for loading or saving compatible disk representations. The load path should be specifiable via the constructor, and alterable by a method call.
Data loading should be lazy.
To that end, data should be fetched from data on an as-needed basis. [...] In particular, no additional disk access should be triggered by unpickling: this gives the user the opportunity to modify the data load path, if necessary. This prevents a lot of nasty situations where you cannot unpickle something and you don’t have permissions to create the data paths necessary in order to make it unpickle cleanly.
A separate problem, but one quite closely integrated with serialization, is preprocessing. If we want to support automatic preprocessing such as whitening, ZCA, contrast normalization, etc. there are a few things that should be considered:
To quote from DWF again:
Different facilities for in-memory and out-of-memory data; in-code awareness of which you’re dealing with.
Whole-dataset preprocessors should be able to know that they are operating on an in-memory dataset, and fail if they don’t know how to do something intelligent with a certain kind of out-of-memory dataset.
Keep records of transformations done on data.
It should be possible to regenerate preprocessed data from the default loading path and the log of preprocessing operations upon deserialization. Of course generic containers in the spirit of “DenseDesignMatrix” needn’t provide a default loading path, but given a deserialized object, the load -> (optionally) fit preprocessor -> apply preprocessor should be a simple method call away.
❗ However, I think the best point made is at the end of his document:
Is it really worth having the in-{Python,YAML} preprocessor facilities?
It seems like this is only of any use for relatively small, in-memory datasets. For large datasets (or even small datasets where you value reproducibility, as different machines will give you different ZCA results, for example), we could support it with a script like:

python preprocess.py \
    --config <description-of-preprocessing> \
    --data_output_path <directory> \
    --output_fitted_preprocessor <pickle file>
You could also add arguments for preprocessing from an already fitted preprocessor object, etc. To use a preprocessed dataset you’d then use the corresponding dataset class and override the load path in the constructor. Viewer code could be passed the serialized pickle file to know how to undo transformations and things like that.
It sounds to me like the pre-processing serialization, etc. would be a real nightmare, and I would argue for using this approach for all datasets, in-memory or not. Pre-processing then basically happens as an entirely separate step, through a script that takes as input particular datasets (NumPy arrays, HDF5, CSV, etc.) and a configuration file describing what preprocessing to apply. As output it produces a new dataset as well as a (pickle?) file which describes exactly what preprocessing has been performed, allowing you to reproduce it and/or reverse it.
We are missing an explanation of why we need those (inputs, outputs, states, etc.) and how they work.
The docstrings aren't lost, but Sphinx sees that the application methods are actually class instances, so it shows their signature as e.g. apply instead of apply(x, y). A solution is proposed here, but it doesn't seem too straightforward.
Bart, what do you mean by output_mode?
It would be nice to have an example, on the dataset everybody is familiar with, showcasing the way we separate the model constructor and regularization.
@sherjilozair, you wanted to contribute that?
In Groundhog a Layer typically had two references to a parameter (a shared variable):
self.W
self.params
A dangerous side of doing it this way is that to change a parameter you have to change it in two places.
If we want to stay away from this and only keep references to parameters in self.params (which I guess is what you want, judging from your code), we need a property for every parameter to provide convenient access. Such properties would refer to a specific position in the parameters list (see the W property that the Recurrent brick has).
But what if I have a varying number of parameters depending on the configuration of my brick? An example is the GatedRecurrent brick, which I plan to commit soon. We should be able to not use either of the update/reset gates. Thus the parameter's position in the list is no longer fixed.
What I propose is to keep a fixed position for every possible parameter of a brick and have None placeholders for those that are not used in the current configuration. What do you think?
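A toy sketch of the proposal, with strings standing in for shared variables; the class and parameter names are hypothetical:

```python
class GatedRecurrentLike:
    """Sketch: fixed parameter positions with None placeholders.

    Positions are fixed by convention:
      0 = state-to-state weights, 1 = update gate, 2 = reset gate.
    Unused parameters keep their slot but hold None.
    """
    def __init__(self, use_update_gate=True, use_reset_gate=True):
        self.params = [
            'W_state',
            'W_update' if use_update_gate else None,
            'W_reset' if use_reset_gate else None,
        ]

    @property
    def state_to_state(self):
        return self.params[0]

    @property
    def update_gate(self):
        return self.params[1]

    @property
    def reset_gate(self):
        return self.params[2]
```

The properties keep working regardless of the configuration, and code that iterates over params just has to skip the None entries.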
for i, state_below in enumerate(states_below):
    states_below[i] = state_below.copy()
    # TODO Copy any required tags here
    states_below[i].name = self.name + '_input_{}'.format(i)
Why do you do that?
Bart, I think state_below is a historical name that should follow fprop into the void. Do you agree that we could call it 'inputs'?
And another thing: I do not see the point in having the apply method process several variables homogeneously in one call. Many apply calls seem better to me.
Subj.
Somebody should make training the SequenceGenerator from PyLearn2 possible. I guess we need some new wrappers to do that.
During these two days I got the impression that our core, with its heavy usage of decorator syntax and descriptors, is pretty hard to get started with. It would be very nice to have a concise document that explains what the advanced Python features we rely on are and where one can get more information about them.
It turns out the syntax for declaring a metaclass changed in Python 3. That's why our abstract methods can be called in Python 3: they are not in fact abstract!
For more detail see the link below:
http://stackoverflow.com/questions/18513821/python-metaclass-understanding-the-with-metaclass/18513858#18513858
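A short illustration of the pitfall, written in Python 3 syntax; on a shared 2/3 codebase one would use six.with_metaclass or @six.add_metaclass instead:

```python
import abc

# Python 2 style: the metaclass is declared via a class attribute.
# Python 3 treats __metaclass__ as an ordinary attribute and ignores it,
# so ABCMeta is never applied and nothing is enforced.
class BadBrick(object):
    __metaclass__ = abc.ABCMeta

    @abc.abstractmethod
    def apply(self):
        pass


# Python 3 style: the metaclass goes in the class header, so instantiating
# a class with unimplemented abstract methods raises TypeError.
class GoodBrick(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def apply(self):
        pass
```

Under Python 3, BadBrick() succeeds and its "abstract" apply can be called, which is exactly the bug described above; GoodBrick() raises TypeError as intended.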
They still have the old interface. I noticed that several of @rizar's examples use custom cost application methods; we should probably avoid this in general, so that we don't end up effectively implementing MSE, cross-entropy, etc. in every brick, and of course, many bricks can be trained using a variety of costs.
The recent post from Jan in #85, revealing his vision of how monitoring should work, made me think further about what kind of main loop we want to end up with. If we take the old Groundhog and Pylearn2 main loops and push almost everything out of them, as Jan suggests, we end up with a tiny skeleton managing a set of callbacks. In this way we are moving towards an event-based framework instead of one with a fixed interaction scenario, even though that sounds a bit scary at first sight.
Another argument to show that we are very close to introducing events: in Montreal we discussed the interface between the datasets and the rest of the library. We converged on an object, which Jan calls "a view" and I call "a data stream", that creates an epoch iterator, which in turn creates a batch iterator. Whereas this covers many common cases like MNIST (when an epoch is a pass over the dataset) and nearly endless datasets like the one we used in machine translation (where an epoch can be defined as, e.g., 100 batches), it does not scale: if I am training a skip-gram model and I want to do something when I have processed all n-grams from a sentence, I have to declare a sentence an epoch, which might not be a good decision in other respects. If I want arbitrarily many time scales (word, sentence, paragraph, document, book), it becomes very challenging unless a notion of an event is defined. In fact, Jan already argued we need it, but he proposed to mix events with the data itself, which was not embraced by Bart and myself.
So much for the introduction; here comes the point: let's look at the main loop as a scheduler that triggers certain handlers when certain events happen.
Examples of events include:
When we have multiple events happening simultaneously (e.g. a tick and an epoch), the order of their processing should somehow be determined. We could have a priority associated with every event, e.g. 0 for a tick, 1 for a keyboard interrupt, -1 for an epoch. Events with a higher priority would be handled before events with a lower priority.
Also, we will typically have a few handlers for every event, and these should be executed in order. Recall Jan's example: we want to save after validation. To do that we assign a chain of two handlers to the EpochEndEvent: validation followed by saving.
There might be a need for interaction between handlers. E.g. if we want to finish the epoch and terminate after a keyboard interrupt is received, two actions need to be done for two different events: after the interrupt we should set a flag, which should then be checked at the end of each epoch. The backend monitor proposed by Jan can accommodate this flag. In fact this monitor is much more than a logging mechanism: it is a persistent time-indexed array through which the components of the system can interact with each other, and through which the user can access the whole interaction history.
It is possible that handlers will generate events as well. E.g. the EpochEndEvent will be triggered by a handler of the TickEvent, the main one, responsible for fetching data and doing a descent step. To allow that, every single component of the system must have access to the main loop object. It looks a bit weird when even iterators have a link to the main guy, but perhaps this is the price to pay.
This text should obviously be followed by lots of examples, but for now I will just share these raw thoughts, since I do not know when I will have time to write some code. The main point is: the generality we aim for requires complete decentralization, and for that I think we have to think in terms of events, not in terms of a fixed algorithm with a few hooks.
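To make this concrete, here is a minimal sketch of such a scheduler; every name here (MainLoop, register, trigger, the priority values) is an assumption, not an existing Blocks API:

```python
import heapq
from itertools import count


class MainLoop:
    """Sketch of an event-driven main loop.

    Handlers form an ordered chain per event name; when several events
    are pending simultaneously, higher-priority events are dispatched
    first (ties broken by trigger order).
    """
    def __init__(self):
        self.handlers = {}       # event name -> list of handlers, in order
        self.priorities = {}     # event name -> priority
        self.pending = []        # heap of (-priority, sequence, event)
        self._sequence = count()

    def register(self, event, handler, priority=0):
        self.handlers.setdefault(event, []).append(handler)
        self.priorities[event] = priority

    def trigger(self, event):
        heapq.heappush(self.pending,
                       (-self.priorities.get(event, 0),
                        next(self._sequence), event))

    def dispatch(self):
        while self.pending:
            _, _, event = heapq.heappop(self.pending)
            for handler in self.handlers.get(event, []):
                handler(self)    # handlers may trigger further events
```

With this shape, Jan's save-after-validation example is just two handlers registered on the epoch-end event, and a keyboard interrupt is a high-priority event whose handler merely sets a flag for the epoch-end handler to check.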
Write a setup.py that will allow blocks to be installed using e.g. pip
Make flake8 on Travis run separately, so that it's clear straight away whether the error is due to PEP8 or due to failing tests.
Stacked RNNs are so trendy nowadays that I believe somebody will need them soon. I think the right way to support them in Blocks is a RecurrentStack brick that takes a list of transitions and wraps them into one.
We should check the incoming request in MNIST.get_data. I spent an hour trying to understand what was going on when I simply passed it wrong indices. The challenge is that the request can be an index, a sequence of indices, or a slice, and all these cases need to be handled by different code.
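A sketch of what such an up-front check might look like; the helper name and the error messages are assumptions:

```python
def check_request(request, num_examples):
    """Validate a get_data request before any data access happens.

    A request may be a single index, a sequence of indices, or a slice;
    each case needs its own check.
    """
    if isinstance(request, slice):
        start, stop, _ = request.indices(num_examples)
        if start >= stop:
            raise ValueError('empty or reversed slice: {}'.format(request))
    elif isinstance(request, (list, tuple)):
        bad = [i for i in request if not 0 <= i < num_examples]
        if bad:
            raise ValueError('indices out of range: {}'.format(bad))
    elif isinstance(request, int):
        if not 0 <= request < num_examples:
            raise ValueError('index {} out of range [0, {})'
                             .format(request, num_examples))
    else:
        raise TypeError('unsupported request type: {}'
                        .format(type(request).__name__))
```

Failing fast with an explicit message like this would have saved the hour of debugging described above.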
For true "checkpointing" i.e. allowing the starting and stopping of training, we need to use a serialization method like pickle instead of dumping the parameters to a NumPy array like GroundHog does (which loses the states of all RNGs, both in the model as well as the dataset iterators). Pylearn2 does this reasonably well, but it has a few shortcomings, mostly regarding the pickling of the dataset iterators, which means that for many complex cases there is still randomness when restarting.
I think that the most robust way to do this is to make sure all classes can be made pickle-able. In most cases that won't require any work, but for datasets it means that we need to make sure that we don't pickle the actual data, but that we do pickle the exact way post-processing was performed (whitening, ZCA, shuffling). I think Pylearn2 originally considered doing it this way, but then they decided that it was too much work and stopped pickling datasets and iterators altogether. Now it's come back to bite them, because we really need checkpointing for running on the Helios cluster (which requires jobs to be restarted every 12 hours), so they're working on fixing this in Pylearn2 now.
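One standard way to pickle a dataset without its data is to drop the underlying array in __getstate__ and reload it lazily on first access; a minimal sketch (the class name, load method, and path attribute are all hypothetical):

```python
import pickle


class PickleableDataset:
    """Sketch: pickle everything about a dataset except the data itself."""

    def __init__(self, path):
        self.path = path
        self._data = None          # loaded lazily, never pickled

    @property
    def data(self):
        if self._data is None:
            self._data = self.load(self.path)
        return self._data

    def load(self, path):
        # Stand-in for reading e.g. MNIST from disk at `path`.
        return list(range(10))

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_data'] = None      # do not serialize the data
        return state
```

Because unpickling restores _data as None and triggers no disk access, the user gets a chance to fix self.path on the restored object before the first access to data, exactly as DWF's outline requires.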
Currently they are all in a top-level folder tests. But as we grow we will need more of them. I think it is OK to have them scattered across the repository. What do you guys think?
#115 breaks it by changing the way activation classes are serialized.
Perhaps we should just not use class generation? Apart from the serialization problems, it is quite hard to read for people with limited Python knowledge.
I think that DefaultRNG should be a separate class, to avoid a diamond-shaped inheritance pattern. I am now playing with the RNN implementation and I don't feel it's right to make the basic RNN class a descendant of DefaultRNG, because theoretically one could use the RNN framework I am writing for non-parametric transformations that do not require a random number generator to initialize parameters.
What do you think?
Just to recap, this is the way it is done in Pylearn2, which I think is a good start:
SubsetIterator is basically an iterator that churns out example indices in a variety of ways: sequential batches, random batches, random sequential batches, forcing batches to always be the same size (needed for convolutions), etc. When StopIteration is raised, the epoch is assumed to end.
FiniteDatasetIterator basically glues the SubsetIterator and the Dataset together; its main task is to provide the space conversions and to retrieve the data from the Dataset. It does this by assuming that the data can be represented as an object which supports indexing (including slices).
Dataset is in charge of loading the data and provides hooks for pre-processing (ZCA, etc.). It has a method (iterator) which uses the data to construct and return an instance of FiniteDatasetIterator. The FiniteDatasetIterator is what implements Python's iteration scheme in the end (calling the SubsetIterator in turn).
There are a few assumptions underlying this design that are problematic: it assumes datasets are finite, and it makes some assumptions about the way the data can be represented (mostly as in-memory arrays).
A few cases in which this is problematic:
So, a few ideas that I think could help mitigate these problems:
The SubsetIterator needs to be optional; an infinite dataset would look like:

class Gaussian(InfiniteDataset, DefaultRNG):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __next__(self):
        return self.rng.normal(self.mean, self.std)

The Dataset has full access to the SubsetIterator, so it can do e.g. buffering if it wants to. This also means that the dataset is in complete control of how it represents data in memory.

def __iter__(self):
    batches_iterator = cycle(subset_iterator)
    # put these in a multiprocessing queue that always creates batches in advance

def __next__(self):
    return self.queue.get()  # return from the queue

A Dataset doesn't have to support all iteration schemes. If e.g. word2vec is implemented as a sequential reading of text files (as in the original implementation), then it doesn't need to support shuffling.
A Dataset should always be infinite, even for finite datasets. Monitoring for finite datasets can simply be done whenever num_examples_seen % dataset.num_examples == 0. This removes any confusion as to what constitutes an "epoch"; the only unit of measurement will be "number of training samples seen".
I'm not going to try and summarize the discussion at NIPS, but just wanted somewhere to collect any further discussion. Also, see @rizar's Gist which attempts to outline the API: https://gist.github.com/rizar/cc62fb9d6270f9856793
Did anyone write more code than this? Else I can start to try adapting the dataset code I've written so far to make it fit in with the new design.
...such as Emitter and Feedback.
The file https://github.com/bartvm/blocks/blob/master/blocks/bricks/recurrent.py contains a tool to build recurrent bricks: the recurrent wrapper. It lacks test coverage and documentation. For instance, the order in which sequences, states, and contexts are to be returned should be explained. The wrapper should be covered by tests independently of the example bricks (Recurrent, GatedRecurrent).
@orhanf, this might be a task for you.
It does not have the state parameter; loading happens in __init__.
Implement attention mechanism and port machine translation.
Work in progress; progress can be seen in the branch https://github.com/bartvm/blocks/tree/attention
We could use https://github.com/jaberg/skdata to download popular datasets and even to load them into memory.
I noticed that the MLP class actually didn't work with lazy initialisation (because len(activations) gave an error when activations = None). We should probably add a unit test which tries to initialise each Brick subclass without arguments, to see if they are lazy-loading compatible.
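A sketch of such a test, using a toy hierarchy in place of the real Brick tree; the class names below are illustrative only:

```python
def all_subclasses(cls):
    """Yield every (transitive) subclass of `cls`."""
    for sub in cls.__subclasses__():
        yield sub
        yield from all_subclasses(sub)


def check_lazy_initialisation(base):
    """Try to construct each subclass with no arguments at all."""
    failures = []
    for cls in all_subclasses(base):
        try:
            cls()
        except TypeError as error:
            failures.append((cls.__name__, str(error)))
    return failures


# Toy hierarchy standing in for Brick and its subclasses:
class Brick:
    pass


class LazyMLP(Brick):
    def __init__(self, activations=None):
        # Guard against len(None): the bug described above.
        self.num_layers = len(activations) if activations is not None else 0


class EagerMLP(Brick):
    def __init__(self, activations):       # not lazy-compatible
        self.num_layers = len(activations)
```

Running check_lazy_initialisation over the real Brick base class in the test suite would flag every brick whose constructor cannot tolerate missing arguments.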
Hello, I love it that this project is building higher level abstractions on top of Theano. That should be really useful. What I'm more interested in though is what Blocks provides (or aims to provide) that Pylearn2 doesn't. :)
Title is self-explanatory!
It's quite a bit faster on GPU
The annotated computation graph produced by a brick hierarchy contains a lot of very useful information. One could extract from it all application calls and the order in which they happened. This information would be super helpful for debugging, e.g. one could see that a returned value was never used at some point.
Basically one needs to do two things:
I've just noticed that there is no way to add None to outputs_info in your recurrent wrapper.
Maybe we should add a None for every output from application.outputs which is not a state?
So Yoshua and Myriam really want the attention mechanism in Pylearn2. I told them that it would be silly to duplicate the effort by re-implementing everything which is in Blocks in Pylearn2, so ideally we want a nice wrapper that allows people to use block-models (attention-mechanism ones in particular) in Pylearn2. I think that if necessary we can assign some CCW tickets i.e. assign people to work on changes in Pylearn2 that we need to make things compatible.
Theano is already Python 3 compatible, and Pylearn2 will probably be very soon. I think it's easiest to make blocks Python 3 compatible from the very beginning, because making it compatible later on might be a lot more difficult.
I've never maintained a project that supports both Python 2 and 3 at the same time, but purely based on what I've read and seen, I suggest we use a single common codebase supporting both Python 2.7 and 3.4, using six whenever needed (but as little as possible). This is the way NumPy, Scipy, matplotlib, sympy, etc. have all done it.
@rizar Comments?
A design choice we should make: should there be a base class for recurrent bricks?
First I thought it should be there, then I deleted it because there was no logic in it. Now it seems we could put a default implementation of the initial_state method in it, which would make it more transparent to the user which method should be overridden in order to change the initialization behaviour. Such a default implementation would query the state dimensions via the self.dimension(state_name) interface.
What do you think?
That's a really minor thing, but anyway: when I write IsotropicGaussian(0.01) I expect it to have zero mean and standard deviation 0.01, whereas it is currently the other way round. Should I fix that?
I've just realized that we cannot introduce changes into a scan node via theano.clone. See https://gist.github.com/rizar/ea6aa6e750e1a9548703
That casts serious doubt on the ambitious plan to implement all sorts of regularization as modifications of a ready cost computation graph (which I liked so much). Maybe I should create a Theano issue for this?
We currently have a situation where we want a set of methods of all the derived classes of Brick to be wrapped by methods defined in the Brick base class, e.g. _initialize, _allocate, _push_allocation_config, etc. are all wrapped by the methods initialize, allocate, ....
There are a few ways of doing this. The way we do it right now is by calling the public method allocate on the base class, which internally calls the non-public method _allocate defined on the child classes. This works, but personally I am not a fan of non-public methods being the main features of bricks. It might also be confusing to people (and I think ugly) that they need to define _initialize but call initialize; if they override initialize directly, nothing will work. The current approach also doesn't allow us to document the initialize method of each brick separately.
I was wondering if it would not be better to switch to using decorators here. We use them quite heavily already, but I think this is exactly what they are made for: wrapping methods simply, and explicitly. So instead we would have
class SomeBrick(Brick):
    @lazy
    def __init__(self, dim):
        self.dim = dim

    @application
    def apply(self, x, y):
        """Applying this brick does a, b and c."""
        ...

    @initialization
    def initialize(self):
        """Initialize the 3 weight matrices of this block."""
        ...

    @allocation
    def allocate(self):
        """Allocate the parameters. Note that `dim` must be set before calling this."""
        ...
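For illustration, here is one way such an allocation decorator could be implemented; the bookkeeping shown (an allocated flag, a params list, idempotent re-allocation) is an assumption about what the base class currently does around _allocate:

```python
import functools


def allocation(method):
    """Hypothetical decorator wrapping a brick's allocate method.

    Performs the shared bookkeeping before and after the brick-specific
    body, while functools.wraps keeps the brick's own docstring visible
    (e.g. to Sphinx).
    """
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, 'allocated', False):
            return                     # allocating twice is a no-op
        self.params = []
        result = method(self, *args, **kwargs)
        self.allocated = True
        return result
    return wrapper


class SomeBrick:
    def __init__(self, dim):
        self.dim = dim

    @allocation
    def allocate(self):
        """Allocate the single weight matrix of this brick."""
        self.params.append('W_{}'.format(self.dim))
```

Note that the brick author now defines and documents allocate itself, with no public/non-public method pair, which addresses both objections above.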
Title speaks for itself!