mila-iqia / blocks
A Theano framework for building and training neural networks
License: Other
In the previous version, the documentation of application methods was preserved by means of the wraps decorator. Now we simply lose it.
@bartvm, can you please fix it somehow? The core code is currently yours and you know best how to do it.
For debugging and for tutorials it would be very nice to have a routine like theano.printing.debugprint that can print the brick hierarchy in textual form. It should gracefully handle the DAG case, where a brick is a child of several others. Optionally one could also output an image using a graph visualization library such as graphviz.
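A minimal sketch of what such a routine could look like, assuming bricks expose name and children attributes (the ToyBrick class here is purely illustrative); a brick that appears under several parents is expanded only the first time:

```python
def format_hierarchy(brick, indent=0, seen=None):
    """Render a brick hierarchy as indented lines, handling the DAG case.

    Assumes each brick has `name` and `children`; a brick already shown
    is marked instead of being expanded again.
    """
    if seen is None:
        seen = set()
    if id(brick) in seen:
        return ['  ' * indent + brick.name + ' (shown above)']
    seen.add(id(brick))
    lines = ['  ' * indent + brick.name]
    for child in brick.children:
        lines.extend(format_hierarchy(child, indent + 1, seen))
    return lines


class ToyBrick:
    """Stand-in for a real brick, just for demonstration."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


shared = ToyBrick('linear')
mlp = ToyBrick('mlp', [shared, ToyBrick('softmax', [shared])])
```

Printing '\n'.join(format_hierarchy(mlp)) would then show linear indented under both mlp and softmax, with the duplicate occurrence marked rather than re-expanded.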
Using logging has some advantages, such as being able to redirect output to a file (in a more refined way than just dumping stdout), applying filters, silencing, etc. I came across this when I wanted to silence the output during tests.
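For example, with the standard logging module (the logger name 'blocks' is an assumption; a library would typically log under its package name):

```python
import io
import logging

logger = logging.getLogger('blocks')
stream = io.StringIO()               # stands in for a file or stderr
handler = logging.StreamHandler(stream)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('epoch 1 done')          # recorded

# Silence the whole 'blocks' hierarchy, e.g. during tests:
logger.setLevel(logging.CRITICAL + 1)
logger.info('epoch 2 done')          # suppressed
```

Swapping the StreamHandler for a FileHandler redirects everything to disk, and logging.Filter objects can be attached to drop individual channels.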
A typical _push_initialization_config simply modifies weights_init and biases_init of the children bricks. It seems that it would be nice to separate this behaviour into a mixin brick called, let's say, Initializable.
Whoever implements it should probably also take care of pushing rng and theano_rng down the hierarchy.
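A rough sketch of what such a mixin could look like; the attribute names follow the issue, but everything else (the children list, the recursion) is an assumption about the brick interface:

```python
class Initializable:
    """Hypothetical mixin: push initialization settings down to children.

    Child settings that are already set explicitly are left untouched;
    only unset (None) attributes inherit the parent's value.
    """
    def __init__(self, weights_init=None, biases_init=None, rng=None):
        self.weights_init = weights_init
        self.biases_init = biases_init
        self.rng = rng
        self.children = []

    def _push_initialization_config(self):
        for child in self.children:
            for attr in ('weights_init', 'biases_init', 'rng'):
                if getattr(child, attr, None) is None:
                    setattr(child, attr, getattr(self, attr))
            if hasattr(child, '_push_initialization_config'):
                child._push_initialization_config()
```

With this shape, a brick only opts into the weights_init/biases_init/rng machinery by mixing in Initializable, instead of every brick carrying it.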
This is just a bunch of ideas that I think would be nice for monitoring. I want to start working on some optimization stuff, for which it would be handy if I can monitor a range of different things, many of which Pylearn2 doesn't support right now.
We need to keep in mind that Theano variable monitor channels need to be collected and compiled as a group (so that they can share a large part of the computational graph). Likewise, if we monitor the progress of each batch (useful for very long-running experiments like machine translation) it would be best if we can compile the monitoring value together with the update, so that they too can share the computational graph.
Instead of having check_theano_variable scattered throughout the code, I think we should leave those checks out for now, until we implement the error-checking mechanism we talked about at the beginning.
I haven't thought about the details yet, but I think something like this would be nice:
from itertools import groupby

@apply.check_inputs
def apply_check(self, input_name, input):
    if input_name == 'states':
        assert [key for key, _ in groupby(input.axes)] == ['time', 'batches', 'features']
        assert input.axes.count('batches') == 1
        assert input.ndim >= 3
        assert 'float' in input.dtype
This would be called automatically for all the inputs (and likewise for outputs?) by the Application class somewhere in __call__. It would catch the AssertionError and turn it into something useful like:

The input {} (index {}) provided to the application method {} of brick {} is invalid.
It was the output of application method {} of brick {}.
The assertion error that was raised was:
...
To turn off these checks, please set blocks.config.check_inputs = False
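A sketch of how the wrapping itself might look; the decorator factory, the error class, and the enabled flag are all assumptions about machinery that does not exist yet:

```python
import functools


class InputCheckError(Exception):
    """Raised when a user-supplied input check fails."""


def with_input_checks(check, checks_enabled=True):
    """Wrap an application method so `check` runs on every named input."""
    def decorator(apply_method):
        @functools.wraps(apply_method)
        def wrapper(self, **inputs):
            if checks_enabled:
                for name, value in inputs.items():
                    try:
                        check(self, name, value)
                    except AssertionError as error:
                        raise InputCheckError(
                            'The input {} provided to {} of brick {} is '
                            'invalid: {}'.format(name, apply_method.__name__,
                                                 type(self).__name__, error))
            return apply_method(self, **inputs)
        return wrapper
    return decorator


def integer_check(self, name, value):
    assert isinstance(value, int), 'expected an integer'


class ToyBrick:
    @with_input_checks(integer_check)
    def apply(self, x):
        return x + 1
```

The real version would read the enabled flag from blocks.config.check_inputs and fill in the full error template above, but the try/except shape would be the same.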
See subject. It is using tensor.nnet.softmax, which raises the following error:

ValueError: ('x must be 1-d or 2-d tensor of floats. Got ', TensorType(float64, 3D))

When this is fixed, it could be used in sequence_generators.py.
Seems like we need a new abstraction: a brick with a single input and a single output that supports setting the input and output dimensions, i.e. has writeable properties input_dim and output_dim, or has set_input_dim and set_output_dim methods. I suggest creating an abstract brick Feedforward for it.
Motivation: previously I used MLPs everywhere for intermediate transformations and configured them with mlp.dims[0] = smth. Now we also have LinearMaxout, and later we will have DaisyChain. There should be a shared configuration interface for all these guys.
Opinions?
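A sketch of the proposed interface; the names input_dim/output_dim come from the issue, while the MLPLike class and its dims mapping are hypothetical stand-ins:

```python
class Feedforward:
    """Sketch of an abstract single-input, single-output brick interface."""

    @property
    def input_dim(self):
        raise NotImplementedError

    @input_dim.setter
    def input_dim(self, value):
        raise NotImplementedError

    @property
    def output_dim(self):
        raise NotImplementedError

    @output_dim.setter
    def output_dim(self, value):
        raise NotImplementedError


class MLPLike(Feedforward):
    """An MLP would map input_dim/output_dim onto its dims list."""
    def __init__(self, dims):
        self.dims = dims

    @property
    def input_dim(self):
        return self.dims[0]

    @input_dim.setter
    def input_dim(self, value):
        self.dims[0] = value

    @property
    def output_dim(self):
        return self.dims[-1]

    @output_dim.setter
    def output_dim(self, value):
        self.dims[-1] = value
```

Generic configuration code could then write brick.input_dim = n without knowing whether it is dealing with an MLP, a LinearMaxout, or a DaisyChain.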
Now that we have an initial version of datasets, the next challenge: Serialization and preprocessing. This is a pretty difficult problem to tackle, so it's worth some thought.
Serialization is difficult because we don't want to pickle the data each time we save the model (e.g. there is no point in writing the same MNIST to disk time and time again). Not saving the data and loading it from disk when deserializing is the obvious solution, but it comes with pitfalls: if the data isn't where it was originally, deserialization will fail and our saved model is now useless.
DWF gave this some thought for Pylearn2, so let me quote from his outline of what would be a solution:
Automatically loading from disk on unpickling causes unpickling failure if the data has moved or if the path doesn’t exist on that machine. Don’t do one as a consequence of the other automatically.
Data loading should be overridable.
The lowest common denominator for every kind of dataset is to provide a directory path for loading or saving compatible disk representations. The load path should be specifiable via the constructor, and alterable by a method call.
Data loading should be lazy.
To that end, data should be fetched from data on an as-needed basis. [...] In particular, no additional disk access should be triggered by unpickling: this gives the user the opportunity to modify the data load path, if necessary. This prevents a lot of nasty situations where you cannot unpickle something and you don’t have permissions to create the data paths necessary in order to make it unpickle cleanly.
A separate problem, but one quite closely integrated with serialization, is preprocessing. If we want to support automatic preprocessing such as whitening, ZCA, contrast normalization, etc. there are a few things that should be considered:
To quote from DWF again:
Different facilities for in-memory and out-of-memory data; in-code awareness of which you’re dealing with.
Whole-dataset preprocessors should be able to know that they are operating on an in-memory dataset, and fail if they don’t know how to do something intelligent with a certain kind of out-of-memory dataset.
Keep records of transformations done on data.
It should be possible to regenerate preprocessed data from the default loading path and the log of preprocessing operations upon deserialization. Of course generic containers in the spirit of “DenseDesignMatrix” needn’t provide a default loading path, but given a deserialized object, the load -> (optionally) fit preprocessor -> apply preprocessor should be a simple method call away.
❗ However, I think the best point made is at the end of his document:
Is it really worth having the in-{Python,YAML} preprocessor facilities?
It seems like this is only of any use for relatively small, in-memory datasets. For large datasets (or even small datasets where you value reproducibility, as different machines will give you different ZCA results, for example), we could support it with a script like:

python preprocess.py \
    --config <description-of-preprocessing> \
    --data_output_path <directory> \
    --output_fitted_preprocessor <pickle file>
You could also add arguments for preprocessing from an already fitted preprocessor object, etc. To use a preprocessed dataset you’d then use the corresponding dataset class and override the load path in the constructor. Viewer code could be passed the serialized pickle file to know how to undo transformations and things like that.
It sounds to me like the pre-processing serialization, etc. would be a real nightmare, and I would argue for using this approach for all datasets, in-memory or not. Pre-processing then basically happens as an entirely separate step, through a script that takes as input particular datasets (NumPy arrays, HDF5, CSV, etc.) and a configuration file describing what preprocessing to apply. As output it produces a new dataset as well as a (pickle?) file which describes exactly what preprocessing has been performed, allowing you to reproduce it and/or reverse it.
We are missing an explanation of why we need those (inputs, outputs, states, etc.) and how they work.
The docstrings aren't lost, but Sphinx sees that the application methods are actually class instances, so it shows their signature as e.g. apply instead of apply(x, y). A solution is proposed here, but it doesn't seem too straightforward.
Bart, what do you mean by output_mode?
It would be nice to have an example, on the dataset everybody is familiar with, showcasing the way we separate the model constructor and regularization.
@sherjilozair, you wanted to contribute that?
In Groundhog a Layer typically had two references to a parameter (a shared variable):
self.W
self.params
A dangerous side of doing it this way is that to change a parameter you have to change it in two places.
If we want to stay away from this and only keep references to parameters in self.params (which I guess is what you want, judging from your code), we need a property for every parameter to provide convenient access. Such properties would refer to a specific position in the parameters list (see the W property that the Recurrent brick has).
But what if I have a varying number of parameters depending on the configuration of my brick? An example is the GatedRecurrent brick, which I plan to commit soon. We should be able to not use either of the update/reset gates. Thus the parameter's position in the list is no longer fixed.
What I propose is to keep a fixed position for every possible parameter of a brick and have None placeholders for those that are not used in the current configuration. What do you think?
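A toy sketch of the proposal, with strings standing in for shared variables; the class and parameter names are hypothetical:

```python
class GatedRecurrentLike:
    """Sketch: fixed parameter positions with None placeholders.

    Positions are fixed by convention:
      0 = state-to-state weights, 1 = update gate, 2 = reset gate.
    Unused parameters keep their slot but hold None.
    """
    def __init__(self, use_update_gate=True, use_reset_gate=True):
        self.params = [
            'W_state',
            'W_update' if use_update_gate else None,
            'W_reset' if use_reset_gate else None,
        ]

    @property
    def state_to_state(self):
        return self.params[0]

    @property
    def update_gate(self):
        return self.params[1]

    @property
    def reset_gate(self):
        return self.params[2]
```

The properties keep working regardless of the configuration, and code that iterates over params just has to skip the None entries.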
for i, state_below in enumerate(states_below):
    states_below[i] = state_below.copy()
    # TODO Copy any required tags here
    states_below[i].name = self.name + '_input_{}'.format(i)
Why do you do that?
Bart, I think state_below is a historical name that should follow fprop into the void. Do you agree that we could call it 'inputs'?
And another thing: I do not see the point in having the apply method process several variables homogeneously in one call. Many apply calls seem better to me.
Subj.
Somebody should make training the SequenceGenerator from PyLearn2 possible. I guess we need some new wrappers to do that.
During these two days I got the impression that our core, with its heavy usage of decorator syntax and descriptors, is pretty hard to get started with. It would be very nice to have a concise document that explains what the advanced Python features we rely on are and where one can get more information about them.
It turns out the syntax for declaring a metaclass changed in Python 3. That's why our abstract methods can be called in Python 3: they are not in fact abstract!
For more detail see the link below:
http://stackoverflow.com/questions/18513821/python-metaclass-understanding-the-with-metaclass/18513858#18513858
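A short illustration of the pitfall, written in Python 3 syntax; on a shared 2/3 codebase one would use six.with_metaclass or @six.add_metaclass instead:

```python
import abc

# Python 2 style: the metaclass is declared via a class attribute.
# Python 3 treats __metaclass__ as an ordinary attribute and ignores it,
# so ABCMeta is never applied and nothing is enforced.
class BadBrick(object):
    __metaclass__ = abc.ABCMeta

    @abc.abstractmethod
    def apply(self):
        pass


# Python 3 style: the metaclass goes in the class header, so instantiating
# a class with unimplemented abstract methods raises TypeError.
class GoodBrick(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def apply(self):
        pass
```

Under Python 3, BadBrick() succeeds and its "abstract" apply can be called, which is exactly the bug described above; GoodBrick() raises TypeError as intended.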
They still have the old interface. I noticed that several of @rizar's examples use custom cost application methods; we should probably avoid this in general, so that we don't end up effectively implementing MSE, cross-entropy, etc. in every brick, and of course, many bricks can be trained using a variety of costs.
The recent post from Jan in #85, revealing his vision of how monitoring should work, made me think further about what kind of main loop we want to end up with. If we take the old Groundhog and Pylearn2 main loops and push almost everything out of them, as Jan suggests, we end up with a tiny skeleton managing a set of callbacks. In this way we are moving towards an event-based framework instead of one with a fixed interaction scenario, even though that sounds a bit scary at first sight.
Another argument to show that we are very close to introducing events: in Montreal we discussed the interface between the datasets and the rest of the library. We converged on an object, which Jan calls "a view" and I call "a data stream", that creates an epoch iterator, which in turn creates a batch iterator. Whereas this covers many common cases like MNIST (when an epoch is a pass over the dataset) and nearly endless datasets like the one we used in machine translation (where an epoch can be defined as, e.g., 100 batches), it does not scale: if I am training a skip-gram model and I want to do something when I have processed all n-grams from a sentence, I have to declare a sentence an epoch, which might not be a good decision in other respects. If I want arbitrarily many time scales (word, sentence, paragraph, document, book), it becomes very challenging unless a notion of an event is defined. In fact, Jan already argued we need it, but he proposed to mix events with the data itself, which was not embraced by Bart and myself.
So much for the introduction; here comes the point: let's look at the main loop as a scheduler that triggers certain handlers when certain events happen.
Examples of events include:
When we have multiple events happening simultaneously (e.g. a tick and an epoch), the order of their processing should somehow be determined. We could have a priority associated with every event, e.g. 0 for a tick, 1 for a keyboard interrupt, -1 for an epoch. Events with a higher priority would be handled before events with a lower priority.
Also, we will typically have a few handlers for every event, and these should be executed in order. Recall Jan's example: we want to save after validation. To do that we assign a chain of two handlers to the EpochEndEvent: validation followed by saving.
There might be a need for interaction between handlers. E.g. if we want to finish the epoch and terminate after a keyboard interrupt is received, two actions need to be done for two different events: after the interrupt we should set a flag, which should then be checked at the end of each epoch. The backend monitor proposed by Jan can accommodate this flag. In fact this monitor is much more than a logging mechanism: it is a persistent time-indexed array through which the components of the system can interact with each other, and through which the user can access the whole interaction history.
It is possible that handlers will generate events as well. E.g. the EpochEndEvent will be triggered by a handler of the TickEvent, the main one, responsible for fetching data and doing a descent step. To allow that, every single component of the system must have access to the main loop object. It looks a bit weird when even iterators have a link to the main guy, but perhaps this is the price to pay.
This text should obviously be followed by lots of examples, but for now I will just share these raw thoughts, since I do not know when I will have time to write some code. The main point is: the generality we aim for requires complete decentralization, and for that I think we have to think in terms of events, not in terms of a fixed algorithm with a few hooks.
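To make this concrete, here is a minimal sketch of such a scheduler; every name here (MainLoop, register, trigger, the priority values) is an assumption, not an existing Blocks API:

```python
import heapq
from itertools import count


class MainLoop:
    """Sketch of an event-driven main loop.

    Handlers form an ordered chain per event name; when several events
    are pending simultaneously, higher-priority events are dispatched
    first (ties broken by trigger order).
    """
    def __init__(self):
        self.handlers = {}       # event name -> list of handlers, in order
        self.priorities = {}     # event name -> priority
        self.pending = []        # heap of (-priority, sequence, event)
        self._sequence = count()

    def register(self, event, handler, priority=0):
        self.handlers.setdefault(event, []).append(handler)
        self.priorities[event] = priority

    def trigger(self, event):
        heapq.heappush(self.pending,
                       (-self.priorities.get(event, 0),
                        next(self._sequence), event))

    def dispatch(self):
        while self.pending:
            _, _, event = heapq.heappop(self.pending)
            for handler in self.handlers.get(event, []):
                handler(self)    # handlers may trigger further events
```

With this shape, Jan's save-after-validation example is just two handlers registered on the epoch-end event, and a keyboard interrupt is a high-priority event whose handler merely sets a flag for the epoch-end handler to check.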
Write a setup.py that will allow blocks to be installed using e.g. pip
Make flake8 on Travis run separately, so that it's clear straight away whether the error is due to PEP8 or due to failing tests.
Stacked RNNs are so trendy nowadays that I believe somebody will need them soon. I think the right way to support them in Blocks is a RecurrentStack brick that takes a list of transitions and wraps them into one.
We should check the incoming request in MNIST.get_data. I spent an hour trying to understand what was going on when I simply passed it wrong indices. The challenge is that the request can be an index, a sequence of indices, or a slice, and all these cases need to be handled by different code.
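A sketch of what such an up-front check might look like; the helper name and the error messages are assumptions:

```python
def check_request(request, num_examples):
    """Validate a get_data request before any data access happens.

    A request may be a single index, a sequence of indices, or a slice;
    each case needs its own check.
    """
    if isinstance(request, slice):
        start, stop, _ = request.indices(num_examples)
        if start >= stop:
            raise ValueError('empty or reversed slice: {}'.format(request))
    elif isinstance(request, (list, tuple)):
        bad = [i for i in request if not 0 <= i < num_examples]
        if bad:
            raise ValueError('indices out of range: {}'.format(bad))
    elif isinstance(request, int):
        if not 0 <= request < num_examples:
            raise ValueError('index {} out of range [0, {})'
                             .format(request, num_examples))
    else:
        raise TypeError('unsupported request type: {}'
                        .format(type(request).__name__))
```

Failing fast with an explicit message like this would have saved the hour of debugging described above.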
For true "checkpointing" i.e. allowing the starting and stopping of training, we need to use a serialization method like pickle instead of dumping the parameters to a NumPy array like GroundHog does (which loses the states of all RNGs, both in the model as well as the dataset iterators). Pylearn2 does this reasonably well, but it has a few shortcomings, mostly regarding the pickling of the dataset iterators, which means that for many complex cases there is still randomness when restarting.
I think that the most robust way to do this is to make sure all classes can be made pickle-able. In most cases that won't require any work, but for datasets it means that we need to make sure that we don't pickle the actual data, but that we do pickle the exact way post-processing was performed (whitening, ZCA, shuffling). I think Pylearn2 originally considered doing it this way, but then they decided that it was too much work and stopped pickling datasets and iterators altogether. Now it's come back to bite them, because we really need checkpointing for running on the Helios cluster (which requires jobs to be restarted every 12 hours), so they're working on fixing this in Pylearn2 now.
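One standard way to pickle a dataset without its data is to drop the underlying array in __getstate__ and reload it lazily on first access; a minimal sketch (the class name, load method, and path attribute are all hypothetical):

```python
import pickle


class PickleableDataset:
    """Sketch: pickle everything about a dataset except the data itself."""

    def __init__(self, path):
        self.path = path
        self._data = None          # loaded lazily, never pickled

    @property
    def data(self):
        if self._data is None:
            self._data = self.load(self.path)
        return self._data

    def load(self, path):
        # Stand-in for reading e.g. MNIST from disk at `path`.
        return list(range(10))

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_data'] = None      # do not serialize the data
        return state
```

Because unpickling restores _data as None and triggers no disk access, the user gets a chance to fix self.path on the restored object before the first access to data, exactly as DWF's outline requires.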
Currently they are all in a top-level folder tests. But as we grow we will need more of them. I think it is OK to have them scattered across the repository. What do you guys think?
#115 breaks it by changing the way activation classes are serialized.
Perhaps we should just not use class generation? Apart from the serialization problems, it is quite hard to read for people with limited Python knowledge.
I think that DefaultRNG should be a separate class, to avoid a diamond-shaped inheritance pattern. I am now playing with the RNN implementation and I don't feel it's right to make the basic RNN class a descendant of DefaultRNG, because theoretically one could use the RNN framework I am writing for non-parametric transformations that do not require a random number generator to initialize parameters.
What do you think?
Just to recap, this is the way it is done in Pylearn2, which I think is a good start:
SubsetIterator is basically an iterator that churns out example indices in a variety of ways: sequential batches, random batches, random sequential batches, forcing batches to always be the same size (needed for convolutions), etc. When StopIteration is raised, the epoch is assumed to end.
FiniteDatasetIterator basically glues the SubsetIterator and the Dataset together; its main task is to provide the space conversions and to retrieve the data from the Dataset. It does this by assuming that the data can be represented as an object which supports indexing (including slices).
Dataset is in charge of loading the data and provides hooks for pre-processing (ZCA, etc.). It has a method (iterator) which uses the data to construct and return an instance of FiniteDatasetIterator. The FiniteDatasetIterator is what implements Python's iteration scheme in the end (calling the SubsetIterator in turn).
There are a few assumptions underlying this design that are problematic: it assumes datasets are finite, and it makes some assumptions about the way the data can be represented (mostly as in-memory arrays).
A few cases in which this is problematic:
So, a few ideas that I think could help mitigate these problems:
The SubsetIterator needs to be optional; an infinite dataset would look like:

class Gaussian(InfiniteDataset, DefaultRNG):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __next__(self):
        return self.rng.normal(self.mean, self.std)

The Dataset has full access to the SubsetIterator, so it can do e.g. buffering if it wants to. This also means that the dataset is in complete control of how it represents data in memory.

def __iter__(self):
    batches_iterator = cycle(subset_iterator)
    # put these in a multiprocessing queue that always creates batches in advance

def __next__(self):
    return self.queue.get()  # return from the queue

A Dataset doesn't have to support all iteration schemes. If e.g. word2vec is implemented as a sequential reading of text files (as in the original implementation), then it doesn't need to support shuffling.
A Dataset should always be infinite, even for finite datasets. Monitoring for finite datasets can simply be done whenever num_examples_seen % dataset.num_examples == 0. This removes any confusion as to what constitutes an "epoch"; the only unit of measurement will be "number of training samples seen".
I'm not going to try and summarize the discussion at NIPS, but just wanted somewhere to collect any further discussion. Also, see @rizar's Gist which attempts to outline the API: https://gist.github.com/rizar/cc62fb9d6270f9856793
Did anyone write more code than this? Else I can start to try adapting the dataset code I've written so far to make it fit in with the new design.
...such as Emitter and Feedback.
The file https://github.com/bartvm/blocks/blob/master/blocks/bricks/recurrent.py contains a tool to build recurrent bricks: the recurrent wrapper. It lacks test coverage and documentation. For instance, the order in which sequences, states, and contexts are to be returned should be explained. The wrapper should be covered by tests independently of the example bricks (Recurrent, GatedRecurrent).
@orhanf, this might be a task for you.
It does not have the state parameter; loading happens in __init__.
Implement attention mechanism and port machine translation.
Work in progress; progress can be seen in the branch https://github.com/bartvm/blocks/tree/attention
We could use https://github.com/jaberg/skdata to download popular datasets and even to load them into memory.
I noticed that the MLP class actually didn't work with lazy initialisation (because len(activations) gave an error when activations = None). We should probably add a unit test which tries to initialise each Brick subclass without arguments, to see if they are lazy-loading compatible.
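A sketch of such a test, using a toy hierarchy in place of the real Brick tree; the class names below are illustrative only:

```python
def all_subclasses(cls):
    """Yield every (transitive) subclass of `cls`."""
    for sub in cls.__subclasses__():
        yield sub
        yield from all_subclasses(sub)


def check_lazy_initialisation(base):
    """Try to construct each subclass with no arguments at all."""
    failures = []
    for cls in all_subclasses(base):
        try:
            cls()
        except TypeError as error:
            failures.append((cls.__name__, str(error)))
    return failures


# Toy hierarchy standing in for Brick and its subclasses:
class Brick:
    pass


class LazyMLP(Brick):
    def __init__(self, activations=None):
        # Guard against len(None): the bug described above.
        self.num_layers = len(activations) if activations is not None else 0


class EagerMLP(Brick):
    def __init__(self, activations):       # not lazy-compatible
        self.num_layers = len(activations)
```

Running check_lazy_initialisation over the real Brick base class in the test suite would flag every brick whose constructor cannot tolerate missing arguments.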
Hello, I love it that this project is building higher level abstractions on top of Theano. That should be really useful. What I'm more interested in though is what Blocks provides (or aims to provide) that Pylearn2 doesn't. :)
Title is self-explanatory!
It's quite a bit faster on GPU
The annotated computation graph produced by a brick hierarchy contains a lot of very useful information. One could extract from it all application calls and the order in which they happened. This information would be super helpful for debugging, e.g. one could see that a returned value was never used at some point.
Basically one needs to do two things:
I've just noticed that there is no way to add None to outputs_info in your recurrent wrapper.
Maybe we should add a None for every output from application.outputs which is not a state?
So Yoshua and Myriam really want the attention mechanism in Pylearn2. I told them that it would be silly to duplicate the effort by re-implementing everything which is in Blocks in Pylearn2, so ideally we want a nice wrapper that allows people to use block-models (attention-mechanism ones in particular) in Pylearn2. I think that if necessary we can assign some CCW tickets i.e. assign people to work on changes in Pylearn2 that we need to make things compatible.
Theano is already Python 3 compatible, and Pylearn2 will probably be very soon. I think it's easiest to make blocks Python 3 compatible from the very beginning, because making it compatible later on might be a lot more difficult.
I've never maintained a project that supports both Python 2 and 3 at the same time, but purely based on what I've read and seen, I suggest we use a single common codebase supporting both Python 2.7 and 3.4, using six whenever needed (but as little as possible). This is the way NumPy, Scipy, matplotlib, sympy, etc. have all done it.
@rizar Comments?
A design choice we should make: should there be a base class for recurrent bricks?
First I thought it should be there, then I deleted it because there was no logic in it. Now it seems we could put a default implementation of the initial_state method in it, which would make it more transparent to the user which method should be overridden in order to change the initialization behaviour. Such a default implementation would query the state dimensions via the self.dimension(state_name) interface.
What do you think?
That's a really minor thing, but anyway: when I write IsotropicGaussian(0.01) I expect it to have zero mean and standard deviation 0.01, whereas it is currently the other way round. Should I fix that?
I've just realized that we cannot introduce changes into a scan node via theano.clone. See https://gist.github.com/rizar/ea6aa6e750e1a9548703
That casts serious doubt on the ambitious plan to implement all sorts of regularization as modifications of a ready cost computation graph (which I liked so much). Maybe I should create a Theano issue for this?
We currently have a situation where we want a set of methods of all the derived classes of Brick to be wrapped by methods defined in the Brick base class, e.g. _initialize, _allocate, _push_allocation_config, etc. are all wrapped by the methods initialize, allocate, ....
There are a few ways of doing this. The way we do it right now is by calling the public method allocate on the base class, which internally calls the non-public method _allocate defined on the child classes. This works, but personally I am not a fan of non-public methods being the main features of bricks. It might also be confusing to people (and I think ugly) that they need to define _initialize but call initialize; if they override initialize directly, nothing will work. The current approach also doesn't allow us to document the initialize method of each brick separately.
I was wondering if it would not be better to switch to using decorators here. We use them quite heavily already, but I think this is exactly what they are made for: wrapping methods simply, and explicitly. So instead we would have
class SomeBrick(Brick):
    @lazy
    def __init__(self, dim):
        self.dim = dim

    @application
    def apply(self, x, y):
        """Applying this brick does a, b and c."""
        ...

    @initialization
    def initialize(self):
        """Initialize the 3 weight matrices of this block."""
        ...

    @allocation
    def allocate(self):
        """Allocate the parameters. Note that `dim` must be set before calling this."""
        ...
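For illustration, here is one way such an allocation decorator could be implemented; the bookkeeping shown (an allocated flag, a params list, idempotent re-allocation) is an assumption about what the base class currently does around _allocate:

```python
import functools


def allocation(method):
    """Hypothetical decorator wrapping a brick's allocate method.

    Performs the shared bookkeeping before and after the brick-specific
    body, while functools.wraps keeps the brick's own docstring visible
    (e.g. to Sphinx).
    """
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, 'allocated', False):
            return                     # allocating twice is a no-op
        self.params = []
        result = method(self, *args, **kwargs)
        self.allocated = True
        return result
    return wrapper


class SomeBrick:
    def __init__(self, dim):
        self.dim = dim

    @allocation
    def allocate(self):
        """Allocate the single weight matrix of this brick."""
        self.params.append('W_{}'.format(self.dim))
```

Note that the brick author now defines and documents allocate itself, with no public/non-public method pair, which addresses both objections above.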
Title speaks for itself!