recurrent-batch-normalization's Introduction

This repository contains the code that was used for the Sequential MNIST, Penn Treebank and text8 experiments in the paper Recurrent Batch Normalization (http://arxiv.org/abs/1603.09025). For the Attentive Reader, see https://github.com/cooijmanstim/Attentive_reader/tree/bn.

The experiments directory contains shell scripts that demonstrate how to launch the experiments with the hyperparameters from the paper.

Depends on Theano, Blocks and Fuel.

recurrent-batch-normalization's People

Contributors

aaroncourville, ballasn, caglar, cooijmanstim, thrandis

recurrent-batch-normalization's Issues

How are batch statistics computed?

I'm implementing recurrent BN in Keras, but looking at the original paper and those citing it, a detail remains unclear to me: how are batch statistics computed? In the original, the authors state (pg. 3) (emphasis mine):

At training time, the statistics E[h] and Var[h] are estimated by the sample mean and sample variance of the current minibatch

Yet another paper (pg. 3) using and citing it describes:

We subscript BN by time (BN_t) to indicate that each time step tracks its own mean and variance. In practice, we track these statistics as they change over the course of training using an exponential moving average (EMA)

My question is thus two-fold:

  1. Are minibatch statistics computed per immediate minibatch, or as an EMA?
  2. How are the inference parameters gamma and beta, which are shared across all timesteps, computed? Is the computation in (1) simply averaged across all timesteps (e.g. averaging EMA_t over all t)?
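To make the distinction concrete, here is a minimal NumPy sketch of the per-timestep scheme as I understand it from the paper: gamma is shared across timesteps, each timestep tracks its own population statistics via an EMA, but the training-time normalization itself uses the immediate minibatch's statistics. The function and variable names (`bn_step`, `moving_mean`, etc.) are mine, not from the repository:

```python
import numpy as np

def bn_step(h_t, t, gamma, moving_mean, moving_var, momentum=0.99, eps=1e-3):
    """Normalize one timestep's pre-activations h_t (batch, features)
    using the current minibatch statistics, while updating that
    timestep's own EMA of the population statistics."""
    mean_t = h_t.mean(axis=0)   # sample mean over the minibatch
    var_t = h_t.var(axis=0)     # sample variance over the minibatch
    # Each timestep t tracks separate population statistics.
    moving_mean[t] = momentum * moving_mean[t] + (1 - momentum) * mean_t
    moving_var[t] = momentum * moving_var[t] + (1 - momentum) * var_t
    # Training-time normalization uses the *current* minibatch statistics.
    return gamma * (h_t - mean_t) / np.sqrt(var_t + eps)
```

Under this reading, the EMA is only a device for estimating population statistics for inference; it plays no role in the training-time forward pass.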

Existing implementations: the Keras and TF ones below, but all are outdated, and I am unsure about their correctness

  • Keras, TF-A, and TF-B
  • All of the above agree that during training, immediate minibatch statistics are used, and that beta and gamma are updated as an EMA over these minibatches
  • Problem: the bn operation (in A, and presumably B & C) is applied to a single timestep slice, which is then passed to the K.rnn control flow for re-iteration. Hence, the EMA is computed over both minibatches and timesteps - which I find questionable:
  • An EMA is used in place of a simple average when population statistics are dynamic (e.g. minibatch-to-minibatch), whereas we have access to all timesteps of a minibatch prior to having to update gamma and beta
  • An EMA is a worse but at times necessary alternative to a simple average, and per the above, we can use the latter - so why don't we? Timestep statistics can be cached, averaged at the end, then discarded - this also holds for stateful=True
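A sketch of that cache-then-average alternative, assuming the whole sequence of a minibatch is available at once (the helper name and the `(timesteps, batch, features)` layout are my own assumptions, not from any of the linked implementations):

```python
import numpy as np

def minibatch_stats_simple_average(h_seq):
    """h_seq: (timesteps, batch, features). Cache each timestep's batch
    statistics, then combine them with a plain average rather than an
    EMA, since every timestep of the minibatch is already in hand."""
    means = h_seq.mean(axis=1)      # per-timestep batch means, (timesteps, features)
    variances = h_seq.var(axis=1)   # per-timestep batch variances
    return means.mean(axis=0), variances.mean(axis=0)
```

Whether this averaging across timesteps is actually what the paper intends is exactly the open question above; the paper's own description keeps statistics separate per timestep.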

Is beta being trained or is it fixed at zero?

The paper states that both the hidden and input betas are set to the zero vector to avoid unnecessary redundancy (e.g. the LSTM bias is enough), but it's not clear whether they are trained or just kept constant at zero. They are called parameters throughout, and in the experiments they are initialized, which implies training.
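For reference, the fixed-at-zero reading would make the normalization look like the following sketch, where gamma is the only trainable BN parameter and any shift is left to the LSTM's own bias (the function name is hypothetical):

```python
import numpy as np

def bn_no_shift(h, gamma, eps=1e-3):
    """BN with beta fixed at the zero vector: only the scale gamma
    remains a trainable parameter; the LSTM bias provides any shift."""
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    return gamma * (h - mean) / np.sqrt(var + eps)  # no "+ beta" term
```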

Weight initialization scheme, identity matrix for hidden-to-hidden, why?

First, I love the empirical reasoning about keeping initial gamma values smaller than unit variance. Good stuff! 💃

However, I'm curious why you didn't just use orthogonal initialization throughout your experiments. Your paper states that for particular tasks you got better generalization with the batch normalized LSTM when you let the hidden-to-hidden weights start from the identity matrix. Do you think this has more to do with the specific tasks than the model? It struck me as a fairly exotic initialization and thus I worry that it might be especially important with your particular model (that I'm trying to reimplement). Could you elaborate a little, please?

Also, is it the hidden-to-hidden weights for the input (g in the paper, eq. 6), or also the gates (e.g. is this right?)?
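For concreteness, here is a small NumPy sketch of the two initializations being compared; the helper names are mine, and the QR-based orthogonal scheme is one common construction, not necessarily the one used in this repository:

```python
import numpy as np

def identity_init(n, scale=1.0):
    """Hidden-to-hidden weights start as a (scaled) identity, so the
    recurrence initially copies the hidden state forward unchanged."""
    return scale * np.eye(n)

def orthogonal_init(n, rng):
    """Common alternative: a random orthogonal matrix obtained via QR."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform distribution
```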

Bad Performance in test data set

I have read your paper, which gives excellent results compared to the original LSTM. However, after implementing this idea in TensorFlow, I have gotten very bad results on the test data set. Here is the source code.
The following picture shows the accuracy results. The yellow line is the training loss. The blue line is the training loss when the training flag in the batch norm is set to false. The red line is the test loss.
[image: training and test loss curves]
I have tried the bn-lstm code you have mentioned. For example:
https://github.com/OlavHN/bnlstm
https://gist.github.com/spitis/27ab7d2a30bbaf5ef431b4a02194ac60
They got a similar result. Please take a look at this discussion, and help me figure out whether I have made a stupid mistake.
@cooijmanstim @ballasn @Thrandis @caglar
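One common cause of this symptom, offered as a guess rather than a diagnosis of the linked code: if inference keeps normalizing with the current batch's statistics instead of the tracked per-timestep population statistics, test loss can degrade badly. A minimal sketch of inference-mode normalization (names are hypothetical and do not correspond to any of the linked implementations):

```python
import numpy as np

def bn_inference(h_t, t, gamma, moving_mean, moving_var, eps=1e-3):
    """At test time, use timestep t's tracked population statistics
    rather than the current minibatch's, so each example's output does
    not depend on the other examples in its batch."""
    return gamma * (h_t - moving_mean[t]) / np.sqrt(moving_var[t] + eps)
```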

Padding Affecting Batch Norm

Hey Tim,

The batch norm LSTM paper published has pretty stellar results. I've followed the keras issue thread where you stated that gamma and beta are shared throughout all timesteps, yet the actual statistics should be kept for each timestep separately. Thanks for clarifying this.

Unfortunately, I implemented the batch norm LSTM (gamma = 0.1) in tensorflow, and it seems to not perform as well as a regular LSTM. I'm applying this to sequences that have padding in them.

My question is this: is it possible that the padding is throwing everything off? Towards the end of the sequence, padding probably disrupts the batch mean and variance, as Laurent et al. suggested (https://arxiv.org/pdf/1510.01378v1.pdf).

Did you try padding in any of your experiments? If so, how did you bridge that gap? I'm thinking that towards the end of the sequence, the mean and variance should be kept from timestep 10, where clearly no padding is occurring.
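One workaround along these lines is to mask padded examples out of the statistics entirely, so each timestep's mean and variance are computed only over sequences that are still "alive". A sketch of what I have in mind (the helper name and mask convention are my own):

```python
import numpy as np

def masked_batch_stats(h_t, mask_t):
    """Batch mean/variance at one timestep over non-padded examples.
    h_t: (batch, features); mask_t: (batch,), 1.0 for real steps,
    0.0 for padding."""
    n = mask_t.sum()
    mean = (h_t * mask_t[:, None]).sum(axis=0) / n
    var = (((h_t - mean) ** 2) * mask_t[:, None]).sum(axis=0) / n
    return mean, var
```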

Thanks!
