recurrent-batch-normalization's Introduction

This repository contains the code that was used for the Sequential MNIST, Penn Treebank and text8 experiments in the paper Recurrent Batch Normalization (http://arxiv.org/abs/1603.09025). For the Attentive Reader, see https://github.com/cooijmanstim/Attentive_reader/tree/bn.

The experiments directory contains shell scripts that demonstrate how to launch the experiments with the hyperparameters from the paper.

Depends on Theano, Blocks and Fuel.

recurrent-batch-normalization's People

Contributors

aaroncourville, ballasn, caglar, cooijmanstim, thrandis

recurrent-batch-normalization's Issues

How are batch statistics computed?

I'm implementing recurrent BN in Keras, but looking at the original paper and those citing it, a detail remains unclear to me: how are batch statistics computed? In the original, the authors state (pg. 3) (emphasis mine):

At training time, the statistics E[h] and Var[h] are estimated by the sample mean and sample variance of the current minibatch

Yet another paper (pg. 3) using and citing it describes:

We subscript BN by time (BN_t) to indicate that each time step tracks its own mean and variance. In practice, we track these statistics as they change over the course of training using an exponential moving average (EMA)

My question is thus two-fold:

  1. Are minibatch statistics computed per immediate minibatch, or as an EMA?
  2. How are the inference parameters gamma and beta, which are shared across all timesteps, computed? Is the computation in (1) simply averaged across all timesteps (e.g. averaging EMA_t over all t)?
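To make the distinction concrete, here is a minimal NumPy sketch of the per-timestep scheme as I understand it from the paper: gamma is shared across timesteps, each timestep tracks its own population statistics via an EMA, but the training-time normalization itself uses the immediate minibatch's statistics. The function and variable names (`bn_step`, `moving_mean`, etc.) are mine, not from the repository:

```python
import numpy as np

def bn_step(h_t, t, gamma, moving_mean, moving_var, momentum=0.99, eps=1e-3):
    """Normalize one timestep's pre-activations h_t (batch, features)
    using the current minibatch statistics, while updating that
    timestep's own EMA of the population statistics."""
    mean_t = h_t.mean(axis=0)   # sample mean over the minibatch
    var_t = h_t.var(axis=0)     # sample variance over the minibatch
    # Each timestep t tracks separate population statistics.
    moving_mean[t] = momentum * moving_mean[t] + (1 - momentum) * mean_t
    moving_var[t] = momentum * moving_var[t] + (1 - momentum) * var_t
    # Training-time normalization uses the *current* minibatch statistics.
    return gamma * (h_t - mean_t) / np.sqrt(var_t + eps)
```

Under this reading, the EMA is only a device for estimating population statistics for inference; it plays no role in the training-time forward pass.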

Existing implementations: the Keras and TF ones below, but all are outdated, and I am unsure about their correctness

  • Keras, TF-A, and TF-B
  • All of the above agree that during training, immediate minibatch statistics are used, and that beta and gamma are updated as an EMA over these minibatches
  • Problem: the bn operation (in A, and presumably B & C) is applied to a single timestep slice, which is then passed to the K.rnn control flow for re-iteration. Hence, the EMA is computed over both minibatches and timesteps - which I find questionable:
  • An EMA is used in place of a simple average when population statistics are dynamic (e.g. minibatch-to-minibatch), whereas we have access to all timesteps of a minibatch prior to having to update gamma and beta
  • An EMA is a worse but at times necessary alternative to a simple average, and per the above, we can use the latter - so why don't we? Timestep statistics can be cached, averaged at the end, then discarded - this also holds for stateful=True
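A sketch of that cache-then-average alternative, assuming the whole sequence of a minibatch is available at once (the helper name and the `(timesteps, batch, features)` layout are my own assumptions, not from any of the linked implementations):

```python
import numpy as np

def minibatch_stats_simple_average(h_seq):
    """h_seq: (timesteps, batch, features). Cache each timestep's batch
    statistics, then combine them with a plain average rather than an
    EMA, since every timestep of the minibatch is already in hand."""
    means = h_seq.mean(axis=1)      # per-timestep batch means, (timesteps, features)
    variances = h_seq.var(axis=1)   # per-timestep batch variances
    return means.mean(axis=0), variances.mean(axis=0)
```

Whether this averaging across timesteps is actually what the paper intends is exactly the open question above; the paper's own description keeps statistics separate per timestep.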

Is beta being trained or is it fixed at zero?

The paper states that both the hidden and input betas are set to the zero vector to avoid unnecessary redundancy (e.g. the LSTM bias is enough), but it's not clear whether they are trained or just kept constant at zero. They are called parameters throughout, and in the experiments they are initialized, which implies training.
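For reference, the fixed-at-zero reading would make the normalization look like the following sketch, where gamma is the only trainable BN parameter and any shift is left to the LSTM's own bias (the function name is hypothetical):

```python
import numpy as np

def bn_no_shift(h, gamma, eps=1e-3):
    """BN with beta fixed at the zero vector: only the scale gamma
    remains a trainable parameter; the LSTM bias provides any shift."""
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    return gamma * (h - mean) / np.sqrt(var + eps)  # no "+ beta" term
```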

Weight initialization scheme, identity matrix for hidden-to-hidden, why?

First, I love the empirical reasoning about keeping initial gamma values smaller than unit variance. Good stuff! 💃

However, I'm curious why you didn't just use orthogonal initialization throughout your experiments. Your paper states that for particular tasks you got better generalization with the batch normalized LSTM when you let the hidden-to-hidden weights start from the identity matrix. Do you think this has more to do with the specific tasks than the model? It struck me as a fairly exotic initialization and thus I worry that it might be especially important with your particular model (that I'm trying to reimplement). Could you elaborate a little, please?

Also, is it the hidden-to-hidden weights for the input (g in the paper, eq. 6), or also the gates (e.g. is this right?)?
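For concreteness, here is a small NumPy sketch of the two initializations being compared; the helper names are mine, and the QR-based orthogonal scheme is one common construction, not necessarily the one used in this repository:

```python
import numpy as np

def identity_init(n, scale=1.0):
    """Hidden-to-hidden weights start as a (scaled) identity, so the
    recurrence initially copies the hidden state forward unchanged."""
    return scale * np.eye(n)

def orthogonal_init(n, rng):
    """Common alternative: a random orthogonal matrix obtained via QR."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform distribution
```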

Bad Performance in test data set

I have read your paper, which gives excellent results compared to the original LSTM. However, after implementing this idea in TensorFlow, I have gotten very bad results on the test data set. Here is the source code.
The following picture shows the accuracy results. The yellow line is the training loss. The blue line is the training loss when the training flag in the batch norm is set to false. The red line is the test loss.
[image: training and test loss curves]
I have tried the bn-lstm code you have mentioned. For example:
https://github.com/OlavHN/bnlstm
https://gist.github.com/spitis/27ab7d2a30bbaf5ef431b4a02194ac60
They got a similar result. Please take a look at this discussion, and help me figure out whether I have made a stupid mistake.
@cooijmanstim @ballasn @Thrandis @caglar
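One common cause of this symptom, offered as a guess rather than a diagnosis of the linked code: if inference keeps normalizing with the current batch's statistics instead of the tracked per-timestep population statistics, test loss can degrade badly. A minimal sketch of inference-mode normalization (names are hypothetical and do not correspond to any of the linked implementations):

```python
import numpy as np

def bn_inference(h_t, t, gamma, moving_mean, moving_var, eps=1e-3):
    """At test time, use timestep t's tracked population statistics
    rather than the current minibatch's, so each example's output does
    not depend on the other examples in its batch."""
    return gamma * (h_t - moving_mean[t]) / np.sqrt(moving_var[t] + eps)
```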

Padding Affecting Batch Norm

Hey Tim,

The batch norm LSTM paper published has pretty stellar results. I've followed the keras issue thread where you stated that gamma and beta are shared throughout all timesteps, yet the actual statistics should be kept for each timestep separately. Thanks for clarifying this.

Unfortunately, I implemented the batch norm LSTM (gamma = 0.1) in tensorflow, and it seems to not perform as well as a regular LSTM. I'm applying this to sequences that have padding in them.

My question is this: is it possible that the padding is throwing everything off? Towards the end of the sequence, padding probably disrupts the batch mean and variance, as Laurent et al. suggested (https://arxiv.org/pdf/1510.01378v1.pdf).

Did you try padding in any of your experiments? If so, how did you bridge that gap? I'm thinking that towards the end of the sequence, the mean and variance should be kept from timestep 10, where clearly no padding is occurring.
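One workaround along these lines is to mask padded examples out of the statistics entirely, so each timestep's mean and variance are computed only over sequences that are still "alive". A sketch of what I have in mind (the helper name and mask convention are my own):

```python
import numpy as np

def masked_batch_stats(h_t, mask_t):
    """Batch mean/variance at one timestep over non-padded examples.
    h_t: (batch, features); mask_t: (batch,), 1.0 for real steps,
    0.0 for padding."""
    n = mask_t.sum()
    mean = (h_t * mask_t[:, None]).sum(axis=0) / n
    var = (((h_t - mean) ** 2) * mask_t[:, None]).sum(axis=0) / n
    return mean, var
```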

Thanks!
