maf's People

Contributors

gpapamak, josemanuel22

maf's Issues

Migration to Python 3.6.4

While trying to understand the code I realized that it runs on Python 2, so I decided to attempt a small migration. From what I've seen, there are serious compatibility problems with Python > 3.6. After a few minor changes I've managed to get the code working on version 3.6.4 (the changes would still need to be tested exhaustively). It would be great if you could give me access to push my migration branch.

You can find my code in the fork on my GitHub.

Update requested

I'm not an expert, but I've been working hard to run this code on Google Colab. It looks like it does not work with the latest Python packages. Please make the small changes required to run on Python 3 so that people like me can run this code. Nice paper, by the way!

Problem with preprocessing of UCI datasets, especially MiniBooNE

When doing density estimation on the UCI datasets HEPMASS and MiniBooNE, I saw in appendix D.2 of the article that several dimensions of the raw data were removed because certain real values recur too frequently. This makes sense to me, since such densities would involve Dirac delta components, which are problematic to estimate with continuous densities. However, when I checked the code I stumbled upon the following line:

max_count = np.array([v for k, v in sorted(c.iteritems())])[0]

It seems to compute the maximum over the counts of each real value, but when I implemented it myself I found that this is not the case. sorted sorts the (value, count) pairs by their first entry, which is the real value the count corresponds to, and not the count itself. I demonstrate this problem in the following notebook:
https://gist.github.com/VincentStimper/bed1aa10ac187dc51eefa85e683a7df4
It also showcases the consequences. For the HEPMASS dataset there is coincidentally no difference between the features that get dropped and the features that would be dropped when max_count is computed correctly, i.e. by using

max_count = np.max(np.unique(feature, return_counts=True)[1])

For MiniBooNE, on the other hand, some dimensions are dropped although their max_count is only moderately high, e.g. 6, while dimensions with values recurring 3434 times are kept.

This might be a minor issue, but since the version of the MiniBooNE dataset you made publicly available has been used numerous times by others as a benchmark for density estimation, I think it requires our attention.
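
To make the failure mode concrete without the notebook, here is a minimal toy sketch (hypothetical data; Python 3's Counter.items() stands in for the Python 2 iteritems()):

import numpy as np
from collections import Counter

feature = np.array([0.1, 0.5, 0.5, 0.5, 0.9, 0.9])
c = Counter(feature)

# sorted(c.items()) orders the pairs by the real value, so index [0] picks
# the count of the *smallest* value (1 here), not the largest count.
buggy_max_count = np.array([v for k, v in sorted(c.items())])[0]  # -> 1

# Correct: the largest number of occurrences of any single value.
max_count = np.max(np.unique(feature, return_counts=True)[1])     # -> 3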

Can you provide details on your configuration (Theano version especially)?

I have tried running your code but got the following error message (MNIST experiments):

theano.gof.fg.MissingInputError: A variable that is an input to the graph was neither provided as an input to the function nor given a value. A chain of variables leading from this input to an output is [x, dot.0, Elemwise{add,no_inplace}.0, Elemwise{add,no_inplace}.0, Elemwise{add,no_inplace}.0, h1, dot.0, Elemwise{add,no_inplace}.0, Elemwise{add,no_inplace}.0, h2, dot.0, logp, Elemwise{mul,no_inplace}.0, Elemwise{exp,no_inplace}.0, Elemwise{mul,no_inplace}.0, Sum{axis=[0], acc_dtype=float64}.0, mean]. This chain may not be unique
Backtrace when the variable is created:
  File "run_experiments.py", line 245, in <module>
    main()
  File "run_experiments.py", line 241, in main
    methods[name]()
  File "run_experiments.py", line 184, in run_experiments_mnist
    ex.train_maf_cond([n_hiddens]*2, act_fun, n_layers*i, mode)
  File "/u/home/maf/experiments.py", line 248, in train_maf_cond
    model = mafs.ConditionalMaskedAutoregressiveFlow(data.n_labels, data.n_dims, n_hiddens, act_fun, n_mades, mode=mode)
  File "/u/home/maf/ml/models/mafs.py", line 172, in __init__
    self.input = tt.matrix('x', dtype=dtype) if input is None else input

It looks like the model is not getting the data properly. Could this be caused by changes in the Theano version?

How do you preprocess your data?

I am trying to run your code on the POWER and GAS datasets.
The data I downloaded from the link consists of .txt files.
However, in your code, you read from a file called 'data.npy'.

def load_data():
    return np.load(datasets.root + 'power/data.npy')

Could you please provide the code used to preprocess the data and generate the .npy files?
Thanks.
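
For reference, a minimal hypothetical sketch of such a conversion, assuming the raw download is a plain whitespace-delimited numeric table (the filename is an assumption, not from the repo):

import numpy as np

# Hypothetical: load the downloaded text file as a 2-D float array and save
# it in the .npy layout that load_data() above expects.
raw = np.loadtxt('power/raw.txt')
np.save('power/data.npy', raw)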

Broken datasets due to pandas API changes

Hello @gpapamak,

Due to API changes in pandas, the GAS and HEPMASS datasets are not usable anymore. Notably, the DataFrame.as_matrix method has been deprecated since pandas=0.23.0, and the DataFrame pickling format of pandas<2.0 is not compatible with pandas>=2.0. There is also an issue with Counter.iteritems, which was removed in Python 3.
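
A quick sketch of the substitutions implied above (my paraphrase, not code from the fork):

import pandas as pd
from collections import Counter

df = pd.DataFrame({'a': [1, 2, 3]})
data = df.to_numpy()     # modern replacement for the removed df.as_matrix()

c = Counter(['x', 'x', 'y'])
pairs = list(c.items())  # Python 3 replacement for c.iteritems()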

I don't think modifying this repository to fix these issues is a good idea, as it could break the code. Instead, I made a lightweight fork (francois-rozet/uci-datasets) of the repo's UCI datasets and wrote instructions to generate environment-agnostic .npy files containing the processed data. These .npy files can then be used without relying on the original code and its dependencies. I hope that's OK with you.

Can you provide the preprocessed datasets?

It's unclear how attributes with a Pearson correlation coefficient greater than 0.98 are eliminated. Since correlation is computed between pairs of attributes, how do you decide which attribute of a pair to eliminate?

Thanks.
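
One plausible greedy scheme, stated purely as an assumption (the repo may break ties differently): scan the columns in order and drop any column that is too correlated with an already-kept column.

import numpy as np

def drop_correlated(data, threshold=0.98):
    # Absolute pairwise Pearson correlations between columns.
    corr = np.abs(np.corrcoef(data, rowvar=False))
    keep = []
    for j in range(data.shape[1]):
        # Keep column j only if it is not too correlated with any kept column.
        if all(corr[i, j] <= threshold for i in keep):
            keep.append(j)
    return data[:, keep]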

Batch normalization

Hi!

Thanks for sharing this amazing work!

I'm trying to port your code to PyTorch (for further use in my research).

I have a question regarding your implementation of Batch Norm. As you mention in the paper, it's implemented using global batch statistics. Could you please provide pointers to the lines where it is implemented exactly? My knowledge of Theano is a little bit rusty.
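
For context, a minimal PyTorch sketch of an invertible batch-norm layer for normalizing flows, in the spirit of the paper (my paraphrase of the transformation, not the repo's Theano code):

import torch
import torch.nn as nn

class BatchNormFlow(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        # Statistics of the current batch (no running averages at train time).
        m = x.mean(dim=0)
        v = x.var(dim=0, unbiased=False) + self.eps
        u = (x - m) / v.sqrt() * self.log_gamma.exp() + self.beta
        # Log-determinant of the Jacobian, identical for every example.
        log_det = (self.log_gamma - 0.5 * v.log()).sum()
        return u, log_det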

Preprocessed data

Hello,

Thank you for the code.
Could you specify the preprocessing methods you apply in the original datasets (e.g. mnist)? Apart from dequantization, logit and all the functions which are already in the code.
