
chinet's People

Contributors

axnedergaard, coree, lasaouma, niladell

chinet's Issues

Attention complications

A few things are unclear with regard to attention:

  1. Sentence generation could depend on the target sentence: the generator takes in the document hidden state obtained by processing the input sentences. If we use attention to weight the input sentences when computing the document hidden state, the generated sentence will depend to some extent on the target sentence, which seems unintended; then again, "helping" the generator in this way may be a good thing. (See the sketch after this list.)

  2. Whenever we have more than one discriminator score in a session run, the document hidden state likely needs to be computed for both target sentences once we implement attention.
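
For reference, a rough sketch of what dot-product attention over the per-sentence hidden states could look like (TF1-style; sentence_states and query are made-up names, not existing variables in cgan.py). Note how, if query is a candidate ending, the resulting document state depends on that ending, which is exactly the coupling described in point 1.

    import tensorflow as tf

    def attend_sentences(sentence_states, query):
        # sentence_states: [batch, num_sentences, state_size], query: [batch, state_size]
        # score each input sentence against the query via a dot product
        scores = tf.reduce_sum(sentence_states * tf.expand_dims(query, 1), axis=2)
        weights = tf.nn.softmax(scores)  # [batch, num_sentences], softmax over sentences
        # weighted sum of the sentence states gives the attended document hidden state
        return tf.reduce_sum(sentence_states * tf.expand_dims(weights, 2), axis=1)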

Debug model.py

Should be fairly easy to ensure that everything here is working correctly, as the code is not too extensive (the relevant functions are load_embedding, pretrain, train, and evaluate).

Confirm validity of Gumbel softmax

I implemented the Gumbel softmax in cgan.py:gumbel_softmax(). I am using matrix operations, whereas the Chinese et al. paper uses more explicit loops. We need to confirm that the calculation done in the gumbel_softmax() function is correct.
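
For comparison, a minimal vectorized sketch of the usual Gumbel-softmax sampling (logits assumed to have shape [batch, vocab_size]); this is not copied from cgan.py:gumbel_softmax(), just the calculation it should be equivalent to:

    import tensorflow as tf

    def gumbel_softmax_sketch(logits, temperature=1.0, eps=1e-20):
        # sample Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
        uniform = tf.random_uniform(tf.shape(logits), minval=0.0, maxval=1.0)
        gumbel = -tf.log(-tf.log(uniform + eps) + eps)
        # softmax of (logits + noise) / temperature gives a differentiable sample
        return tf.nn.softmax((logits + gumbel) / temperature)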

Fix summary logging

At the moment summary logging is broken (it requests placeholders that are not fed) and has been commented out in model.py:train(). We need to fix the summaries so that only the necessary variables are requested.
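
One possible fix (a sketch; the loss names below are stand-ins for whatever cgan.py actually exposes) is to merge an explicit list of summaries instead of relying on tf.summary.merge_all():

    import tensorflow as tf

    # hypothetical stand-ins for the loss tensors built in cgan.py
    discriminator_loss = tf.placeholder(tf.float32, [], name="d_loss")
    generator_loss = tf.placeholder(tf.float32, [], name="g_loss")

    d_summary = tf.summary.scalar("discriminator_loss", discriminator_loss)
    g_summary = tf.summary.scalar("generator_loss", generator_loss)
    # merge only these instead of tf.summary.merge_all(), so fetching the summary
    # does not pull in ops whose placeholders are not fed at this training step
    train_summaries = tf.summary.merge([d_summary, g_summary])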

Improve fetching optimizers

The way optimizers are retrieved with lists in model.py:train is very hacky; this should be improved by using a dictionary instead.
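
Something along these lines (a sketch; the losses and variable lists are placeholders for whatever cgan.py builds):

    import tensorflow as tf

    # sketch only: discriminator_loss/generator_loss and the var lists stand in
    # for the tensors and variable collections defined in cgan.py
    optimizers = {
        "discriminator": tf.train.AdamOptimizer(1e-4).minimize(
            discriminator_loss, var_list=discriminator_vars),
        "generator": tf.train.AdamOptimizer(1e-4).minimize(
            generator_loss, var_list=generator_vars),
    }

    # in model.py:train, fetch by name instead of by list position:
    # sess.run(optimizers["generator"], feed_dict=feed_dict)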

Implement attention

Implement attention mechanism in cgan.py

Talking to me will probably be helpful for understanding how and where attention fits into the graph code.

Generator loss doubts

I was just going over the generator loss function

generator_loss = tf.reduce_sum(-tf.log(score_generated) - cosine_similarity(target_state, generated_state))

and there are a couple of things I don't fully get.

On the one hand, I assume the cosine similarity part is a way to make the generated sentence more similar to the target sentence. Should we really be doing that? On the other hand, we don't seem to have any "explicit feedback" from the discriminator. Maybe it's late and I'm just mixing things up, but shouldn't the discriminator play an active (or at least explicit) role in the generator loss?
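
For reference, a sketch of the two terms, assuming score_generated is the discriminator's probability that the generated ending is real, in which case the -log term is where the discriminator's feedback enters the loss:

    import tensorflow as tf

    def cosine_similarity(a, b, eps=1e-8):
        # cosine similarity along the state dimension, one value per batch element
        dot = tf.reduce_sum(a * b, axis=-1)
        norm_a = tf.sqrt(tf.reduce_sum(tf.square(a), axis=-1))
        norm_b = tf.sqrt(tf.reduce_sum(tf.square(b), axis=-1))
        return dot / (norm_a * norm_b + eps)

    # generator_loss = tf.reduce_sum(
    #     -tf.log(score_generated + 1e-8)                      # discriminator feedback
    #     - cosine_similarity(target_state, generated_state))  # pull towards target ending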

Stop/start words and preprocessing questions

We need a stop word to determine when a sentence output by the generator is finished. We need to decide what its vocab index should be, and we need to extend the "load embeddings" function in model.py to assign the embedded value to the graph in cgan.py.

Since the generator uses the last word it generated when generating a new word, we also need to decide what the first word we feed it should be. As with the stop word, this needs to be assigned in the graph.

These considerations raise the question of whether we really should be preprocessing our sentences with <start> and <stop> tokens like in the previous project.
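
Whatever we decide, the numpy side of it could look roughly like this (a sketch; add_special_tokens and the token strings are made up, not existing code):

    import numpy as np

    def add_special_tokens(vocab, embedding_matrix):
        # vocab: dict word -> index, embedding_matrix: [vocab_size, embed_dim]
        embed_dim = embedding_matrix.shape[1]
        for token in ("<start>", "<stop>"):
            if token not in vocab:
                vocab[token] = len(vocab)
                # random vector for the special token; it can then be pushed into
                # the graph in cgan.py, e.g. through a placeholder and an assign op
                new_row = np.random.uniform(-0.25, 0.25, (1, embed_dim))
                embedding_matrix = np.vstack(
                    [embedding_matrix, new_row.astype(embedding_matrix.dtype)])
        return vocab, embedding_matrix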

Deal with non-constant length data

Since we have not padded our stories or done any pre-processing along those lines, our inputs of, say, shape [batch_size, X] have a non-constant size X within the same batch, so we cannot turn them into placeholders.

As a result, the current version of the model only accepts batches of size one and crashes if that is changed. I haven't been able to tell whether there is an easy fix for that or whether we have to do something more convoluted.

Edit: To be clear, this is not a batch size problem; it is just that trying to make a tensor out of a list like [[1, 2], [1, 2, 3], ... [1, 2, 3, ..., n]] will crash because the elements of the list have different sizes.
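
A possible way out (a sketch, not existing code) is to pad each batch to its longest sentence and keep the true lengths around, e.g. for the sequence_length argument of tf.nn.dynamic_rnn:

    import numpy as np

    def pad_batch(sentences, pad_index=0):
        # sentences: list of lists of vocab indices with different lengths
        lengths = np.array([len(s) for s in sentences], dtype=np.int32)
        max_len = lengths.max()
        padded = np.full((len(sentences), max_len), pad_index, dtype=np.int32)
        for i, s in enumerate(sentences):
            padded[i, :len(s)] = s
        return padded, lengths

    padded, lengths = pad_batch([[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]])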

Debug cgan.py

I imagine most of our time debugging will be spent on this

Implement prediction

Provide a binary prediction value (whether sentence1 or sentence2 was the right ending) in gan.py for evaluation/testing.
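
A minimal sketch of what this could be, assuming the discriminator produces one probability per candidate ending (score_ending1/score_ending2 are hypothetical names):

    import tensorflow as tf

    # hypothetical discriminator outputs for the two candidate endings
    score_ending1 = tf.placeholder(tf.float32, [None], name="score_ending1")
    score_ending2 = tf.placeholder(tf.float32, [None], name="score_ending2")

    scores = tf.stack([score_ending1, score_ending2], axis=1)  # [batch, 2]
    prediction = tf.argmax(scores, axis=1)  # 0 -> ending 1 scored higher, 1 -> ending 2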

Add write on Test File

The current version of the code returns a list of {0, 1}s for the prediction/evaluation. We need to take that, change it to {1, 2}s I think (we need to check how they want it), and write it out to a file. I'm not sure of the specifications at the moment, though. @coree can you take care of it?
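
Something as simple as this should do (a sketch; the {1, 2} convention and the file name still need to be checked against the spec):

    def write_predictions(predictions, path="predictions.csv"):
        # predictions: iterable of 0/1 values from the evaluation run
        with open(path, "w") as f:
            for p in predictions:
                f.write("%d\n" % (p + 1))  # 0 -> 1 (first ending), 1 -> 2 (second ending)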

Fix summary ops for pretraining

Similar to the issue we had earlier in model:train, where summary ops required placeholders that were not appropriate for the run (has this been fixed?), model:pretrain exhibits the same behavior. As a temporary fix, I have commented out the summary ops inside model:pretrain for now.

Conditional generation input

It is unclear exactly how inputs should be fed into the generator during conditional generation.

At the moment, the document hidden state, a random seed, and the previous word (the embedded start word at the first timestep) are concatenated and fed into the generator at every timestep, along with the generator hidden state. It is unclear whether this is the right way to do it.

Alternatives include (not mutually exclusive):

  1. adding random noise to the document hidden state and concatenating this with the previous word
  2. using the random noise and document hidden state to initialize the generator hidden state and only passing in the previous word at every time step (sketched below)

During unconditional generation, the document hidden state is not used, but the generator hidden state size must be fixed.
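
As an illustration, alternative 2 could look roughly like this (a sketch; the shapes and names are assumptions about the cgan.py graph, with batch size 1 as in the current model):

    import tensorflow as tf

    batch_size, doc_size, noise_size, gen_state_size = 1, 512, 100, 512

    document_state = tf.placeholder(tf.float32, [batch_size, doc_size])
    noise = tf.random_normal([batch_size, noise_size])

    # project [document_state; noise] into the generator's initial hidden state;
    # each timestep then only receives the previous word
    init_input = tf.concat([document_state, noise], axis=1)
    initial_state = tf.layers.dense(init_input, gen_state_size, activation=tf.nn.tanh)
    # for unconditional generation the document_state part can be zeroed or dropped,
    # which keeps the generator hidden state size fixed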

Implement adding noise to discriminator

Later on in training, random noise should be added to either the generated sentence state or the target sentence state during discriminator training. This is meant to stabilize GAN training by preventing the discriminator from separating the real and generated distributions too easily.

We need to implement this under the "Generator loss" part in cgan.py. It is unclear to me exactly how to pass the information about the current training step from model.py into cgan.py. We also need to decide what a good training step threshold is.
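
One option (a sketch, not a decision) is to feed the current training step through a placeholder and gate the noise on a step threshold:

    import tensorflow as tf

    train_step = tf.placeholder(tf.float32, [], name="train_step")
    noise_threshold = 5000.0  # step after which noise kicks in; value still to be decided
    noise_std = 0.1

    def maybe_add_noise(sentence_state):
        # Gaussian noise on the sentence state, only active past the threshold
        noise = tf.random_normal(tf.shape(sentence_state), stddev=noise_std)
        use_noise = tf.cast(train_step > noise_threshold, tf.float32)
        return sentence_state + use_noise * noise

    # model.py:train would then add {train_step: current_step} to the feed_dict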

Load embeddings: vocab

The load_embeddings() function in model.py needs access to our vocab (used in preprocessor.py?)

Improving preprocessing given our current embedding

I just realized that we are doing extremely naive preprocessing, especially with punctuation marks, numbers, and compound (e.g. hyphenated) words. This is definitely something we should look into.

In the particular case of punctuation marks, we should be able to get rid of them without any big problem, given that we just have a set of short sentences. We should also probably convert all numbers to text; @coree, since you are more familiar with it, any idea whether NLTK, spaCy, or similar libraries handle these things directly? There are also some words like "to" that I don't understand at all why they fail. The last thing is the names, which, as I said, we are improperly replacing with an arbitrary string in the dehumanizing function; we should probably use real names instead.

This is an example of the current status when loading the word2vec embedding:

03/06 00:10 WARNING . not in embedding file
03/06 00:10 WARNING to not in embedding file
03/06 00:10 WARNING friendly_blorg_1 not in embedding file
03/06 00:10 WARNING a not in embedding file
03/06 00:10 WARNING and not in embedding file
03/06 00:10 WARNING , not in embedding file
03/06 00:10 WARNING of not in embedding file
03/06 00:10 WARNING 's not in embedding file
03/06 00:10 WARNING ! not in embedding file
03/06 00:10 WARNING friendly_blorg_2 not in embedding file
03/06 00:10 WARNING friendly_blorg_3 not in embedding file
03/06 00:10 WARNING 10 not in embedding file
03/06 00:10 WARNING ' not in embedding file
03/06 00:10 WARNING 20 not in embedding file
03/06 00:10 WARNING 100 not in embedding file
03/06 00:10 WARNING - not in embedding file
03/06 00:10 WARNING 30 not in embedding file
03/06 00:10 WARNING 50 not in embedding file
03/06 00:10 WARNING 15 not in embedding file
03/06 00:10 WARNING '' not in embedding file
03/06 00:10 WARNING `` not in embedding file
03/06 00:10 WARNING ? not in embedding file
03/06 00:10 WARNING 12 not in embedding file
03/06 00:10 WARNING cancelled not in embedding file
03/06 00:10 WARNING 18 not in embedding file
03/06 00:10 WARNING co-worker not in embedding file
03/06 00:10 WARNING 16 not in embedding file
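
As a starting point, a rough sketch of a less naive cleaning pass (the number mapping is just an illustration of the idea; whether NLTK/spaCy should handle this instead is still open):

    import string

    NUMBER_WORDS = {"10": "ten", "12": "twelve", "15": "fifteen", "16": "sixteen",
                    "18": "eighteen", "20": "twenty", "30": "thirty",
                    "50": "fifty", "100": "one hundred"}

    def clean_tokens(tokens):
        cleaned = []
        for tok in tokens:
            if all(ch in string.punctuation for ch in tok):
                continue  # drop ".", ",", "!", "''", "``", "-", ...
            # spell out known numbers; "one hundred" becomes two tokens
            cleaned.extend(NUMBER_WORDS.get(tok, tok).split())
        return cleaned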

Embedded stop word identification

At every step of generation, we need to identify whether the generator generated the <stop_word>. Generated words are computed in embedded form from the generator hidden state through the Gumbel softmax, which results in somewhat "noisy" embeddings that do not correspond exactly to the embedded word. We need to resolve this.

My current solution is to take the reduced sum of the difference between the generated embedded word and the true embedded <stop_word>, and to treat them as identical if the result is less than a "stop word error bound". If we proceed with this approach, we need to pick a value for the error bound.
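
A sketch along those lines, using the absolute difference so that positive and negative components do not cancel (the bound value is a placeholder):

    import tensorflow as tf

    stop_word_error_bound = 0.1  # placeholder value, needs tuning

    def is_stop_word(generated_embedding, stop_word_embedding):
        # generated_embedding, stop_word_embedding: [batch, embed_dim]
        distance = tf.reduce_sum(
            tf.abs(generated_embedding - stop_word_embedding), axis=-1)
        return distance < stop_word_error_bound  # boolean per batch element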
