
chinet's People

Contributors

axnedergaard, coree, lasaouma, niladell

chinet's Issues

Attention complications

A few things are unclear with regard to attention:

  1. Sentence generation could depend on the target sentence: the generator takes in the document hidden state obtained by processing the input sentences. If we use attention to weight the input sentences when computing the document hidden state, the generated sentence will depend to some extent on the target sentence, which seems unintended; then again, "helping" the generator in this way may be a good thing. (See the sketch after this list.)

  2. Whenever we have more than one discriminator score in a session run, the document hidden state likely needs to be computed for both target sentences once we implement attention.
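
For reference, a rough sketch of what dot-product attention over the per-sentence hidden states could look like (TF1-style; sentence_states and query are made-up names, not existing variables in cgan.py). Note how, if query is a candidate ending, the resulting document state depends on that ending, which is exactly the coupling described in point 1.

    import tensorflow as tf

    def attend_sentences(sentence_states, query):
        # sentence_states: [batch, num_sentences, state_size], query: [batch, state_size]
        # score each input sentence against the query via a dot product
        scores = tf.reduce_sum(sentence_states * tf.expand_dims(query, 1), axis=2)
        weights = tf.nn.softmax(scores)  # [batch, num_sentences], softmax over sentences
        # weighted sum of the sentence states gives the attended document hidden state
        return tf.reduce_sum(sentence_states * tf.expand_dims(weights, 2), axis=1)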

Debug model.py

Should be fairly easy to ensure that everything here is working correctly, as the code is not too extensive (the relevant functions are load_embedding, pretrain, train, and evaluate).

Confirm validity of Gumbel softmax

I implemented the Gumbel softmax in cgan.py:gumbel_softmax(). I am using matrix operations, whereas the Chinese et al. paper uses more explicit loops. We need to confirm that the calculation done in the gumbel_softmax() function is correct.
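
For comparison, a minimal vectorized sketch of the usual Gumbel-softmax sampling (logits assumed to have shape [batch, vocab_size]); this is not copied from cgan.py:gumbel_softmax(), just the calculation it should be equivalent to:

    import tensorflow as tf

    def gumbel_softmax_sketch(logits, temperature=1.0, eps=1e-20):
        # sample Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
        uniform = tf.random_uniform(tf.shape(logits), minval=0.0, maxval=1.0)
        gumbel = -tf.log(-tf.log(uniform + eps) + eps)
        # softmax of (logits + noise) / temperature gives a differentiable sample
        return tf.nn.softmax((logits + gumbel) / temperature)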

Fix summary logging

At the moment summary logging is broken (it requests placeholders that are not fed) and has been commented out in model.py:train(). We need to fix the summaries so that only the necessary variables are requested.
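
One possible fix (a sketch; the loss names below are stand-ins for whatever cgan.py actually exposes) is to merge an explicit list of summaries instead of relying on tf.summary.merge_all():

    import tensorflow as tf

    # hypothetical stand-ins for the loss tensors built in cgan.py
    discriminator_loss = tf.placeholder(tf.float32, [], name="d_loss")
    generator_loss = tf.placeholder(tf.float32, [], name="g_loss")

    d_summary = tf.summary.scalar("discriminator_loss", discriminator_loss)
    g_summary = tf.summary.scalar("generator_loss", generator_loss)
    # merge only these instead of tf.summary.merge_all(), so fetching the summary
    # does not pull in ops whose placeholders are not fed at this training step
    train_summaries = tf.summary.merge([d_summary, g_summary])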

Improve fetching optimizers

The way optimizers are retrieved with lists in model.py:train is very hacky; this should be improved by using a dictionary instead.
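
Something along these lines (a sketch; the losses and variable lists are placeholders for whatever cgan.py builds):

    import tensorflow as tf

    # sketch only: discriminator_loss/generator_loss and the var lists stand in
    # for the tensors and variable collections defined in cgan.py
    optimizers = {
        "discriminator": tf.train.AdamOptimizer(1e-4).minimize(
            discriminator_loss, var_list=discriminator_vars),
        "generator": tf.train.AdamOptimizer(1e-4).minimize(
            generator_loss, var_list=generator_vars),
    }

    # in model.py:train, fetch by name instead of by list position:
    # sess.run(optimizers["generator"], feed_dict=feed_dict)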

Implement attention

Implement attention mechanism in cgan.py

Talking to me will probably be helpful for understanding how and where attention fits into the graph code.

Generator loss doubts

I was just going over the generator loss function

generator_loss = tf.reduce_sum(-tf.log(score_generated) - cosine_similarity(target_state, generated_state))

and there are a couple of things I don't fully get.

On the one hand, I assume the cosine similarity part is a way to make the generated sentence more similar to the target sentence. Should we really be doing that? On the other hand, we don't seem to have any "explicit feedback" from the discriminator. Maybe it's late and I'm just mixing things up, but shouldn't the discriminator play an active (or at least explicit) role in the generator loss?
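
For reference, a sketch of the two terms, assuming score_generated is the discriminator's probability that the generated ending is real, in which case the -log term is where the discriminator's feedback enters the loss:

    import tensorflow as tf

    def cosine_similarity(a, b, eps=1e-8):
        # cosine similarity along the state dimension, one value per batch element
        dot = tf.reduce_sum(a * b, axis=-1)
        norm_a = tf.sqrt(tf.reduce_sum(tf.square(a), axis=-1))
        norm_b = tf.sqrt(tf.reduce_sum(tf.square(b), axis=-1))
        return dot / (norm_a * norm_b + eps)

    # generator_loss = tf.reduce_sum(
    #     -tf.log(score_generated + 1e-8)                      # discriminator feedback
    #     - cosine_similarity(target_state, generated_state))  # pull towards target ending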

Stop/start words and preprocessing questions

We need a stop word to determine when a sentence output by the generator is finished. We need to decide what its vocab index should be, and we need to extend the "load embeddings" function in model.py to assign the embedded value to the graph in cgan.py.

Since the generator uses the last word it generated when generating a new word, we also need to decide what the first word we feed it should be. As with the stop word, this needs to be assigned in the graph.

These considerations raise the question of whether we really should be preprocessing our sentences with <start> and <stop> tokens like in the previous project.
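
Whatever we decide, the numpy side of it could look roughly like this (a sketch; add_special_tokens and the token strings are made up, not existing code):

    import numpy as np

    def add_special_tokens(vocab, embedding_matrix):
        # vocab: dict word -> index, embedding_matrix: [vocab_size, embed_dim]
        embed_dim = embedding_matrix.shape[1]
        for token in ("<start>", "<stop>"):
            if token not in vocab:
                vocab[token] = len(vocab)
                # random vector for the special token; it can then be pushed into
                # the graph in cgan.py, e.g. through a placeholder and an assign op
                new_row = np.random.uniform(-0.25, 0.25, (1, embed_dim))
                embedding_matrix = np.vstack(
                    [embedding_matrix, new_row.astype(embedding_matrix.dtype)])
        return vocab, embedding_matrix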

Deal with non-constant length data

Since we have not padded our stories or done any pre-processing along those lines, our inputs of, say, shape [batch_size, X] have a non-constant size X within the same batch, so we cannot turn them into placeholders.

As a result, the current version of the model only accepts batches of size one and crashes if that is changed. I haven't been able to tell whether there is an easy fix for that or whether we have to do something more convoluted.

Edit: To be clear, this is not a batch size problem; it is just that trying to make a tensor out of a list like [[1, 2], [1, 2, 3], ... [1, 2, 3, ..., n]] will crash because the elements of the list have different sizes.
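
A possible way out (a sketch, not existing code) is to pad each batch to its longest sentence and keep the true lengths around, e.g. for the sequence_length argument of tf.nn.dynamic_rnn:

    import numpy as np

    def pad_batch(sentences, pad_index=0):
        # sentences: list of lists of vocab indices with different lengths
        lengths = np.array([len(s) for s in sentences], dtype=np.int32)
        max_len = lengths.max()
        padded = np.full((len(sentences), max_len), pad_index, dtype=np.int32)
        for i, s in enumerate(sentences):
            padded[i, :len(s)] = s
        return padded, lengths

    padded, lengths = pad_batch([[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]])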

Debug cgan.py

I imagine most of our time debugging will be spent on this

Implement prediction

Provide a binary prediction value (whether sentence1 or sentence2 was the right ending) in gan.py for evaluation/testing.
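
A minimal sketch of what this could be, assuming the discriminator produces one probability per candidate ending (score_ending1/score_ending2 are hypothetical names):

    import tensorflow as tf

    # hypothetical discriminator outputs for the two candidate endings
    score_ending1 = tf.placeholder(tf.float32, [None], name="score_ending1")
    score_ending2 = tf.placeholder(tf.float32, [None], name="score_ending2")

    scores = tf.stack([score_ending1, score_ending2], axis=1)  # [batch, 2]
    prediction = tf.argmax(scores, axis=1)  # 0 -> ending 1 scored higher, 1 -> ending 2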

Add write on Test File

The current version of the code returns a list of {0, 1}s for the prediction/evaluation. We need to take that, change it to {1, 2}s I think (we need to check how they want it), and write it out to a file. I'm not sure of the specifications at the moment, though. @coree can you take care of it?
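
Something as simple as this should do (a sketch; the {1, 2} convention and the file name still need to be checked against the spec):

    def write_predictions(predictions, path="predictions.csv"):
        # predictions: iterable of 0/1 values from the evaluation run
        with open(path, "w") as f:
            for p in predictions:
                f.write("%d\n" % (p + 1))  # 0 -> 1 (first ending), 1 -> 2 (second ending)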

Fix summary ops for pretraining

Similar to the issue we had earlier in model:train, where summary ops required placeholders that were not appropriate for the run (has this been fixed?), model:pretrain exhibits the same behavior. As a temporary fix, I have commented out the summary ops inside model:pretrain for now.

Conditional generation input

It is unclear exactly how inputs should be fed into the generator during conditional generation.

At the moment, the document hidden state, a random seed, and the previous word (the embedded start word at the first timestep) are concatenated and fed into the generator at every timestep, along with the generator hidden state. It is unclear whether this is the right way to do it.

Alternatives include (not mutually exclusive):

  1. adding random noise to the document hidden state and concatenating this with the previous word
  2. using the random noise and document hidden state to initialize the generator hidden state and only passing in the previous word at every time step (sketched below)

During unconditional generation, the document hidden state is not used, but the generator hidden state size must be fixed.
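
As an illustration, alternative 2 could look roughly like this (a sketch; the shapes and names are assumptions about the cgan.py graph, with batch size 1 as in the current model):

    import tensorflow as tf

    batch_size, doc_size, noise_size, gen_state_size = 1, 512, 100, 512

    document_state = tf.placeholder(tf.float32, [batch_size, doc_size])
    noise = tf.random_normal([batch_size, noise_size])

    # project [document_state; noise] into the generator's initial hidden state;
    # each timestep then only receives the previous word
    init_input = tf.concat([document_state, noise], axis=1)
    initial_state = tf.layers.dense(init_input, gen_state_size, activation=tf.nn.tanh)
    # for unconditional generation the document_state part can be zeroed or dropped,
    # which keeps the generator hidden state size fixed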

Implement adding noise to discriminator

Later on in training, random noise should be added to either the generated sentence state or the target sentence state during discriminator training. This is meant to stabilize GAN training by preventing the discriminator from separating the real and generated distributions too easily.

We need to implement this under the "Generator loss" part in cgan.py. It is unclear to me exactly how to pass the information about the current training step from model.py into cgan.py. We also need to decide what a good training step threshold is.
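
One option (a sketch, not a decision) is to feed the current training step through a placeholder and gate the noise on a step threshold:

    import tensorflow as tf

    train_step = tf.placeholder(tf.float32, [], name="train_step")
    noise_threshold = 5000.0  # step after which noise kicks in; value still to be decided
    noise_std = 0.1

    def maybe_add_noise(sentence_state):
        # Gaussian noise on the sentence state, only active past the threshold
        noise = tf.random_normal(tf.shape(sentence_state), stddev=noise_std)
        use_noise = tf.cast(train_step > noise_threshold, tf.float32)
        return sentence_state + use_noise * noise

    # model.py:train would then add {train_step: current_step} to the feed_dict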

Load embeddings: vocab

The load_embeddings() function in model.py needs access to our vocab (used in preprocessor.py?)

Improving preprocessing given our current embedding

I just realized that we are doing extremely naive preprocessing, especially with punctuation marks, numbers, and compound (e.g. hyphenated) words. This is definitely something we should look into.

In the particular case of punctuation marks, we should be able to get rid of them without any big problem, given that we just have a set of short sentences. We should also probably convert all numbers to text; @coree, since you are more familiar with it, any idea whether NLTK, spaCy, or similar libraries handle these things directly? There are also some words like "to" that I don't understand at all why they fail. The last thing is the names, which, as I said, we are improperly replacing with an arbitrary string in the dehumanizing function; we should probably use real names instead.

This is an example of the current status when loading the word2vec embedding:

03/06 00:10 WARNING . not in embedding file
03/06 00:10 WARNING to not in embedding file
03/06 00:10 WARNING friendly_blorg_1 not in embedding file
03/06 00:10 WARNING a not in embedding file
03/06 00:10 WARNING and not in embedding file
03/06 00:10 WARNING , not in embedding file
03/06 00:10 WARNING of not in embedding file
03/06 00:10 WARNING 's not in embedding file
03/06 00:10 WARNING ! not in embedding file
03/06 00:10 WARNING friendly_blorg_2 not in embedding file
03/06 00:10 WARNING friendly_blorg_3 not in embedding file
03/06 00:10 WARNING 10 not in embedding file
03/06 00:10 WARNING ' not in embedding file
03/06 00:10 WARNING 20 not in embedding file
03/06 00:10 WARNING 100 not in embedding file
03/06 00:10 WARNING - not in embedding file
03/06 00:10 WARNING 30 not in embedding file
03/06 00:10 WARNING 50 not in embedding file
03/06 00:10 WARNING 15 not in embedding file
03/06 00:10 WARNING '' not in embedding file
03/06 00:10 WARNING `` not in embedding file
03/06 00:10 WARNING ? not in embedding file
03/06 00:10 WARNING 12 not in embedding file
03/06 00:10 WARNING cancelled not in embedding file
03/06 00:10 WARNING 18 not in embedding file
03/06 00:10 WARNING co-worker not in embedding file
03/06 00:10 WARNING 16 not in embedding file
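
As a starting point, a rough sketch of a less naive cleaning pass (the number mapping is just an illustration of the idea; whether NLTK/spaCy should handle this instead is still open):

    import string

    NUMBER_WORDS = {"10": "ten", "12": "twelve", "15": "fifteen", "16": "sixteen",
                    "18": "eighteen", "20": "twenty", "30": "thirty",
                    "50": "fifty", "100": "one hundred"}

    def clean_tokens(tokens):
        cleaned = []
        for tok in tokens:
            if all(ch in string.punctuation for ch in tok):
                continue  # drop ".", ",", "!", "''", "``", "-", ...
            # spell out known numbers; "one hundred" becomes two tokens
            cleaned.extend(NUMBER_WORDS.get(tok, tok).split())
        return cleaned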

Embedded stop word identification

At every step of generation, we need to identify whether the generator generated the <stop_word>. Generated words are computed in embedded form from the generator hidden state through the Gumbel softmax, which results in somewhat "noisy" embeddings that do not correspond exactly to the embedded word. We need to resolve this.

My current solution is to take the reduced sum of the difference between the generated embedded word and the true embedded <stop_word>, and to treat them as identical if the result is less than a "stop word error bound". If we proceed with this approach, we need to pick a value for the error bound.
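
A sketch along those lines, using the absolute difference so that positive and negative components do not cancel (the bound value is a placeholder):

    import tensorflow as tf

    stop_word_error_bound = 0.1  # placeholder value, needs tuning

    def is_stop_word(generated_embedding, stop_word_embedding):
        # generated_embedding, stop_word_embedding: [batch, embed_dim]
        distance = tf.reduce_sum(
            tf.abs(generated_embedding - stop_word_embedding), axis=-1)
        return distance < stop_word_error_bound  # boolean per batch element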
