Comments (11)

j314erre commented on July 28, 2024

I think I've gotten to the bottom of the performance differences on the rt-polarity dataset, and since there is no inherent problem with this tensorflow implementation, I'll close this issue.

In summary, I was able to replicate the accuracy of CNN_Sentence by making a few simple code changes to cnn-text-classification-tf to mimic CNN_Sentence's approach to weight initialization, sentence padding, and learning rate.

For future reference, I'll detail my findings below.

In all cases I ran the code with the following commands to equalize the model sizes at all layers:

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand

python train.py --embedding_dim 300 --num_filters 100 --batch_size 50 --num_epochs 25 --l2_reg_lambda 0.15

I've found that the difference in dev-set accuracy over the same number of training steps between cnn-text-classification-tf and CNN_Sentence can be attributed to these implementation differences:

  • CNN_Sentence initializes the output weights W and b to zero instead of a truncated normal distribution (improves accuracy from 0.66 to 0.72)
  • CNN_Sentence pads the front of each sentence with five PAD symbols, not just the end (improves accuracy from 0.72 to 0.76)

Here is a graph of dev set accuracy comparing CNN_Sentence to cnn-text-classification-tf run out-of-the-box, plus two versions where I made code changes to cnn-text-classification-tf to address the above differences:

[image: graph of dev-set accuracy for CNN_Sentence vs. cnn-text-classification-tf out-of-the-box, +zero_weights, and +initial_padding]

I made code changes to my copy of cnn-text-classification-tf as follows:

+zero_weights

Initialized output weights to 0.0 in text_cnn.py:

        # Final (unnormalized) scores and predictions
        with tf.name_scope("output"):
            # zero init keeps the initial logits at zero, so the softmax
            # starts out uniform and the initial loss is -ln(0.5) for two classes
            #W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
            W = tf.Variable(tf.constant(0.0, shape=[num_filters_total, num_classes]), name="W")
            b = tf.Variable(tf.constant(0.0, shape=[num_classes]), name="b")
            ...

+initial_padding

Added padding symbols to the beginning of all sentences and extended them to a common length by padding the ends, in data_helpers.py:

def pad_sentences(sentences, padding_word="<PAD/>", max_filter=5):
    """
    Pads all sentences to the same length: max_filter padding words at the
    front, then enough padding at the end to bring every sentence to the
    length of the longest sentence plus the front and back padding.
    Returns padded sentences.
    """
    pad_filter = max_filter - 1
    # max_filter pads at the front, at least pad_filter pads at the back
    sequence_length = max(len(x) for x in sentences) + max_filter + pad_filter
    padded_sentences = []
    for sentence in sentences:
        num_padding = sequence_length - len(sentence) - max_filter
        new_sentence = [padding_word] * max_filter + sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences

+learning_rate

Increased the learning rate for the optimizer in train.py:

optimizer = tf.train.AdamOptimizer(0.001)

dennybritz commented on July 28, 2024

Hm, which dataset are you comparing? Your own?

A bit of a discrepancy is expected, but not as big as the one in your graph. In my blog post I got accuracy pretty similar to Kim on the movie review dataset.

Is your graph accuracy on the training set? The dev set? Graphing the loss may also be helpful. The first things I would look at are definitely the network hyperparameters (embedding size, filter sizes, number of filters, dropout, etc.). I'm sure Kim's code has different defaults than mine. I doubt that the optimizer is the problem.

I'd start by graphing the training set accuracy and seeing if you are able to overfit it - if you cannot, your network is probably not "big" enough.

j314erre commented on July 28, 2024

I am comparing the exact same rt-polarity dataset in both cases, with the exact same hyperparameters. The graph is on the dev set, so it should be apples-to-apples. So far I have increased the learning rate, added L2 regularization, and added padding to sentences the way Kim did, all of which have helped bring the accuracy closer to the 0.75 level that Kim's code reaches. I have found the choice of optimizer does make a difference, but unfortunately tensorflow does not support Adadelta, which Kim used, so I can't do a true apples-to-apples comparison. Working on getting the rest of the way to the 0.78 accuracy that Kim's code shows out-of-the-box... stay tuned.
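
(If a TensorFlow build with Adadelta ever becomes available, a rough drop-in for train.py might look like the line below; rho=0.95 and epsilon=1e-6 mirror Kim's code, and learning_rate=1.0 recovers the classic, parameter-free Adadelta update. Untested sketch.)

# hypothetical swap-in, assuming tf.train.AdadeltaOptimizer exists in your build
optimizer = tf.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95, epsilon=1e-6)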

dennybritz commented on July 28, 2024

Can you also plot the train set? That may give some insight into what's going on.

j314erre commented on July 28, 2024

Here are accuracy and loss plots for both models.

Basically running Kim's code like this so it does not use pretrained word2vec:

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand

and your code like this to replicate Kim's hard-wired params:

python train.py --embedding_dim 300 --num_filters 100 --batch_size 50 --num_epochs 25 --l2_reg_lambda 0.15

...attempting to run the same-sized model on the same data, except for the different optimizer noted above... also, Kim's model does not seem to include L2 regularization...

Looks like Kim's model is overfitting, but it is getting 15% higher accuracy on the dev set. I might be missing something obvious here.

[images: training/dev accuracy and loss plots for both models]

dennybritz commented on July 28, 2024

Wow, thanks for getting to the bottom of this.

I'm surprised that the 0-initialization works better. Initializing biases to 0 is pretty standard, but initializing W to 0 seems strange to me. Maybe another kind of initialization works just as well, or better.
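
For example, something like Xavier initialization in text_cnn.py might behave similarly (untested sketch; it scales the random weights by the layer's fan-in and fan-out, which also keeps the initial logits small):

            # untested alternative: Xavier/Glorot initialization for the output layer
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())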

What's the intuition behind the extra padding?

j314erre commented on July 28, 2024

I was also surprised that these changes accounted for as much of the difference as they did, and I don't really know why, except that they were in Y. Kim's implementation and I was just trying to understand CNNs for text classification in tensorflow vs. theano on an equivalent model. I suspect these changes are specific to this dataset (which is small and noisy) and not indicative of best practice on other datasets, and maybe not even on this one! But here are some thoughts...

On weight initialization: I noticed that the training minibatch cross-entropy loss started out quite high, above ~5.0, and took half of the epochs to get below 1.0, whereas on a binary classification problem you'd expect a random coin-flip classifier to start out at -ln(0.5) ≈ 0.69 and go down from there (which I found Y. Kim's code did), as the loss graphs above show. I traced the discrepancy to the initialization of W and b in the last layer, where the initial random values created a very high loss at the starting gate, and the optimizer took a long time to climb down to a minimum. Also, with L2 regularization, the initial L2 loss might have had too much effect in the early stages. Note that Y. Kim used an Adadelta optimizer not available in tensorflow, so we're comparing to Adam... it looks like Kim's code converges faster but is clearly overfitting, whereas your original model behaves well over time, and it might be that running it over many more epochs would actually give a more robust and accurate model.
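
A quick way to check that coin-flip baseline (plain numpy, just illustrative):

import numpy as np

# with W = b = 0 the logits are all zero, the softmax is uniform, and the
# initial cross-entropy is -ln(1/num_classes) regardless of the true label
logits = np.zeros(2)                           # two classes
probs = np.exp(logits) / np.exp(logits).sum()
print(-np.log(probs[0]))                       # 0.6931... = -ln(0.5)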

On zero padding: I'm guessing that putting padding at the beginning of the sentences allows the model to discriminate words appearing near the beginning of sentences... it was already padding the ends. This might allow the model to pick out some patterns involving the global placement of words in a sentence. The other idea is that it means more convolution windows per sequence, which might just be a good thing?
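
One way to see the first intuition (illustrative only, not from either repo): count how many width-5 convolution windows cover the first real word of a sentence, with and without front padding.

filter_width = 5
for front_pad in (0, 5):
    first_word = front_pad        # index of the first real word after padding
    starts = range(first_word - filter_width + 1, first_word + 1)
    print(front_pad, sum(1 for s in starts if s >= 0))
# prints: "0 1" (one window) and "5 5" (five windows)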

Anyway thanks again for this solid implementation of a CNN model for text classification in tensorflow.

dennybritz commented on July 28, 2024

I see, that makes sense on a high level. For the padding, maybe a wide convolution would get around that and perform just as well.
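
Something like this in text_cnn.py could do that (untested sketch): zero-pad only the sequence axis of the embedded input and keep the existing "VALID" convolution, so filters can overhang the sentence boundaries without explicit <PAD/> tokens. Note the max-pool window would need to grow to match the longer conv output.

            # untested: widen the convolution by padding the sequence axis only
            pad = filter_size - 1
            padded = tf.pad(self.embedded_chars_expanded,
                            [[0, 0], [pad, pad], [0, 0], [0, 0]])  # [batch, seq, emb, 1]
            conv = tf.nn.conv2d(
                padded,
                W,
                strides=[1, 1, 1, 1],
                padding="VALID",
                name="conv")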

I guess this shows how important preprocessing and initialization are :(

chaitjo commented on July 28, 2024

Hey @j314erre! Thanks for this. 👍

I was wondering how to implement the pad_sentences part. At what point in train.py do I apply the function to all my sentences? I'm guessing before it learns the vocabulary.

Also, regarding padding_word, how do I implement it if I am using the pre-trained 300-dimensional Google embeddings? I don't think there would be a padding symbol in those word vectors. Should I hard-code a condition to make it a zero vector of the same dimension as the word vectors?

j314erre commented on July 28, 2024

My suggestions above relate to an older version of the code...

The easiest thing is to look at the repository at "Commits on Apr 2, 2016" for train.py, before it got refactored to use VocabularyProcessor. The idea is to swap in my example pad_sentences instead of the one in data_helpers.py; then all the padding and vocab building will take care of itself in that repository snapshot (or you can at least see how the code worked back then and re-implement something similar off of the latest version).
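
Roughly, the flow in that snapshot looked like this (sketch from memory; names may differ slightly in the actual commit):

# in train.py: padding happens before the vocabulary is built
sentences, labels = data_helpers.load_data_and_labels()
sentences_padded = pad_sentences(sentences)          # the version shown above
vocabulary, vocabulary_inv = data_helpers.build_vocab(sentences_padded)
x, y = data_helpers.build_input_data(sentences_padded, labels, vocabulary)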

For pre-trained word vectors, I would use those to initialize your embedding, but have your embedding continue to learn & adapt word vectors from your training data. Therefore any words that are important in your data set but that didn't happen to be in the pre-trained vocabulary will be taken into account by the embedding. The pad symbol is just one of those words not in the pre-trained set.

For clarity (a code sketch follows this list):

  • build your vocab and embedding (same size as the pre-trained vectors, e.g. 300) from your data set
  • initialize the vocab words that exist in the pre-trained set with the pre-trained vectors
  • initialize the vocab words that don't exist in the pre-trained set with random vectors
  • the padding symbol is just another word not in the pre-trained set
  • train your model with all word vectors as trainable parameters
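
A sketch of how that could look (hypothetical helper; gensim is assumed for reading the binary word2vec file, and the uniform(-0.25, 0.25) fallback mirrors Kim's add_unknown_words):

import numpy as np
from gensim.models import KeyedVectors  # assumption: gensim is available

# hypothetical helper: words found in the pre-trained set get their word2vec
# vector; everything else (including "<PAD/>") gets a small random vector
def build_init_embeddings(vocabulary, w2v_path, dim=300):
    w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
    init = np.random.uniform(-0.25, 0.25, (len(vocabulary), dim)).astype(np.float32)
    for word, idx in vocabulary.items():
        if word in w2v:
            init[idx] = w2v[word]
    return init

# then seed the embedding in text_cnn.py and leave it trainable (the default):
# self.W = tf.Variable(build_init_embeddings(vocabulary, w2v_path), name="W")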

Arpitha1996 commented on July 28, 2024

Please help me run this code, i.e., give the procedure for running the code on Ubuntu 14.04.
