
cnn-text-classification-tf's Introduction

This code belongs to the "Implementing a CNN for Text Classification in Tensorflow" blog post.

It is a slightly simplified implementation of Kim's Convolutional Neural Networks for Sentence Classification paper in Tensorflow.

Requirements

  • Python 3
  • Tensorflow > 0.12
  • Numpy

Training

Print parameters:

./train.py --help
optional arguments:
  -h, --help            show this help message and exit
  --embedding_dim EMBEDDING_DIM
                        Dimensionality of character embedding (default: 128)
  --filter_sizes FILTER_SIZES
                        Comma-separated filter sizes (default: '3,4,5')
  --num_filters NUM_FILTERS
                        Number of filters per filter size (default: 128)
  --l2_reg_lambda L2_REG_LAMBDA
                        L2 regularization lambda (default: 0.0)
  --dropout_keep_prob DROPOUT_KEEP_PROB
                        Dropout keep probability (default: 0.5)
  --batch_size BATCH_SIZE
                        Batch Size (default: 64)
  --num_epochs NUM_EPOCHS
                        Number of training epochs (default: 100)
  --evaluate_every EVALUATE_EVERY
                        Evaluate model on dev set after this many steps
                        (default: 100)
  --checkpoint_every CHECKPOINT_EVERY
                        Save model after this many steps (default: 100)
  --allow_soft_placement ALLOW_SOFT_PLACEMENT
                        Allow soft device placement
  --noallow_soft_placement
  --log_device_placement LOG_DEVICE_PLACEMENT
                        Log placement of ops on devices
  --nolog_device_placement

Train:

./train.py

Evaluating

./eval.py --eval_train --checkpoint_dir="./runs/1459637919/checkpoints/"

Replace the checkpoint dir with the output from the training. To use your own data, change the eval.py script to load your data.
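
A minimal sketch of what loading your own evaluation data might look like, assuming a hypothetical two-column CSV (text, integer label); the resulting texts would still need to be passed through the same vocabulary transform that was saved during training before being fed to input_x:

import csv
import numpy as np

def load_my_data(path="./data/my_eval.csv"):
    """Read (text, label) rows from a two-column CSV; the label is an integer class id."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for text, label in csv.reader(f):
            texts.append(text.strip())
            labels.append(int(label))
    return texts, np.array(labels)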

cnn-text-classification-tf's People

Contributors

anarkia7115, billy-inn, dennybritz, joshkyh, likejazz, mdymczyk, notwhaleb, payscalenatew, roshantanisha, suyashlakhotia


cnn-text-classification-tf's Issues

Testing

Hey, I have successfully trained the model but am unable to test it.
I have tried changing the files in data_helpers.py (changed the rt-polarity files), but I am still not getting
any new results. I also tried printing all_predictions in eval.py.
Kindly help.

Value error during zipping of training batch

Hi,

I am trying to run train.py with my own dataset for a multi-class classification problem. I get a ValueError during the generation of the test batch. What could be the issue? It works fine on the provided +ve/-ve dataset.
[screenshot of the error attached, 2016-11-17]

Change the max pooling size

In text_cnn.py line 49, I want to change ksize=[1, sequence_length - filter_size + 1, 1, 1] to ksize=[1, 4, 1, 1]

I could not figure out how I should change the following lines to make it work. Could someone help me with this, please? Thanks for your time.

                # Maxpooling over the outputs
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, 4, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")
                pooled_outputs.append(pooled)

        # Combine all the pooled features
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(3, pooled_outputs)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        # Add dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Final (unnormalized) scores and predictions
        with tf.name_scope("output"):
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")
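
A sketch of one way the downstream code might be adjusted, assuming a fixed pooling window of 4 and the TF 0.x tf.concat argument order used in the snippet above (the example shapes are stand-ins): with a fixed ksize each pooled map is no longer 1x1, so each branch has to be flattened separately and the flattened widths summed.

import tensorflow as tf

sequence_length, num_filters = 56, 128
filter_sizes = [3, 4, 5]

pooled_flat_outputs = []
for filter_size in filter_sizes:
    # h stands in for this branch's conv+relu output:
    # shape [batch, sequence_length - filter_size + 1, 1, num_filters]
    h = tf.placeholder(tf.float32,
                       [None, sequence_length - filter_size + 1, 1, num_filters])
    pooled = tf.nn.max_pool(h, ksize=[1, 4, 1, 1], strides=[1, 1, 1, 1],
                            padding='VALID', name="pool")
    pooled_height = (sequence_length - filter_size + 1) - 4 + 1
    pooled_flat_outputs.append(tf.reshape(pooled, [-1, pooled_height * num_filters]))

# The flattened widths differ per filter size, so sum them instead of multiplying.
num_filters_total = sum(
    ((sequence_length - fs + 1) - 4 + 1) * num_filters for fs in filter_sizes)
h_pool_flat = tf.concat(1, pooled_flat_outputs)  # tf.concat(pooled_flat_outputs, 1) in TF >= 1.0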

Including word2vec?

Hi Denny,

I wanted to ask how I can go about integrating word2vec embeddings into this model instead of learning them from scratch. As mentioned in Yoon Kim's paper, word2vec embeddings perform better than those learned from scratch, and I wanted to take advantage of that.
Can you please help me?
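
A minimal sketch of one common approach, not part of the repo, assuming the TextCNN's embedding variable is exposed as cnn.W and that vocabulary (word -> index) and word2vec (word -> vector) lookups are available:

import numpy as np

def build_initial_embedding(vocabulary, word2vec, embedding_dim=300, seed=10):
    """vocabulary: dict word -> index; word2vec: dict-like word -> np.ndarray of shape (embedding_dim,)."""
    rng = np.random.RandomState(seed)
    # Random vectors for out-of-vocabulary words, pre-trained vectors where available.
    matrix = rng.uniform(-0.25, 0.25, (len(vocabulary), embedding_dim)).astype(np.float32)
    for word, idx in vocabulary.items():
        if word in word2vec:
            matrix[idx] = word2vec[word]
    return matrix

# In train.py, after the variables have been initialized (names assumed):
#     sess.run(cnn.W.assign(build_initial_embedding(vocabulary, w2v, FLAGS.embedding_dim)))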

Issues with the Python 3 version

I installed TensorFlow under Python 3.4 and tried to run the cnn-text-classification-tf code, but I got an error at the data-input stage (x_train, y_train). How can I make this code compatible with Python 3.4?

Transfer Learning?

Hi,

Is there any way I can use this for transfer learning? I mean, if I train the model on the RT data, can I then use the trained model to perform text classification on a different dataset?

PS: Thanks for this awesome repo and the blog ! :)

num_epochs ignored?

I see num_epochs defaults to 200, but when I run it locally (./train.py) it runs indefinitely.

For example, I stopped this run after step 11000.

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_EVERY=100
DROPOUT_KEEP_PROB=0.5
EMBEDDING_DIM=128
EVALUATE_EVERY=100
FILTER_SIZES=3,4,5
L2_REG_LAMBDA=0.0
LOG_DEVICE_PLACEMENT=False
NUM_EPOCHS=200
NUM_FILTERS=128

...

Evaluation:
2016-01-08T19:28:26.785109: step 11000, loss 0.613605, acc 0.715

Saved model checkpoint to <my_path>/cnn-text-classification-tf/runs/1452309199/checkpoints/model-11000

What to do if a text has many sentences

I just read your code, but I have a question and hope you can help me.
The TextCNN only has a single sentence_length; if a text has many sentences, what should I do to average over all of them?
And with many sentences, how should I choose batches during training?

thank you!

Purpose of line 31 in train.py

At first I was a bit confused about the purpose of this line, namely
FLAGS.batch_size
Later I guessed it is there to make the parameter printing work, because when I removed it the parameters were not printed. Anyway, I found that calling FLAGS._parse_flags() also works and is probably clearer. Just thought you would like to know.

Generating error for other language dataset

Please check: it generates the following error for a dataset in another language.

File "train.py", line 158, in
x_batch, y_batch = zip(*batch)
ValueError: need more than 0 values to unpack

Process killed by the system

Hello, I've tested your code on my own data of 20,000 examples, and the result is quite good. I have another dataset of 300,000 examples. Each example is a short sentence of approximately 20 words. When tested on this new dataset, the first 100 steps are fine; however, it stops at the evaluation step. The dev set has more than 60,000 examples, and the message when it stops is "Killed".
I guess the reason is that the dev set is very large, so it consumes a lot of memory. Is that right? And how can I fix that?
Thank you very much.
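
A sketch of the usual workaround, not part of the repo (feed names are taken from the train.py shown later on this page): evaluate the dev set in chunks instead of one giant feed, so a 60,000-example dev set never has to fit into a single batch.

def dev_step_batched(x_dev, y_dev, sess, cnn, batch_size=256):
    """Evaluate the dev set in chunks and report size-weighted averages."""
    total_loss, correct, n = 0.0, 0.0, len(x_dev)
    for start in range(0, n, batch_size):
        x_batch = x_dev[start:start + batch_size]
        y_batch = y_dev[start:start + batch_size]
        loss, acc = sess.run(
            [cnn.loss, cnn.accuracy],
            {cnn.input_x: x_batch, cnn.input_y: y_batch, cnn.dropout_keep_prob: 1.0})
        total_loss += loss * len(x_batch)
        correct += acc * len(x_batch)
    print("dev loss {:g}, dev acc {:g}".format(total_loss / n, correct / n))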

cannot import name learn

Traceback (most recent call last):
File "./train.py", line 10, in
from tensorflow.contrib import learn
ImportError: cannot import name learn

Academic - CNN text vs Deep Convolution Generative Adversarial Net (DCGAN) - exercise for the reader

[Street Fighter II arcade screenshot]

I took a look here and wondered if anyone wanted to have a crack at applying the CNN to deep convolutional generative adversarial networks? mqtlam/dcgan-tfslim#3
(Python is not my main language, and I am still getting my head around the science behind it.)

Why bother? The finished product could replicate an author's style of prose / writing.
https://bamos.github.io/2016/08/09/deep-completion/

E.g. input a Stephen King novel and have the GAN spit out sentences you couldn't tell were not his.

I have a Docker container for TensorFlow with gRPC to cherry-pick from.
https://github.com/johndpope/DockerParseyMcParsefaceAPI/

Use sentences without padding

I am trying to remove the padding from every sentence.
I modified your code:

def load_data():
    sentences, labels = load_data_and_labels()
    vocabulary, vocabulary_inv = build_vocab(sentences)
    x, y = build_input_data(sentences, labels, vocabulary)
    return [x, y, vocabulary, vocabulary_inv]

But I have a problem:

File "train.py", line 68, in <module>
    sequence_length=x_train.shape[1],
IndexError: tuple index out of range

I don't understand why the code fails, because the size of the matrix is the same.
Do you have any suggestions for solving my problem?

How to get the classification result?

Code:
self.predictions = tf.argmax(self.scores, 1, name="predictions")

But I want to get the classification result, such as [1, 0, 1, 1, ...]. How do I get it?
Thank you!
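
A short sketch (names such as sess, cnn and x_batch are assumed to come from the surrounding training/eval code): the predictions op already produces integer class ids, so running it returns exactly an array of labels like the one described.

predicted_labels = sess.run(
    cnn.predictions,
    {cnn.input_x: x_batch, cnn.dropout_keep_prob: 1.0})
print(predicted_labels)  # e.g. array([1, 0, 1, 1, ...])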

Stop-words removal

The pre-processing script does not include any stop-word removal. Is this because, with neural nets, we assume the model will take care of the saliency of each word feature?

Maybe a better way for prediction.

In the code segment below:

self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
self.predictions = tf.argmax(self.scores, 1, name="predictions")

the predictions operation does not automatically bypass the dropout.

Maybe a better way for prediction:

self.predictions = tf.argmax(tf.nn.xw_plus_b(self.h_pool_flat, W, b), 1, name="predictions")

This way, we do not need to feed dropout_keep_prob=1.0 every time we make a prediction, and the loss function is not affected.

What is needed to upgrade this code to a word2vec version

As stated in @dennybritz's blog, this code can also be used together with pre-trained word2vec embeddings.

I am thinking about how to make the modification for that.

The obvious steps that I came across are:

  1. Instead of padding and indexing the vocabulary, use fixed-size w2v vectors to represent the sentence.
  2. Remove the embedding layer from the TextCNN class.

Is that all? Did I misunderstand something?

Thx

Re-training the model with more data

Hey @dennybritz, I have a trained model, but more data has become available. Do I need to retrain the model on the whole dataset (previous + new), or is there a way to continue training the pre-trained model on just the new data (say initially I had 400 questions and now I have 5 more)?

Extremely large vocabulary

I recently ran into a problem: training becomes much slower when the vocabulary gets extremely large. There is a warning from TensorFlow saying "Converting sparse IndexedSlices to a dense Tensor with 145017088 elements. This may consume a large amount of memory."

I guess Tensorflow is using a dense gradient update on the embedding matrix. Does anyone have any ideas on that?

Thanks

Scrambling batches when flattening feature vector

After each iteration of the while loop you end up with a tensor of shape [batch_size, 1, 1, num_filters] that you append to a list.
So you end up with a list, of size len(filter_sizes), of tensors of shape [batch_size, 1, 1, num_filters], basically pooled_outputs will have shape [len(filter_sizes), batch_size, 1, 1, num_filters].
After tf.concat(3, pooled_outputs) you will have shape [len(filter_sizes), batch_size, 1, num_filters].

From my understanding, the first dimension is len(filter_sizes) (the different filter sizes) and the second is the batch dimension.

If you apply tf.reshape(self.h_pool, [-1, num_filters_total]) to this, you will get a feature vector of shape [batch_size, num_filters_total]

But won't you scramble the different batches? Instead of creating each "new" row (the batch dimension in the final feature vector) by flattening one "old" example, each "new" row would be built from more than one "old" example and a single filter size, because you are flattening across each filter_size.

Am I missing something?
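
A quick toy-sized shape check of the concat-then-reshape step in question (TF 0.x argument order, matching the snippet cited above): pooled_outputs is a plain Python list, so tf.concat(3, ...) joins along the filter dimension, the batch axis stays first, and no examples get mixed.

import numpy as np
import tensorflow as tf

batch_size, num_filters, n_branches = 2, 4, 3
# Each branch's pooled output: shape [batch_size, 1, 1, num_filters], filled with the branch index.
pooled_outputs = [tf.constant(np.full((batch_size, 1, 1, num_filters), i, np.float32))
                  for i in range(n_branches)]
h_pool = tf.concat(3, pooled_outputs)            # tf.concat(pooled_outputs, 3) in TF >= 1.0
h_pool_flat = tf.reshape(h_pool, [-1, num_filters * n_branches])

with tf.Session() as sess:
    print(sess.run(tf.shape(h_pool)))      # [2 1 1 12] -- batch axis is still first
    print(sess.run(h_pool_flat))           # each row holds one example's features from all branches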

Multi-label support + automatically applied vocabulary, label list, max sentence length and unseen-word handling

First, I'm sorry.
I have poor English.

Thanks to Yoon Kim and dennybritz.

I edited your script to add multi-label support.
The script supports multi-label data in the following format:

text1....\t label1
text2....\t label2
text3....\t label1
text4....\t label3

no vocabulary difference problem,
no label set difference problem,
no max sentence length difference problem,
no unseen word problem.

Simply use:

python train.py --train_data_path="./data/train.txt"
python eval.py --checkpoint_dir="./runs/1463968251/checkpoints/" --test_data_path="./data/test.txt"

Here I attached my script:

data_helpers.py

import codecs
import os.path
import numpy as np
import re
import itertools
from collections import Counter

PAD_MARK = "<PAD/>"
UNK_MARK = "<UNK/>"

def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
#    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)  # blocked to allow non-english char-set
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()


def load_data_and_labels( train_data_path ):
    """
    Loads MR polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Load data from files
    data = list()
    labels = list()
    for line in codecs.open( train_data_path, 'r', encoding='utf8' ).readlines() :
        if 1 > len( line.strip() ) : continue;
        t = line.split(u"\t");
        if 2 != len(t) :
            print "data format error" + line
            continue;
        data.append(t[0])
        labels.append(t[1])
    data   = [s.strip() for s in data]
    labels = [s.strip() for s in labels]
    # Split by words
    x_text = [clean_str(sent) for sent in data]
    x_text = [s.split(u" ") for s in x_text]
    return [x_text, labels]


def pad_sentences(sentences, max_sent_len_path):
    """
    Pads all sentences to the same length. The length is defined by the longest sentence.
    Returns padded sentences.
    """
    max_sequence_length = 0
    # Load base max sent length
    if len(max_sent_len_path) > 0 :
        max_sequence_length = int( open( max_sent_len_path, 'r' ).readlines()[0] )
    else : 
        max_sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        if max_sequence_length <= len(sentence) :
            padded_sentences.append(sentence[:max_sequence_length])
            continue
        num_padding = max_sequence_length - len(sentence)
        new_sentence = sentence + [PAD_MARK] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences, max_sequence_length


def build_vocab(sentences, base_vocab_path):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    vocabulary_inv = []
    # Load base vocabulary
    if len(base_vocab_path) > 0 :
        vL = [ [w.strip()] for w in codecs.open( base_vocab_path, 'r', encoding='utf8' ).readlines() ]
        c = Counter(itertools.chain(*vL))
        vocabulary_inv = [x[0] for x in c.most_common()]
    else :
        # Build vocabulary
        word_counts = Counter(itertools.chain(*sentences))
        # Mapping from index to word
        vocabulary_inv = vocabulary_inv + [x[0] for x in word_counts.most_common()] 
        if not UNK_MARK in vocabulary_inv :
            vocabulary_inv.append(UNK_MARK)
    vocabulary_inv = list(set(vocabulary_inv))
    vocabulary_inv.sort()
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    if not UNK_MARK in vocabulary :
        vocabulary[UNK_MARK] = vocabulary[PAD_MARK]

    return [vocabulary, vocabulary_inv]


def make_onehot(idx, size) :
    onehot = []
    for i in range(size) :
        if idx==i : onehot.append(1);
        else      : onehot.append(0);
    return onehot
# end def

def make_label_dic(labels) :
    """
    creator: [email protected]
    create date: 2016.05.22
    make 'label : one hot' dic
    """
    label_onehot = dict()
    onehot_label = dict()
    for i, label in enumerate(labels) :
        onehot =  make_onehot(i,len(labels))
        label_onehot[label] = onehot
        onehot_label[str(onehot)] = label
    return label_onehot, onehot_label
# end def

def build_onehot(labels, base_label_path):
    """
    Builds a vocabulary mapping from label to onehot based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    uniq_labels = []
    # Load base vocabulary
    if len(base_label_path) > 0 :
        vL = [ [w.strip()] for w in codecs.open( base_label_path, 'r', encoding='utf8' ).readlines() ]
        c = Counter(itertools.chain(*vL))
        uniq_labels = [x[0] for x in c.most_common()]
    else :
        # Build vocabulary
        label_counts = Counter(labels)
        # Mapping from index to word
        uniq_labels = uniq_labels + [x[0] for x in label_counts.most_common()]
    uniq_labels = list(set(uniq_labels))
    uniq_labels.sort()
    label_onehot, onehot_label = make_label_dic( uniq_labels )
    return [uniq_labels, label_onehot, onehot_label]


def build_input_data(sentences, vocabulary, labels, label_onehot):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    vL = []
    for sentence in sentences :
        wL = []
        for word in sentence :
            if word in vocabulary :
                wL.append( vocabulary[word] )
            else :
                wL.append( vocabulary[UNK_MARK] )
        vL.append(wL)
    x = np.array(vL)
    y = np.array([ label_onehot[label] for label in labels ])
    return [x, y]


def load_data( train_data_path, checkpoint_dir="" ):
    """
    Loads and preprocesses data for the MR dataset.
    Returns input vectors, labels, vocabulary, and inverse vocabulary.
    """
    # Load and preprocess data
    max_sent_len_path = "" if len(checkpoint_dir)<1 else checkpoint_dir+"/max_sent_len" 
    vocab_path        = "" if len(checkpoint_dir)<1 else checkpoint_dir+"/vocab" 
    label_path        = "" if len(checkpoint_dir)<1 else checkpoint_dir+"/label" 
    sentences, labels          = load_data_and_labels( train_data_path )
    sentences_padded, max_sequence_length = pad_sentences(sentences, max_sent_len_path)
    vocabulary, vocabulary_inv = build_vocab(sentences_padded, vocab_path)
    uniq_labels, label_onehot, onehot_label = build_onehot(labels, label_path) 
    x, y = build_input_data(sentences_padded, vocabulary, labels, label_onehot)
    return [x, y, vocabulary, vocabulary_inv, onehot_label, max_sequence_length]


def batch_iter(data, batch_size, num_epochs, shuffle=True):
    """
    Generates a batch iterator for a dataset.
    """
    data = np.array(data)
    data_size = len(data)
    num_batches_per_epoch = int(len(data)/batch_size) + 1
    for epoch in range(num_epochs):
        # Shuffle the data at each epoch
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data = data[shuffle_indices]
        else:
            shuffled_data = data
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]

train.py

#! /usr/bin/env python

import codecs
import tensorflow as tf
import numpy as np
import os
import time
import datetime
import data_helpers
from text_cnn import TextCNN

# Parameters
# ==================================================

# Model Hyperparameters
tf.flags.DEFINE_string("train_data_path", "./data/train.txt", "Data path to training")
tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")
tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")
tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularizaion lambda (default: 0.0)")

# Training parameters
tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")
tf.flags.DEFINE_integer("num_epochs", 200, "Number of training epochs (default: 200)")
tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")
tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")
# Misc Parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")

FLAGS = tf.flags.FLAGS
FLAGS._parse_flags()
print("\nParameters:")
for attr, value in sorted(FLAGS.__flags.items()):
    print("{}={}".format(attr.upper(), value))
print("")


# Data Preparation
# ==================================================

# Load data
print("Loading data...")
x, y, vocabulary, vocabulary_inv, onehot_label, max_sequence_length = data_helpers.load_data( FLAGS.train_data_path )
# Randomly shuffle data
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
# Split train/test set
# TODO: This is very crude, should use cross-validation
x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:]
y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:]
print("Labels: %d: %s" % ( len(onehot_label), ','.join( onehot_label.values() ) ) )
print("Vocabulary Size: {:d}".format(len(vocabulary)))
print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))


# Training
# ==================================================

with tf.Graph().as_default():
    session_conf = tf.ConfigProto(
      allow_soft_placement=FLAGS.allow_soft_placement,
      log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        cnn = TextCNN(
            sequence_length=x_train.shape[1],
            num_classes=len(onehot_label),
            vocab_size=len(vocabulary),
            embedding_size=FLAGS.embedding_dim,
            filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
            num_filters=FLAGS.num_filters,
            l2_reg_lambda=FLAGS.l2_reg_lambda)

        # Define Training procedure
        global_step = tf.Variable(0, name="global_step", trainable=False)
        optimizer = tf.train.AdamOptimizer(1e-3)
        grads_and_vars = optimizer.compute_gradients(cnn.loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

        # Keep track of gradient values and sparsity (optional)
        grad_summaries = []
        for g, v in grads_and_vars:
            if g is not None:
                grad_hist_summary = tf.histogram_summary("{}/grad/hist".format(v.name), g)
                sparsity_summary = tf.scalar_summary("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.merge_summary(grad_summaries)

        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))

        # Summaries for loss and accuracy
        loss_summary = tf.scalar_summary("loss", cnn.loss)
        acc_summary = tf.scalar_summary("accuracy", cnn.accuracy)

        # Train Summaries
        train_summary_op = tf.merge_summary([loss_summary, acc_summary, grad_summaries_merged])
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph_def)

        # Dev summaries
        dev_summary_op = tf.merge_summary([loss_summary, acc_summary])
        dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
        dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph_def)

        # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
        checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
        checkpoint_prefix = os.path.join(checkpoint_dir, "model")
        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)

        # Save additional model info
        codecs.open( os.path.join(checkpoint_dir, "max_sent_len"), "w", encoding='utf8').write( str(max_sequence_length) )
        codecs.open( os.path.join(checkpoint_dir, "vocab"),        "w", encoding='utf8').write( '\n'.join(vocabulary_inv) )
        codecs.open( os.path.join(checkpoint_dir, "label"),        "w", encoding='utf8').write( '\n'.join(onehot_label.values()) )

        saver = tf.train.Saver(tf.all_variables())

        # Initialize all variables
        sess.run(tf.initialize_all_variables())

        def train_step(x_batch, y_batch):
            """
            A single training step
            """
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
            }
            _, step, summaries, loss, accuracy = sess.run(
                [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                feed_dict)
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            train_summary_writer.add_summary(summaries, step)

        def dev_step(x_batch, y_batch, writer=None):
            """
            Evaluates model on a dev set
            """
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: 1.0
            }
            step, summaries, loss, accuracy = sess.run(
                [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
                feed_dict)
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            if writer:
                writer.add_summary(summaries, step)

        # Generate batches
        batches = data_helpers.batch_iter(
            list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
        # Training loop. For each batch...
        for batch in batches:
            x_batch, y_batch = zip(*batch)
            train_step(x_batch, y_batch)
            current_step = tf.train.global_step(sess, global_step)
            if current_step % FLAGS.evaluate_every == 0:
                print("\nEvaluation:")
                dev_step(x_dev, y_dev, writer=dev_summary_writer)
                print("")
            if current_step % FLAGS.checkpoint_every == 0:
                path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                print("Saved model checkpoint to {}\n".format(path))

eval.py

#! /usr/bin/env python

import tensorflow as tf
import numpy as np
import os
import time
import datetime
import data_helpers
from text_cnn import TextCNN

# Parameters
# ==================================================

# Eval Parameters
tf.flags.DEFINE_string("test_data_path", "./data/test.txt", "Data path to evaluation")
tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")
tf.flags.DEFINE_string("checkpoint_dir", "", "Checkpoint directory from training run")

# Misc Parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")


FLAGS = tf.flags.FLAGS
FLAGS._parse_flags()
print("\nParameters:")
for attr, value in sorted(FLAGS.__flags.items()):
    print("{}={}".format(attr.upper(), value))
print("")

# Load data. Load your own data here
print("Loading data...")
x_test, y_test, vocabulary, vocabulary_inv, onehot_label, max_sequence_length = data_helpers.load_data( FLAGS.test_data_path, FLAGS.checkpoint_dir )
y_test = np.argmax(y_test, axis=1)
print("Labels: %d: %s" % ( len(onehot_label), ','.join( sorted(onehot_label.values()) ) ) )
print("Vocabulary size: {:d}".format(len(vocabulary)))
print("Test set size {:d}".format(len(y_test)))

print("\nEvaluating...\n")

# Evaluation
# ==================================================
checkpoint_file = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
graph = tf.Graph()
with graph.as_default():
    session_conf = tf.ConfigProto(
      allow_soft_placement=FLAGS.allow_soft_placement,
      log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # Load the saved meta graph and restore variables
        print "FLAGS.checkpoint_dir %s" % FLAGS.checkpoint_dir
        print "checkpoint_file %s" % checkpoint_file
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)

        # Get the placeholders from the graph by name
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        # input_y = graph.get_operation_by_name("input_y").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]

        # Tensors we want to evaluate
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]

        # Generate batches for one epoch
        batches = data_helpers.batch_iter(x_test, FLAGS.batch_size, 1, shuffle=False)

        # Collect the predictions here
        all_predictions = []

        for x_test_batch in batches:
            batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])

# Print accuracy
print "y_test: " + str(y_test)
print "all_predictions: " + str(all_predictions)
correct_predictions = float(sum(all_predictions == y_test))
print("Total number of test examples: {}".format(len(y_test)))
print("Accuracy: {:g}".format(correct_predictions/float(len(y_test))))

How to stop the pre-trained embeddings from further learning and adapting to my dataset?

Dear all,
I use pre-trained embeddings in the CNN model, but I found that they keep learning and adapting to my dataset during training. I just want to test the quality of my pre-trained embeddings, so my question is: how do I stop the pre-trained embeddings from further learning and adapting to my dataset?

Please shed some light on me.
Thank you.
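
A minimal sketch of the usual fix, mirroring the embedding block in this repo's text_cnn.py but with trainable=False added so the optimizer never updates the vectors; the pre-trained matrix would then be assigned to W after variable initialization.

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    self.W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        trainable=False,   # frozen: gradients are never applied to the embedding matrix
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)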

Getting error on train.py line #16

Hi,

Thanks for providing this script. While trying to run the train.py script, I am getting the error below.

ArgumentError Traceback (most recent call last)
in ()
14
15 # Model Hyperparameters
---> 16 tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")
17 tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
18 tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")

/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/flags.pyc in DEFINE_integer(flag_name, default_value, docstring)
83 docstring: A helpful message explaining the use of the flag.
84 """
---> 85 _define_helper(flag_name, default_value, docstring, int)
86
87

/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/flags.pyc in _define_helper(flag_name, default_value, docstring, flagtype)
57 default=default_value,
58 help=docstring,
---> 59 type=flagtype)
60
61

I am using Tensorflow version 0.11.0rc2 and python version 2.7.6. Please let me know how to solve this issue.

Many Thanks, Thirumalai M

AttributeError: type object 'NewBase' has no attribute 'is_abstract' using 0.8.0

Using tensorflow 0.8.0 (installed via pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp34-cp34m-linux_x86_64.whl) I have the following error:

python3 train.py 
Traceback (most recent call last):
  File "train.py", line 3, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/__init__.py", line 94, in <module>
    from tensorflow.python.platform import test
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/platform/test.py", line 62, in <module>
    from tensorflow.python.framework import test_util
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/test_util.py", line 41, in <module>
    from tensorflow.python.platform import googletest
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/platform/googletest.py", line 32, in <module>
    from tensorflow.python.platform import benchmark  # pylint: disable=unused-import
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/platform/benchmark.py", line 112, in <module>
    class Benchmark(six.with_metaclass(_BenchmarkRegistrar, object)):
  File "/usr/lib/python3/dist-packages/six.py", line 617, in with_metaclass
    return meta("NewBase", bases, {})
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/platform/benchmark.py", line 107, in __new__
    if not newclass.is_abstract():
AttributeError: type object 'NewBase' has no attribute 'is_abstract'

module object has no attribute 'contrib'

I am facing a simple(?) problem; I don't know whether it is a common thing.

I put my own data into the directory and started training,

but an error occurred with the message "AttributeError: 'module' object has no attribute 'contrib'" on line 69 of text_cnn.py => initializer=tf.contrib.layers.xavier_initializer())

Do I need to install any other package, or am I doing something wrong?

Error with python 3

Can it be run in a Jupyter notebook with just a Python 2 kernel installed?
I am getting errors such as: ArgumentError: argument --embeding_dim: conflicting option string(s): --embeding_dim

tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")
14 tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
15 tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")

/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_flags.pyc in DEFINE_integer(flag_name, default_value, docstring)
82 docstring: A helpful message explaining the use of the flag.
83 """
---> 84 _define_helper(flag_name, default_value, docstring, int)
85
86

Kindly help.

Add export of the model to be reused?

It would be great to be able to reuse the model in Cloud ML; to achieve this it should be exported along the lines of: saver.save(sess, os.path.join(FLAGS.model_dir, 'export'))

Add early stopping

How should I go about implementing early stopping to avoid overfitting? It looks like eval.py takes the latest model and not the best one. I'm studying this, but I'm not sure where it should go.
thanks and keep up the excellent work!
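
A sketch of one way early stopping could be bolted onto the training loop shown elsewhere on this page (variable names such as batches, train_step, cnn, saver and checkpoint_prefix are taken from that loop and assumed): keep the best dev accuracy, checkpoint only on improvement, and stop after a fixed number of evaluations without improvement.

best_dev_acc = 0.0
evals_without_improvement = 0
patience = 10   # stop after this many evaluations with no dev improvement

for batch in batches:
    x_batch, y_batch = zip(*batch)
    train_step(x_batch, y_batch)
    current_step = tf.train.global_step(sess, global_step)
    if current_step % FLAGS.evaluate_every == 0:
        dev_acc = sess.run(cnn.accuracy, {cnn.input_x: x_dev, cnn.input_y: y_dev,
                                          cnn.dropout_keep_prob: 1.0})
        if dev_acc > best_dev_acc:
            best_dev_acc = dev_acc
            evals_without_improvement = 0
            saver.save(sess, checkpoint_prefix, global_step=current_step)  # best model so far
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                print("No dev improvement in {} evaluations, stopping.".format(patience))
                break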

sess.run returns an empty array

Hello!
I've been using this code to test out a few different datasets. It works with your example rt-polaritydata, but now I'm trying to use a different train/test split.
I trained on one set of data, and I'm testing on a subset pulled from the shuffled positives/negatives. When I test it I get this:

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_DIR=./runs/1465995324/checkpoints/
LOG_DEVICE_PLACEMENT=False

Loading data...
Vocabulary size: 1082
Test set size 2000

Evaluating...

Number of runs: 32
Traceback (most recent call last):
File "./eval.py", line 81, in <module>
    correct_predictions = float(sum(all_predictions == y_test))
TypeError: 'bool' object is not iterable

I narrowed the issue down to this line:
batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
where sess.run returns nothing, so batch_predictions is an empty array, which leads to the bool issue.

If you have any ideas as to what happened and how to fix this, it would be much appreciated. Thank you in advance!!

AttributeError: 'module' object has no attribute 'layers'

Hello,
I get this error when I try to train with the existing data on AWS. Any ideas?

Traceback (most recent call last):
File "./train.py", line 73, in <module>
    l2_reg_lambda=FLAGS.l2_reg_lambda)
File "/home/ubuntu/username/cnn-text-classification-tf/text_cnn.py", line 69, in __init__
    initializer=tf.contrib.layers.xavier_initializer())
AttributeError: 'module' object has no attribute 'layers'

I have no problems with TF installation, the examples run just fine.
Thanks!

tf.train.import_meta_graph

Hi, @dennybritz
When I use tf.train.import_meta_graph to evaluate the trained model on new samples, loading the graph is very slow. Do you know why? Is there another method to load the trained model? The official tutorial uses tf.train.Saver(), but it is very complicated: you need to redefine the variables and load them again. Using import_meta_graph is very convenient, but it is so slow. Do you have any suggestions?
Thanks

Using pretrained word embeddings

@dennybritz

Hi,
First of all, many thanks for sharing your code. I am trying to use pre-trained word embeddings instead of word embeddings randomly initialized from the vocabulary size.

My pre-trained word embedding is a numpy array of shape (N, 300), dtype=float32, where N is the index of the word whose (300,)-dimensional embedding is stored.

However, I am unable to get past the step
self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x) because TensorFlow only allows int32/int64 as lookup indices.

Can you suggest how I can resolve this issue?

Many thanks
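
A small sketch of the intended wiring (the pretrained array below is a random stand-in for your (N, 300) matrix): input_x keeps integer word indices, and the pre-trained float matrix becomes the value of the embedding variable W rather than the lookup input.

import numpy as np
import tensorflow as tf

pretrained = np.random.rand(10000, 300).astype(np.float32)        # stand-in for your (N, 300) matrix

input_x = tf.placeholder(tf.int32, [None, 56], name="input_x")    # integer word indices
W = tf.Variable(pretrained, name="W")                             # add trainable=False to freeze it
embedded_chars = tf.nn.embedding_lookup(W, input_x)               # shape [batch, 56, 300]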

run it on SGE grid?

Hi Denny, when I tried train.py, it seems to run on a single machine.
Is it possible to configure TensorFlow so that it runs on a grid, say an SGE grid?
Please advise. Thanks!

'module' object has no attribute 'global_variables'

After I run python train.py, there is an error that I cannot solve. Does anyone know how?

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_EVERY=100
DEV_SAMPLE_PERCENTAGE=0.1
DROPOUT_KEEP_PROB=0.5
EMBEDDING_DIM=128
EVALUATE_EVERY=100
FILTER_SIZES=3,4,5
L2_REG_LAMBDA=0.0
LOG_DEVICE_PLACEMENT=False
NEGATIVE_DATA_FILE=./data/rt-polaritydata/rt-polarity.neg
NUM_EPOCHS=200
NUM_FILTERS=128
POSITIVE_DATA_FILE=./data/rt-polaritydata/rt-polarity.pos

Loading data...
Vocabulary Size: 18758
Train/Dev split: 9596/1066
Writing to /Users/zomi/Downloads/cnn-text-classification-tf-master/runs/1481718705

Traceback (most recent call last):
  File "./train.py", line 129, in <module>
    saver = tf.train.Saver(tf.global_variables())
AttributeError: 'module' object has no attribute 'global_variables'
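
This error usually means the installed TensorFlow predates 0.12, where tf.global_variables() was introduced; upgrading to match the README's requirement is the simple fix. A version-tolerant sketch (the dummy variable is only there so the Saver has something to save):

import tensorflow as tf

dummy = tf.Variable(0, name="dummy")   # stand-in variable for the example

variables = tf.global_variables() if hasattr(tf, "global_variables") else tf.all_variables()
saver = tf.train.Saver(variables)
init_op = (tf.global_variables_initializer() if hasattr(tf, "global_variables_initializer")
           else tf.initialize_all_variables())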

How to test the trained model?

I trained on the movie review training set using this code and got the trained files under the path "runs/1458022294/summaries/train". How can I test the model? Is there any API in Python to test it?

result of eval and continue learning

I have these files in ./runs/1466699046/checkpoints:

checkpoint 
model-29800  
model-29800.meta  
model-29900  
model-29900.meta  
model-30000  
model-30000.meta  
model-30100  
model-30100.meta  
model-30200  
model-30200.meta

and get the following result from eval.py:

$ ./eval.py --checkpoint_dir="./runs/1466699046/checkpoints/"

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_DIR=./runs/1466699046/checkpoints/
LOG_DEVICE_PLACEMENT=False


Evaluating...

Total number of test examples: 2
Accuracy: 0

What does this mean?

multi_class input

I see that in your code, [0, 1] represents the label pos and [1, 0] represents the label neg.
But what if I have multi-class input, say 20 labels? How can I represent these labels?
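
A tiny sketch of the standard answer (the example label ids are made up): extend the two-element vectors to one-hot rows of length 20, and num_classes then follows from y.shape[1].

import numpy as np

num_classes = 20
label_ids = np.array([0, 3, 19, 7])               # example integer class ids
y = np.eye(num_classes, dtype=np.float32)[label_ids]
print(y.shape)                                    # (4, 20)
print(y[1])                                       # one-hot row with a 1 at index 3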

eval.py execution

I trained the latest code with the default parameters, and when executing eval.py with the proper directory name I am getting this output:

Parameters:
ALLOW_SOFT_PLACEMENT=True
BATCH_SIZE=64
CHECKPOINT_DIR=./runs/1466501212/checkpoints/
LOG_DEVICE_PLACEMENT=False

Evaluating...

Total number of test examples: 2
Accuracy: 0

Why am I getting an accuracy of 0?

Is there a way for me to adjust the number of classifications?

Currently, the code is fixed to the number of classes provided by the existing model. For example, the model has 8 classes, but I want to increase or decrease the number of classes. Is there a way to make that work?

Cannot feed value of shape

Hi, I am trying to test-drive the code on Linux and it runs fine,
but when I evaluate on a sample of only 390 lines, it shows this error:

ValueError: Cannot feed value of shape (390, 47) for Tensor 'input_x:0', which has shape '(?, 56)'

Should I modify the call below?
pred_y = sess.run(predictions, {input_x: x_test, dropout_keep_prob: 1.0})

Any help would be appreciated
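
A sketch of the usual fix, assuming eval.py follows the repo's pattern of restoring the tf.contrib.learn VocabularyProcessor that train.py saves next to the checkpoints (FLAGS and x_raw are names assumed from that script): transform the evaluation text with the training-time processor, so sentences are padded/truncated to the length input_x expects (56 here) rather than the 47 derived from the small eval sample.

import os
import numpy as np
from tensorflow.contrib import learn

# Path as saved during training (assumed layout: runs/<ts>/vocab next to runs/<ts>/checkpoints).
vocab_path = os.path.join(FLAGS.checkpoint_dir, "..", "vocab")
vocab_processor = learn.preprocessing.VocabularyProcessor.restore(vocab_path)
x_test = np.array(list(vocab_processor.transform(x_raw)))   # now shape (390, 56) rather than (390, 47)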

Failure when trying to run inference on an encoded sentence

I am trying to run inference on the trained model with raw_input (stdin). The input is encoded and padded the same way as the training data,
but it fails.
The error message is below:
tensorflow.python.framework.errors.InvalidArgumentError: input must be 4-dimensional [100,128,1] (I fixed the sentence length at 100 and the embedding at 128)
Bulk inference works, though.
How can I fix it?
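
A guess at the likely cause, sketched with assumed names (encoded_sentence, sess, predictions, input_x, dropout_keep_prob): a single sentence fed on its own is missing the batch dimension, so the conv input ends up 3-D instead of 4-D; wrapping the encoded sentence in a length-1 batch restores the shape input_x expects.

import numpy as np

encoded = np.array(encoded_sentence, dtype=np.int64)   # assumed: shape (100,) of word indices
x_single = encoded[np.newaxis, :]                      # shape (1, 100) -- a batch of one

# prediction = sess.run(predictions, {input_x: x_single, dropout_keep_prob: 1.0})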

accuracy vs CNN_sentence

Thank you for this helpful reference implementation in tensorflow.

I'm running your code out-of-the-box but with hyperparameters chosen to match Yoon Kim's theano implementation https://github.com/yoonkim/CNN_sentence

I'm finding that accuracy is much lower on the same data set over the same number of mini-batches / epochs (not using pre-trained word2vec).

I am going to look at model topology, learning rate, dropout, L2, optimizer, loss function, etc. to get to the bottom of this and make sure it is an apples-to-apples comparison, but if you know where to focus efforts any help is appreciated.

[screenshot attached, 2016-03-22]

Can I apply this model to documents?

I have a collection of documents with a multi-class classification problem. I am trying out a recent CNN model (char-CNN from NYU), but it is giving very low accuracy.

Can I use this model on my dataset (short documents of 4-5 sentences on average)? I am thinking of flattening each document into one sentence and training this model on top of that.

Since this model is for sentences, that is why I am asking.

Set the number of training steps

Hello, I've tested my own data with your code and the result is great. However, it takes too much time to finish. Sometimes the total step count is 30,200, and sometimes it's more than 50,000. Why isn't it consistent? And is there an upper limit?
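
A back-of-the-envelope sketch of where the step count comes from (the example numbers are illustrative): with the batch_iter shown earlier on this page, the total number of steps is num_epochs * (int(len(train_data) / batch_size) + 1), so it scales with the size of your training split rather than being fixed.

num_epochs, batch_size = 200, 64
num_train_examples = 9662                 # e.g. the MR data minus a 1000-example dev split
total_steps = num_epochs * (num_train_examples // batch_size + 1)
print(total_steps)                        # 30200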
