
text_classification's Issues

The last output of the Bi-RNN in TextRNN

self.output_rnn_last=tf.reduce_mean(output_rnn,axis=1) #[batch_size,hidden_size*2] #output_rnn_last=output_rnn[:,-1,:] ##[batch_size,hidden_size*2] #TODO

In the implementation, the final output of the Bi-RNN is computed as the mean over all time steps. Compared with output_rnn_last=output_rnn[:,-1,:], how do these two strategies differ in their impact on the final classification results?
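
For reference, a minimal sketch of the two pooling strategies side by side (TF 1.x style; shapes are illustrative, not taken from the repo):

    import tensorflow as tf

    output_rnn = tf.placeholder(tf.float32, [None, 100, 256])  # hypothetical [batch, seq_len, hidden_size*2]

    # Strategy 1: mean over all time steps -- every position contributes equally,
    # including padded positions if sentences are padded.
    output_rnn_mean = tf.reduce_mean(output_rnn, axis=1)  # [batch_size, hidden_size*2]

    # Strategy 2: only the last time step -- the forward half has seen the whole sequence,
    # but the backward half has only seen the final token, and with end-padding the last
    # step may be a PAD position.
    output_rnn_last = output_rnn[:, -1, :]  # [batch_size, hidden_size*2]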

attentive attention of hierarchical attention network

It seems that the way you implement the attention mechanism differs from the original paper; could you explain your ideas in more detail?

Sorry, but after reading your HAN_model.py code, it feels incomplete: textRNN.accuracy, textRNN.predictions, and textRNN.W_projection are missing, and textRNN.input_y is never defined. Also, the way the attention weights are computed seems to differ from the original paper: the paper appears to apply a softmax and then multiply the weights with the hidden states and sum them up (a sketch of that formulation is included below).
Could you briefly explain the idea behind your write-up? I'm a bit lost. At the word level, why is the input arranged as the first sentence of every document, then the second sentence of every document, looping like that? And what does the final Loss mean?

[image attached]
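
For comparison, here is a hedged sketch of the word-level attention as formulated in the original HAN paper (Yang et al., 2016): a softmax over per-word scores followed by a weighted sum of the hidden states. This is not the repo's implementation; names and shapes are illustrative.

    import tensorflow as tf

    def word_attention(hidden_states, hidden_size, attention_size):
        # hidden_states: [batch*num_sentences, sentence_length, hidden_size*2]
        W = tf.get_variable("W_attention", [hidden_size * 2, attention_size])
        b = tf.get_variable("b_attention", [attention_size])
        u_w = tf.get_variable("u_w_context", [attention_size])      # word-level context vector

        u = tf.tanh(tf.tensordot(hidden_states, W, axes=1) + b)     # u_it = tanh(W h_it + b)
        scores = tf.tensordot(u, u_w, axes=1)                       # [batch*num_sentences, sentence_length]
        alphas = tf.nn.softmax(scores)                              # softmax over the words
        sentence_vector = tf.reduce_sum(
            hidden_states * tf.expand_dims(alphas, -1), axis=1)     # weighted sum of the hidden states
        return sentence_vector, alphas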

Hello, every time after I finish running the CNN training and then run predict, a problem occurs:

File "/home/wt/桌面/class/text_classification-master/a02_TextCNN/other_experiement/data_util_zhihu.py", line 28, in create_voabulary
model=word2vec.load(word2vec_model_path,kind='bin')
File "/usr/local/lib/python3.5/dist-packages/word2vec/io.py", line 18, in load
return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/word2vec/wordvectors.py", line 185, in from_binary
with open(fname, 'rb') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'zhihu-word2vec.bin-100'
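
Note that the predict script here looks for 'zhihu-word2vec.bin-100', while the training scripts elsewhere use 'zhihu-word2vec-title-desc.bin-100'. A hedged pre-flight check (path is illustrative) before create_voabulary() is called can make such a mismatch obvious:

    import os

    # point this at the embedding file you actually downloaded
    word2vec_model_path = "zhihu-word2vec.bin-100"
    if not os.path.isfile(word2vec_model_path):
        raise FileNotFoundError("word2vec binary not found: %s" % word2vec_model_path)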

No such file or directory error

I tried to run p7_TextCNN_train.py in a02_TextCNN. I downloaded zhihu-word2vec-title-desc.bin-100 and put it in the same place as p7_TextCNN_train.py, but then this error occurs: [Errno 2] No such file or directory: 'cache_vocabulary_label_pik/cnn2_word_voabulary.pik'
I've noticed that someone else hit almost the same problem, but I don't understand: since cnn2_word_voabulary.pik does not exist, how can the program use it? I can't figure out how to debug this. Should I change something in the program?
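
A likely cause is that the cache directory itself does not exist yet, so open() cannot create the .pik file inside it. A hedged workaround is to create the folder before the first run (or add the equivalent inside create_voabulary); the .pik file is then written automatically:

    import os

    # create the cache folder the scripts expect; create_voabulary() fills it on the first run
    os.makedirs("cache_vocabulary_label_pik", exist_ok=True)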

sample data is missing

If not reported elsewhere: when you run "python -u p7_TextCNN_train.py", a sample data file is missing:

FileNotFoundError: [Errno 2] No such file or directory: '../data/train_label_single100_merge.txt', as noted in the code

By the way, what format is the sample data expected to be in? I am speaking of normal text that needs to be classified. How long should each line in the sample data be, and what is it allowed or not allowed to contain? And how is it pre-processed for classification?
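
Judging from the logs posted in other issues (space-separated tokens followed by a single label id per line), a hedged sketch of preparing such a file from raw text might look like the following; check load_data() in data_util_zhihu.py for the exact separator it expects, since the "__label__" marker and file name here are only assumptions:

    samples = [
        ("I like machine learning and natural language processing", "tech"),
        ("the match ended two to one after extra time", "sports"),
    ]

    with open("my_train_data.txt", "w", encoding="utf-8") as f:
        for text, label in samples:
            tokens = text.lower().split()  # replace with your own tokenizer / pre-processing
            f.write(" ".join(tokens) + " __label__" + label + "\n")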

License?

I'm guessing that this code is intended to be open source, otherwise it wouldn't be here, but is there a particular license that you'd like to choose for your work?

fastText cannot find p4_zhihu_load_data module

ModuleNotFoundError Traceback (most recent call last)
in ()
8 import numpy as np
9 from p5_fastTextB_model import fastTextB as fastText
---> 10 from p4_zhihu_load_data import load_data,create_voabulary,create_voabulary_label
11 from tflearn.data_utils import to_categorical, pad_sequences
12 import os

ModuleNotFoundError: No module named 'p4_zhihu_load_data'

sess.run() blocks

Hello! I am new to TensorFlow, and when I run your TextCNN model I hit an issue: sess.run() blocks.
I only get the print output before the line "curr_loss,curr_acc,_=sess.run([textCNN.loss_val,textCNN.accuracy,textCNN.train_op],feed_dict=feed_dict)", and then the program blocks. I have already made sure the input data exists, and I can't figure it out.
I hope you can give me an answer; thanks for your patience.

where is the _word_voabulary.pik file

When I run a01_FastText/p5_fastTextB_train.py, I get the error "can not find p4_zhihu_load_data library".

Then I modified the code to:
"
sys.path.append('../aa1_data_util')
from data_util_zhihu import load_data,create_voabulary,create_voabulary_label
"

but I get:

cache_path: cache_vocabulary_label_pik/_word_voabulary.pik file_exists: False
create vocabulary. word2vec_model_path: zhihu-word2vec-title-desc.bin-100
Traceback (most recent call last):
File "p5_fastTextB_train.py", line 159, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "p5_fastTextB_train.py", line 42, in main
vocabulary_word2index, vocabulary_index2word = create_voabulary()
File "../aa1_data_util/data_util_zhihu.py", line 40, in create_voabulary
with open(cache_path, 'a') as data_f:
FileNotFoundError: [Errno 2] No such file or directory: 'cache_vocabulary_label_pik/_word_voabulary.pik'

So where is the _word_voabulary.pik file, and how can I get it?

Data file not found

hdf5 is not supported on this machine (please install/reinstall h5py for optimal experience)
('cache_path:', 'cache_vocabulary_label_pik/_word_voabulary.pik', 'file_exists:', False)
('create vocabulary. word2vec_model_path:', 'zhihu-word2vec-title-desc.bin-100')
Traceback (most recent call last):
  File "p5_fastTextB_train.py", line 161, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "p5_fastTextB_train.py", line 44, in main
    vocabulary_word2index, vocabulary_index2word = create_voabulary()
  File "../aa1_data_util/data_util_zhihu.py", line 26, in create_voabulary
    model=word2vec.load(word2vec_model_path,kind='bin')
  File "/usr/lib64/python2.7/site-packages/word2vec/io.py", line 18, in load
    return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/word2vec/wordvectors.py", line 154, in from_binary
    with open(fname, 'rb') as fin:
IOError: [Errno 2] No such file or directory: 'zhihu-word2vec-title-desc.bin-100'

please help.

reload() was moved to importlib in Python 3

As discussed in #6, these four lines (or something similar) need to be added to the following files for them to work in Python 3; a hedged sketch of one such compatibility shim follows the file list.

./a8_predict.py:5:1: F821 undefined name 'reload'
./a8_train.py:5:1: F821 undefined name 'reload'
./a01_FastText/p5_fastTextB_predict.py:5:1: F821 undefined name 'reload'
./a01_FastText/p5_fastTextB_predict_multilabel.py:5:1: F821 undefined name 'reload'
./a01_FastText/p5_fastTextB_train.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/p7_TextCNN_predict.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/p7_TextCNN_train.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_predict_exp.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_predict_exp512.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_predict_exp512_0609.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_predict_exp512_simple.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_train_exp.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_train_exp512.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p7_TextCNN_train_exp_512_0609.py:5:1: F821 undefined name 'reload'
./a02_TextCNN/other_experiement/p8_TextCNN_predict_exp.py:5:1: F821 undefined name 'reload'
./a03_TextRNN/p8_TextRNN_predict.py:5:1: F821 undefined name 'reload'
./a03_TextRNN/p8_TextRNN_train.py:5:1: F821 undefined name 'reload'
./a04_TextRCNN/p71_TextRCNN_predict.py:5:1: F821 undefined name 'reload'
./a04_TextRCNN/p71_TextRCNN_train.py:5:1: F821 undefined name 'reload'
./a05_HierarchicalAttentionNetwork/p1_HierarchicalAttention_predict.py:5:1: F821 undefined name 'reload'
./a05_HierarchicalAttentionNetwork/p1_HierarchicalAttention_train.py:5:1: F821 undefined name 'reload'
./a06_Seq2seqWithAttention/a1_seq2seq_attention_predict.py:5:1: F821 undefined name 'reload'
./a06_Seq2seqWithAttention/a1_seq2seq_attention_train.py:5:1: F821 undefined name 'reload'
./a07_Transformer/a2_predict.py:5:1: F821 undefined name 'reload'
./a07_Transformer/a2_train.py:5:1: F821 undefined name 'reload'
./a08_EntityNetwork/a3_predict.py:5:1: F821 undefined name 'reload'
./a08_EntityNetwork/a3_train.py:5:1: F821 undefined name 'reload'
./aa1_data_util/2_predict_zhihu_get_question_representation.py:3:1: F821 undefined name 'reload'
./aa1_data_util/3_process_zhihu_question_topic_relation.py:3:1: F821 undefined name 'reload'
./aa4_TextCNN_with_RCNN/p72_TextCNN_with_RCNN_train.py:5:1: F821 undefined name 'reload'
./aa5_BiLstmTextRelation/p9_BiLstmTextRelation_train.py:5:1: F821 undefined name 'reload'
./aa6_TwoCNNTextRelation/p9_twoCNNTextRelation_train.py:5:1: F821 undefined name 'reload'
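
A hedged sketch of such a compatibility shim, assuming the scripts only use reload() for the common reload(sys); sys.setdefaultencoding('utf8') idiom (setdefaultencoding no longer exists in Python 3, where strings are unicode by default):

    import sys

    try:                      # Python 2: reload() is a builtin and setdefaultencoding exists
        reload(sys)
        sys.setdefaultencoding('utf8')
    except NameError:         # Python 3: nothing to do
        pass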

About Average

2.average vectors, to get representation of the sentence

self.sentence_embeddings = tf.reduce_mean(sentence_embeddings, axis=1) # [None,self.embed_size]

When training, we pad all sentences to the max sentence length, so tf.reduce_mean(sentence_embeddings, axis=1) computes sum/(max sentence length). How can it know the actual length of each sentence?

Thanks!

missing word2vec package

I am wondering whether the word2vec package is a user-defined package; it does not seem to be included in the project.


conda install word2vec

does not support multi-label classification

Thank you for sharing these wonderful codes!

Just one issue: it seems that in a03 TextRNN the prediction part (self.predictions) does not support multi-label classification, and there is a lot to change if I want to adapt it to a multi-label task. The inference part is also mostly about single-label classification.

Best wishes,
Hang
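
For anyone adapting it, a hedged sketch of the usual multi-label changes (not the repo's code): a multi-hot target, sigmoid cross-entropy instead of sparse softmax, and thresholded or top-k predictions.

    import tensorflow as tf

    num_classes = 20  # illustrative
    logits = tf.placeholder(tf.float32, [None, num_classes])              # model outputs
    input_y_multilabel = tf.placeholder(tf.float32, [None, num_classes])  # multi-hot targets

    # sigmoid cross-entropy treats each class independently, unlike softmax
    losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=input_y_multilabel, logits=logits)
    loss = tf.reduce_mean(tf.reduce_sum(losses, axis=1))

    # predictions: threshold the per-class probabilities, or take the top-k classes
    probabilities = tf.nn.sigmoid(logits)
    predictions_threshold = tf.cast(tf.greater(probabilities, 0.5), tf.int32)
    predictions_topk = tf.nn.top_k(probabilities, k=5).indices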

error while running p8_TextRNN_train.py from a03_TextRNN

@brightmart
Dear Mr.brightmart,
Hi,

While training a03_TextRNN with the GoogleNews word2vec binary and a text file containing my documents + labels, I got these errors:

How can I solve this issue?

cache_path: cache_vocabulary_label_pik/rnn_word_voabulary.pik file_exists: False
create vocabulary. word2vec_model_path: GoogleNews-vectors-negative300.bin
rnn_model.vocab_size: 3000001
create_voabulary_label_sorted.started.traning_data_path: train-zhihu4-only-title-all.txt
length of list_label: 146
label: 8476641588870267502 count_value: 3
label: 3738968195649774859 count_value: 3
label: -3517637179126242000 count_value: 3
label: 810067918938531886 count_value: 2
label: 7476760589625268543 count_value: 2
label: 4313812860434517324 count_value: 2
label: 1462130073299421617 count_value: 2
label: -8377411942628634656 count_value: 2
label: -7046289575185911002 count_value: 2
label: -6259864339809244567 count_value: 2
count top10: 23
create_voabulary_label_sorted.ended.len of vocabulary_label: 146
load_data.started...
load_data_multilabel_new.training_data_path: train-zhihu4-only-title-all.txt
0 x0: w18476 w4454 w1674 w6 w25 w474 w1333 w1467 w863 w6 w4430 w11 w813 w4463 w863 w6 w4430 w111
0 x1: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ys_index:
0 y: -6270130442784051389 ;ys_mulithot_list: 107
1 x1: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ys_index:
1 y: 1945786109636206690 ;ys_mulithot_list: 70
ys_index:
2 y: 7792886053889220161 ;ys_mulithot_list: 22
ys_index:
3 y: 465065448523711562 ;ys_mulithot_list: 49
number_examples: 164
load_data.ended...
start padding & transform to one hot...
trainX[0]: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
end padding & transform to one hot...

Traceback (most recent call last):
File "p8_TextRNN_train.py", line 184, in
tf.app.run()
File "/home/eslami/anaconda3/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "p8_TextRNN_train.py", line 68, in main
vocab_size, FLAGS.embed_size, FLAGS.is_training)
File "/home/eslami/Downloads/all-kind-text_classification-master/a03_TextRNN/p8_TextRNN_model.py", line 33, in init
self.instantiate_weights()
File "/home/eslami/Downloads/all-kind-text_classification-master/a03_TextRNN/p8_TextRNN_model.py", line 45, in instantiate_weights
self.Embedding = tf.get_variable("Embedding",shape=[self.vocab_size, self.embed_size],initializer=self.initializer) #[vocab_size,embed_size] tf.random_uniform([self.vocab_size, self.embed_size],-1.0,1.0)
File "/home/eslami/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/home/eslami/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/home/eslami/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/home/eslami/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
use_resource=use_resource)
File "/home/eslami/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 664, in _get_single_variable
name, "".join(traceback.format_list(tb))))
ValueError: Variable Embedding already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

File "/home/eslami/Downloads/all-kind-text_classification-master/a03_TextRNN/p8_TextRNN_model.py", line 45, in instantiate_weights
self.Embedding = tf.get_variable("Embedding",shape=[self.vocab_size, self.embed_size],initializer=self.initializer) #[vocab_size,embed_size] tf.random_uniform([self.vocab_size, self.embed_size],-1.0,1.0)
File "/home/eslami/Downloads/all-kind-text_classification-master/a03_TextRNN/p8_TextRNN_model.py", line 33, in init
self.instantiate_weights()
File "/home/eslami/Downloads/all-kind-text_classification-master/a03_TextRNN/p8_TextRNN_model.py", line 123, in test
textRNN=TextRNN(num_classes, learning_rate, batch_size, decay_steps, decay_rate,sequence_length,vocab_size,embed_size,is_training)

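
The "Originally defined at" part of the traceback shows that the Embedding variable was first created by the test() helper in p8_TextRNN_model.py, and then a second time when the training script builds its own TextRNN in the same default graph. Two hedged fixes: make sure the test() call in the model file is guarded with if __name__ == "__main__": so it does not run on import, or reset the graph before constructing the model, e.g.:

    import tensorflow as tf
    from p8_TextRNN_model import TextRNN  # assumes the script is run from a03_TextRNN/

    # start from a fresh graph so variables created by an earlier model construction
    # (e.g. the test() helper) do not collide with the new ones
    tf.reset_default_graph()
    # positional arguments as in the traceback; the values are only illustrative
    textRNN = TextRNN(146, 0.01, 128, 1000, 0.9, 100, 3000001, 300, True)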

pickle.UnpicklingError: could not find MARK

Hi @brightmart
When I run the fastText training script on a Windows or CentOS machine, I get the error "pickle.UnpicklingError: could not find MARK". It has puzzled me for a few days; please help me.

D:\Anaconda3\python.exe D:/text_classification/a01_FastText/p6_fastTextB_train_multilabel.py
started...
ended...
D:\Anaconda3\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
curses is not supported on this machine (please install/reinstall curses for an optimal experience)
cache_path: cache_vocabulary_label_pik/_word_voabulary.pik file_exists: False
create vocabulary. word2vec_model_path: ../data/zhihu-word2vec-title-desc.bin-100
Traceback (most recent call last):
File "D:/text_classification/a01_FastText/p6_fastTextB_train_multilabel.py", line 192, in
tf.app.run()
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "D:/text_classification/a01_FastText/p6_fastTextB_train_multilabel.py", line 38, in main
vocabulary_word2index, vocabulary_index2word = create_voabulary()
File "D:\text_classification\aa1_data_util\data_util_zhihu.py", line 31, in create_voabulary
model = Word2Vec.load(word2vec_model_path)
File "D:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 1483, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\gensim\utils.py", line 282, in load
obj = unpickle(fname)
File "D:\Anaconda3\lib\site-packages\gensim\utils.py", line 935, in unpickle
return _pickle.load(f, encoding='latin1')
_pickle.UnpicklingError: could not find MARK
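
The traceback suggests data_util_zhihu.py was changed to call gensim's Word2Vec.load(), which expects a pickled gensim model, while zhihu-word2vec-title-desc.bin-100 appears to be a plain C-format word2vec binary; that mismatch produces exactly this unpickling error. The repo's original code loads it with the word2vec package's word2vec.load(path, kind='bin'); a hedged gensim alternative is to read it as a binary word2vec file:

    from gensim.models import KeyedVectors  # gensim >= 1.0 API

    word2vec_model_path = "../data/zhihu-word2vec-title-desc.bin-100"
    # load_word2vec_format reads the C binary format directly; no unpickling involved
    model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=True)
    print("loaded %d words" % len(model.vocab))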

FileNotFoundError

No such file or directory: '..test-zhihu-forpredict-title-desc-v6.txt'

Is it possible to implement the Hierarchical Attention Network with real, parsed sentences?

Thank you a lot for your sharing.

I find that in your implementation of the Hierarchical Attention Network (HAN), documents are split into sentences by using a fixed, equal sentence length, which is not the true sentence boundary in the data.

I wonder whether it would be easy to change this to use a sentence parser to find the real sentences? How different would the performance be?

Please kindly let me know if you have any idea on parsing the real sentences based on your HAN code. Many thanks!
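
It should be feasible: the graph only requires a fixed [num_sentences, sentence_length] input grid, so the change is in pre-processing rather than in the model. A hedged sketch of that step (function and parameter names are hypothetical):

    import re

    def doc_to_grid(doc, word2index, num_sentences=10, sentence_length=30, pad_id=0):
        """Split a document into real sentences and pad to a fixed [num_sentences, sentence_length] grid."""
        sentences = re.split(r'(?<=[.!?])\s+', doc.strip())  # crude splitter; nltk.sent_tokenize also works
        grid = []
        for sent in sentences[:num_sentences]:
            ids = [word2index.get(w, pad_id) for w in sent.lower().split()][:sentence_length]
            ids += [pad_id] * (sentence_length - len(ids))
            grid.append(ids)
        while len(grid) < num_sentences:                      # pad missing sentences
            grid.append([pad_id] * sentence_length)
        return grid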

Getting an error while running fasttext_train

@brightmart
Hi Mr.Brightmart
I want to run fastText, but in the first step, while running training, I got the error below:
[screenshot of the error attached]

Actually I have 2 datasets (one for economic_news and one for lifestyle; 14,000 documents in total); in both of them the labels are defined at the end of each document.
I attached them:
train-zhihu4-only-title-all.txt
I also made some changes to these 3 programs:

data_util_zhihu.txt
p5_fastTextB_model.txt
p5_fastTextB_train.txt

How to import word2vec?

The interpreter reports "no module named word2vec" when I run fastText. Sorry, I realize it is a Python package.

ValueError: Variable Embedding already exists

Traceback (most recent call last):
File "p5_fastTextB_train.py", line 163, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "p5_fastTextB_train.py", line 76, in main
fast_text=fastText(FLAGS.label_size, FLAGS.learning_rate, FLAGS.batch_size, FLAGS.decay_steps, FLAGS.decay_rate,FLAGS.num_sampled,FLAGS.sentence_len,vocab_size,FLAGS.embed_size,FLAGS.is_training)
File "/home/defy/text_classification-master/a01_FastText/p5_fastTextB_model.py", line 29, in init
self.instantiate_weights()
File "/home/defy/text_classification-master/a01_FastText/p5_fastTextB_model.py", line 42, in instantiate_weights
self.Embedding = tf.get_variable("Embedding", [self.vocab_size, self.embed_size])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1049, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 948, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 356, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
use_resource=use_resource)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 653, in _get_single_variable
name, "".join(traceback.format_list(tb))))
ValueError: Variable Embedding already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

File "/home/defy/text_classification-master/a01_FastText/p5_fastTextB_model.py", line 42, in instantiate_weights
self.Embedding = tf.get_variable("Embedding", [self.vocab_size, self.embed_size])
File "/home/defy/text_classification-master/a01_FastText/p5_fastTextB_model.py", line 29, in init
self.instantiate_weights()
File "/home/defy/text_classification-master/a01_FastText/p5_fastTextB_model.py", line 104, in test
fastText=fastTextB(num_classes, learning_rate, batch_size, decay_steps, decay_rate,5,sequence_length,vocab_size,embed_size,is_training)

Python 2.7, TensorFlow 1.1, running p5_fastTextB_model, but I get this error. @brightmart

TextRNN model details

Hello.

Is there any chance of getting references to papers (or any other documents) describing the TextRNN model?

Thanks in advance.

Training and validation accuracy not changing during training

Multiple models (a02_TextCNN, a04_TextRCNN) have training and validation accuracy fixed at 0.5, while the training and validation loss drops steadily over time (dramatically, from values with 8-9 digits down to 1-2 digits).

Is this normal, or is something wrong?

One of the models, aa6_TwoCNNTextRelation, has a training accuracy that fluctuates (above 0.5).

Thx.

Errors when debugging

As a noob, I find it hard to debug your programs. There are too many problems...
Dear author, would you please provide even one complete piece of code that can run directly?

Besides, I would really appreciate it if you could explain how to make my own files similar to 'zhihu-word2vec-title-desc.bin-100'.
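
On the second point, a hedged sketch of training your own 100-dimension embedding and saving it in the same binary format that data_util_zhihu.py loads with kind='bin' (gensim 3.x API; file names are illustrative):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # corpus.txt: one whitespace-tokenized document (or sentence) per line
    model = Word2Vec(LineSentence("corpus.txt"), size=100, window=5, min_count=5, workers=4)
    model.wv.save_word2vec_format("my-word2vec-title-desc.bin-100", binary=True)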

Data format

Hello, I'd like to ask how to convert my own English corpus into the same format as your data.

ModuleNotFoundError: No module named 'a02_TextCNN'

Issue when trying to run the p7_TextCNN_predict.py after training

ModuleNotFoundError: No module named 'a02_TextCNN'

Question: can you get it running with a clean Anaconda and Python 3.6, plus the additional installs such as word2vec? I think some code changes are needed in the files.
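
The error usually means the predict script uses a package-style import (from a02_TextCNN...) while being launched from inside the a02_TextCNN directory, so the repository root is not on sys.path. Two hedged options: run it from the repository root (for example, python -m a02_TextCNN.p7_TextCNN_predict), or prepend the root to sys.path at the top of the script:

    import os
    import sys

    # make the repository root importable before the "from a02_TextCNN..." import
    sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))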

import word2vec

I tried to run the sample, but I found that the line "import word2vec" in data_util_zhihu.py doesn't work. I want to know where I can download word2vec. Thank you very much.

Do we need a mask tensor for averaging

1.get emebedding of words in the sentence

    sentence_embeddings = tf.nn.embedding_lookup(self.Embedding,self.sentence)  #  [None,self.sentence_len,self.embed_size]

2.average vectors, to get representation of the sentence

    self.sentence_embeddings = tf.reduce_mean(sentence_embeddings, axis=1)  # [None,self.embed_size]

Since the length of the sentences is variable, do we need a mask tensor for getting the average?
For example, tf.reduce_sum(tf.multiply(sentence_embeddings,mask), axis=1) / tf.reduce_sum(mask, axis=1)
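
Yes, a mask like that gives the true per-sentence average and keeps PAD embeddings from leaking into the representation. A hedged, self-contained sketch, assuming PAD id 0 and illustrative shapes:

    import tensorflow as tf

    sentence = tf.placeholder(tf.int32, [None, 100])         # padded word ids; 0 = PAD (assumption)
    Embedding = tf.get_variable("Embedding", [40000, 128])   # [vocab_size, embed_size], illustrative

    sentence_embeddings = tf.nn.embedding_lookup(Embedding, sentence)  # [None, sentence_len, embed_size]
    mask = tf.cast(tf.not_equal(sentence, 0), tf.float32)              # 1.0 for real tokens, 0.0 for PAD
    mask = tf.expand_dims(mask, -1)                                    # [None, sentence_len, 1]
    summed = tf.reduce_sum(sentence_embeddings * mask, axis=1)         # sum of real-token embeddings
    lengths = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)             # [None, 1], avoid division by zero
    sentence_mean = summed / lengths                                   # masked average, [None, embed_size]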

sample data, pre-trained word embedding

I'm getting this issue when I run training on a08 entity network and a06 seq2seq models.

Can I get or train this file?

zhihu-word2vec-title-desc.bin-100

[screenshot of the error attached]

Also, do you have sample datasets compatible with these models?

hyperparameter in textRNN

def loss(self, l2_lambda=0.0001):
    with tf.name_scope("loss"):
        # input: `logits` and `labels` must have the same shape `[batch_size, num_classes]`
        # output: a 1-D `Tensor` of length `batch_size`, of the same type as `logits`,
        # holding the softmax cross-entropy loss.
        losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
        # print("1.sparse_softmax_cross_entropy_with_logits.losses:", losses)  # shape=(?,)
        loss = tf.reduce_mean(losses)  # print("2.loss.loss:", loss)  # shape=()
        l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
        loss = loss + l2_losses
        return loss

  • The final loss is defined as the sum of the cross-entropy loss and an L2 loss that penalizes large weights, where the hyperparameter l2_lambda balances the two terms. Since l2_lambda is hard-coded to 0.0001, could it be too small (or too large) under some circumstances, so that one of the two terms loses its contribution to the total loss?
  • In general, are there any practical methods for choosing the values of hyperparameters such as l2_lambda? (See the sketch after this list.)
    @brightmart
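
One practical, hedged approach (not from the repo): expose l2_lambda as a flag and log both loss terms during training; if one term is orders of magnitude larger than the other, adjust l2_lambda on the validation set. A sketch:

    import tensorflow as tf

    tf.app.flags.DEFINE_float("l2_lambda", 0.0001, "weight of the L2 penalty")
    FLAGS = tf.app.flags.FLAGS

    def loss_with_logging(logits, input_y, l2_lambda):
        ce_loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=input_y, logits=logits))
        l2_loss = l2_lambda * tf.add_n(
            [tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name])
        # log both terms so their relative magnitude can be inspected in TensorBoard
        tf.summary.scalar("cross_entropy_loss", ce_loss)
        tf.summary.scalar("l2_loss", l2_loss)
        return ce_loss + l2_loss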
