
Neural_Topic_Models: Issues

GSM implementation question

The GSM implementation doesn't seem to match the paper's description. The paper also includes a beta matrix representing the topic-word distribution, but I can't find it in the code.

Error on using Spacy Tokenizer

I edited tokenize.py and in main called

tokenizer = SpacyTokenize()

to use the spaCy tokenizer for English text. However, I always end up getting a

tcmalloc: large alloc

memory error when running on Google Colab.

Any thoughts on how I can use the English tokenizer for my dataset? Alternatively, for the English dataset dailydialoguttr_lines.txt, how do you run the code for the GSM model? @zll17
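One way to keep spaCy's memory footprint small on Colab is to load only the tokenizer (not the full pipeline) and stream the corpus rather than materializing all docs at once. This is a minimal sketch, not the repo's tokenizer class; `tokenize_lines` is a hypothetical helper:

```python
# Minimal sketch: tokenize English text with spaCy while keeping memory low.
# spacy.blank("en") loads just the tokenizer (no tagger/parser/NER), and
# nlp.pipe streams documents in batches instead of holding them all at once,
# which helps avoid the large allocations behind "tcmalloc: large alloc".
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download needed

def tokenize_lines(lines, batch_size=256):
    for doc in nlp.pipe(lines, batch_size=batch_size):
        yield [tok.text.lower() for tok in doc if not tok.is_space]

lines = ["Hello world!", "Topic models are fun."]
print(list(tokenize_lines(lines)))
```

If the full `en_core_web_sm` model is really needed, `spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])` achieves a similar reduction.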

How was stopwords.txt produced?

Could the author clarify whether data/stopwords.txt is an existing stopword list or one you built yourself? If it is an existing list, what is its source? If you built it, what rules and considerations went into it?

Inference from Checkpoints

Hi, and thank you very much for your really helpful code!
I am trying to test my trained model and am having problems with the inference.py file.
I specified a checkpoint stored in the ckpt folder, but I get a "KeyError: 'param'".
Could you please elaborate on how to use the --model_path flag? (In general, a quick overview of how to use inference.py would be useful.)
Thank you very much in advance, and best regards.
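One way to debug a KeyError like this is to inspect which keys the checkpoint actually contains before loading it. This sketch assumes the checkpoint is an ordinary `torch.save` dictionary; the key names and the demo file used below are illustrative, not the repo's actual format:

```python
# Hypothetical debugging sketch: a KeyError: 'param' usually means the dict
# saved at training time used a different key (e.g. 'net' or 'state_dict')
# than the one inference.py expects. Print the keys, then fall back.
import os
import tempfile
import torch

# simulate a checkpoint whose weights were saved under 'net', not 'param'
path = os.path.join(tempfile.gettempdir(), "demo.ckpt")
torch.save({"net": {"weight": torch.zeros(2)}}, path)

ckpt = torch.load(path, map_location="cpu")
print(list(ckpt.keys()))  # shows which keys really exist

# try the common key names instead of assuming 'param'
state_dict = next(ckpt[k] for k in ("param", "net", "state_dict") if k in ckpt)
```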

Is the embedding saved by NVDM-GSM the document's topic vector? How do I recover the topics in textual form?

[0.011269581504166126 0.00033260477357544005 0.3443009555339813
0.0049138059839606285 0.007035833317786455 0.0002668765955604613
0.0021645957604050636 0.04201849177479744 0.0041013904847204685
0.005380461923778057 0.005701055750250816 0.30710265040397644
0.12966400384902954 0.06940549612045288 0.021206317469477654
0.0028165027033537626 0.0014157032128423452 0.00024422683054581285
0.0011101358104497194 0.039549320936203]:['2006', 'Pangandaran', 'earthquake', 'tsunami', 'occur', 'July', '17', 'subduction', 'zone', 'coast', 'west', 'central', 'Java', 'large', 'densely', 'populated', 'island', 'indonesian', 'archipelago', 'shock', 'moment', 'magnitude', '7.7', 'maximum', 'perceive', 'intensity', 'IV', 'Light', 'Jakarta', 'capital', 'large', 'city', 'Indonesia', 'direct', 'effect', 'earthquake', 'shake', 'low', 'intensity', 'large', 'loss', 'life', 'event', 'result', 'tsunami', 'inundate', 'portion', 'Java', 'coast', 'unaffected', 'early', '2004', 'Indian', 'Ocean', 'earthquake', 'tsunami', 'coast', 'Sumatra', 'July', '2006', 'earthquake', 'center', 'Indian', 'Ocean', 'coast', 'Java', 'duration', 'minute', 'abnormally', 'slow', 'rupture', 'Sunda', 'Trench', 'tsunami', 'unusually', 'strong', 'relative', 'size', 'earthquake', 'factor', 'lead', 'categorize', 'tsunami', 'earthquake', 'thousand', 'kilometer', 'southeast', 'surge', 'meter', 'observe', 'northwestern', 'Australia', 'Java', 'tsunami', 'runup', 'height', 'normal', 'sea', 'level', 'typically', 'result', 'death', '600', 'people', 'factor', 'contribute', 'exceptionally', 'high', 'peak', 'runup', 'small', 'uninhabited', 'island', 'Nusa', 'Kambangan', 'east', 'resort', 'town', 'Pangandaran', 'damage', 'heavy', 'large', 'loss', 'life', 'occur', 'shock', 'feel', 'moderate', 'intensity', 'inland', 'shore', 'surge', 'arrive', 'little', 'warning', 'factor', 'contribute', 'tsunami', 'largely', 'undetected', 'late', 'tsunami', 'watch', 'post', 'american', 'tsunami', 'warning', 'center', 'japanese', 'meteorological', 'center', 'information', 'deliver', 'people', 'coast']
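The vector above looks like a distribution over the 20 topics for one document. A minimal sketch of recovering a textual reading, assuming the saved embedding is a per-document topic distribution and the top words per topic are available (e.g. from the model's show_topic_words method); the toy 5-topic vector and word lists below are made up:

```python
# Hypothetical sketch: map a document's topic distribution back to words by
# ranking topics by probability and printing each topic's top words.
import numpy as np

doc_topic = np.array([0.011, 0.344, 0.307, 0.129, 0.069])  # toy 5-topic vector
topic_words = [                                            # made-up word lists
    ["sports", "game"], ["earthquake", "tsunami"], ["coast", "island"],
    ["warning", "center"], ["city", "capital"],
]

top_k = np.argsort(doc_topic)[::-1][:2]    # indices of the 2 strongest topics
for i in top_k:
    print(f"topic {i} (p={doc_topic[i]:.3f}): {topic_words[i]}")
```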

Inference for BATM

How do I conduct inference using the BATM model? I see the "show_topic_words" method that returns the top words per topic, but I am unsure how to get the topic distribution for documents.
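In BATM the encoder network maps a document's normalized bag-of-words vector to a topic distribution, so inference amounts to a forward pass through the trained encoder. The helper name `doc_topic_dist` and the stand-in encoder below are assumptions for illustration, not the repo's exact API:

```python
# Hedged sketch: get a per-document topic distribution from an encoder that
# maps normalized BOW vectors to a softmax over topics (as BATM's E does).
import torch
import torch.nn as nn

def doc_topic_dist(encoder, bow):
    """Forward a batch of bag-of-words count vectors through the encoder."""
    encoder.eval()
    with torch.no_grad():
        bow = bow / bow.sum(dim=-1, keepdim=True)  # normalize counts
        return encoder(bow)                        # shape: (batch, n_topic)

# stand-in for the trained BATM encoder (6-word vocab, 3 topics)
encoder = nn.Sequential(nn.Linear(6, 3), nn.Softmax(dim=-1))
bow = torch.tensor([[2., 0., 1., 0., 0., 3.]])     # one document's word counts
theta = doc_topic_dist(encoder, bow)
print(theta.shape)  # torch.Size([1, 3]); each row sums to 1
```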

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes

This might be a typo in tokenization.py.

vimos@vimos-Z270MX-Gaming5 (base) ➜  Neural_Topic_Models git:(master) python3 ETM_run.py --taskname zhdd --n_topic 20 --num_epochs 1000 --no_below 5 --auto_adj --emb_dim 300
Traceback (most recent call last):
  File "ETM_run.py", line 22, in <module>
    from dataset import DocDataset
  File "/home/vimos/git/Neural_Topic_Models/dataset.py", line 13, in <module>
    from tokenization import Tokenizer
  File "/home/vimos/git/Neural_Topic_Models/tokenization.py", line 23
    '''
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 318-320: truncated \uXXXX escape
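This error is typically caused by a non-raw string literal (often a docstring or a Windows-style path) containing a backslash sequence such as `\u...`, which Python tries to decode as a `\uXXXX` unicode escape. Prefixing the literal with `r` disables that decoding; the strings below are illustrative:

```python
# Sketch of the fix: raw strings keep backslashes literal, so sequences
# like "\usr" no longer trigger "truncated \uXXXX escape" at parse time.

# bad = "path\usr\local"    # SyntaxError: truncated \uXXXX escape
good = r"path\usr\local"    # raw string: backslashes kept literally
pattern = r"\u4e00-\u9fa5"  # e.g. a CJK character-range regex, kept raw
print(good, pattern)
```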

Inference not working!

I just checked your code; the train scripts run perfectly, but when I try inference it seems that some 'param' was not saved during training, and as far as I can see there are no instructions for using inference.py. I hope you can help me solve this issue. @zll17

Topics as they apply to documents

Hi, I have been running BATM with my own dataset and I get the topic output just fine. In my case, I set n_topic = 5 and get the 5 topics and all the numbers regarding them.

Now, here is what I have trouble wrapping my head around. The dataset is a set of lines, each of which is supposed to represent one document. When I have done topic modelling in the past, the goal was to label each document with one or more topics (one topic for traditional clustering, possibly more for soft clustering). So my dataset is a collection of n documents (n lines in my text editor), and I want to apply these generated topics to each document. How would I accomplish this? Am I talking about the same problem?

Zombie Process Issue

Every time I stop running ETM_run.py, multiple zombie processes are left behind. Can you shed light on this issue? Thanks!
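Stray processes after killing a PyTorch training script are often DataLoader worker processes. A minimal workaround, assuming the script uses a DataLoader, is to set `num_workers=0` so all data loading stays in the main process (at some throughput cost):

```python
# Sketch: a DataLoader with num_workers=0 forks no child processes, so
# interrupting the script cannot leave orphaned workers behind.
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8).float())
loader = DataLoader(ds, batch_size=4, num_workers=0)  # no worker processes

for (batch,) in loader:
    print(batch)
```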

Hello, a few questions about the BATM model

I recently reproduced the model from the paper following your implementation, and I have a few questions:
1. Have you experimented with more epochs? In my runs the discriminator loss had already converged by around epoch 3000, but the encoder and generator losses still had not converged at 20K epochs.
2. The topic-word distributions are poor at first, with many topics sharing the same words, but they start to look reasonable after about 10K epochs.
I would be glad to discuss this further. Thanks!

Inference problem when trying to convert to json

When running inference on WTM, I get the following:

Traceback (most recent call last):
  File "/home/dgolem/eclipse-workspace/Neural_Topic_Models/inference.py", line 91, in <module>
    main()
  File "/home/dgolem/eclipse-workspace/Neural_Topic_Models/inference.py", line 87, in main
    json.dump(infer_topics,f)
  File "/usr/lib64/python3.6/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib64/python3.6/json/encoder.py", line 428, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/usr/lib64/python3.6/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ndarray' is not JSON serializable
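The error means `infer_topics` contains numpy arrays, which the standard `json` encoder cannot serialize. A minimal fix sketch (the `infer_topics` contents below are made up):

```python
# Two equivalent fixes: convert ndarrays with .tolist() before dumping, or
# pass a default= hook so json handles any ndarray it encounters.
import json
import numpy as np

infer_topics = {"doc0": np.array([0.1, 0.9])}  # toy stand-in

# option 1: convert explicitly
s1 = json.dumps({k: v.tolist() for k, v in infer_topics.items()})

# option 2: a default hook that converts ndarrays on the fly
s2 = json.dumps(infer_topics,
                default=lambda o: o.tolist() if isinstance(o, np.ndarray)
                else str(o))

print(s1 == s2)  # True
```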

AttributeError: 'ETM' object has no attribute 'dist'

Traceback (most recent call last):
  File "ETM_run.py", line 97, in <module>
    main()
  File "ETM_run.py", line 69, in main
    model.train(train_data=docSet,batch_size=batch_size,test_data=docSet,num_epochs=num_epochs,log_every=10,beta=1.0,criterion=criterion)
  File "/home/vimos/git/Neural_Topic_Models/models/ETM.py", line 122, in train
    save_name = f'./ckpt/WTM_{self.taskname}_tp{self.n_topic}_{self.dist}_{time.strftime("%Y-%m-%d-%H-%M", time.localtime())}.ckpt'
AttributeError: 'ETM' object has no attribute 'dist'
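The save-path f-string in ETM.py appears to be copied from WTM (note the "WTM_" prefix in the traceback) and references self.dist, which ETM never sets. A hedged sketch of a defensive fix using getattr; the ETM stand-in class and the default label are assumptions for illustration:

```python
# Sketch: getattr with a default avoids the AttributeError when self.dist
# is absent. "gmm" is an illustrative fallback, not the repo's value.
import time

class ETM:  # stand-in with just the attributes the f-string needs
    taskname, n_topic = "zhdd", 20

self = ETM()
dist = getattr(self, "dist", "gmm")
save_name = (f'./ckpt/ETM_{self.taskname}_tp{self.n_topic}_{dist}_'
             f'{time.strftime("%Y-%m-%d-%H-%M", time.localtime())}.ckpt')
print(save_name)
```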

Is there a problem with the GSM implementation?

Hello, and thank you for sharing the code. I ran GSM on cnews10 with the default parameters, and I have one concern: the KL divergence vanishes (it stays near 0) and the discovered topics are very poor. Is this an implementation issue?

[screenshot of training log omitted]

topic diversity:0.03866666666666667
c_v:0.7579875287637481, c_w2v:None, c_uci:-18.122450398623315, c_npmi:-0.6600369278689214
mimno topic coherence:-326.14847513585073

The TD and NPMI scores suggest the model is not working correctly.
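A vanishing KL term in a VAE-style topic model such as GSM is often mitigated with KL annealing: scale the KL loss by a weight that ramps from 0 to 1 over the first epochs, so the encoder is not collapsed onto the prior before the decoder has learned anything. The schedule below is an illustrative sketch, not the repo's training loop:

```python
# Hedged sketch of linear KL annealing. The warmup length is a tunable
# hyperparameter; in the loss, use: rec_loss + kl_weight(epoch) * kl_div
def kl_weight(epoch, warmup_epochs=100):
    return min(1.0, epoch / warmup_epochs)

print(kl_weight(0), kl_weight(50), kl_weight(200))  # 0.0 0.5 1.0
```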
