
neural_topic_models's People

Contributors

vimos · xdarklemon · zll17


neural_topic_models's Issues

Questions about the GSM implementation

The GSM implementation doesn't seem to match the paper: the paper also defines a beta matrix representing the topic-word distribution, but I don't see it anywhere in the code.

Topics as they apply to documents

Hi, I have been running BATM with my own dataset and the topic output comes out fine. In my case I set n_topic = 5 and get the 5 topics along with all the numbers for each topic.

Now, here is what I have trouble wrapping my head around. My dataset is a collection of lines, each of which represents one document. When I have done topic modelling in the past, the task was to label each document with one or more topics (a single topic for traditional hard clustering, possibly several for soft clustering). So my dataset is a collection of n documents (n lines in my text editor), and I want to apply the generated topics to each document. How would I accomplish this? Am I describing the same problem?
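Yes, it is the same problem. If the model exposes the per-document topic distribution theta (as the VAE-style models here generally do), labeling each document reduces to taking the largest entries of its theta. A minimal sketch, assuming you already have one theta vector per document (the function name is mine, not from the repo):

```python
import numpy as np

def doc_topics(theta, top_k=1):
    """Indices of the top_k most probable topics for one document's theta."""
    return [int(i) for i in np.argsort(theta)[::-1][:top_k]]

# toy 5-topic distribution for a single document
theta = np.array([0.05, 0.6, 0.1, 0.2, 0.05])
print(doc_topics(theta))            # hard assignment: the single best topic
print(doc_topics(theta, top_k=2))   # soft assignment: the top-2 topics
```

For soft clustering you can also keep the full theta vector per document instead of truncating to top_k.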

AttributeError: 'ETM' object has no attribute 'dist'

Traceback (most recent call last):
  File "ETM_run.py", line 97, in <module>
    main()
  File "ETM_run.py", line 69, in main
    model.train(train_data=docSet,batch_size=batch_size,test_data=docSet,num_epochs=num_epochs,log_every=10,beta=1.0,criterion=criterion)
  File "/home/vimos/git/Neural_Topic_Models/models/ETM.py", line 122, in train
    save_name = f'./ckpt/WTM_{self.taskname}_tp{self.n_topic}_{self.dist}_{time.strftime("%Y-%m-%d-%H-%M", time.localtime())}.ckpt'
AttributeError: 'ETM' object has no attribute 'dist'
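The save path in ETM.train references self.dist, which the ETM constructor apparently never sets (note the string also says WTM_, which looks copy-pasted from WTM). A minimal sketch of the fix, with a simplified constructor (the real signature differs):

```python
class ETM:
    def __init__(self, taskname, n_topic, dist="gaussian"):
        self.taskname = taskname
        self.n_topic = n_topic
        self.dist = dist  # previously missing, hence the AttributeError

    def ckpt_name(self):
        # getattr fallback keeps models pickled before the fix from crashing
        dist = getattr(self, "dist", "na")
        return f"./ckpt/ETM_{self.taskname}_tp{self.n_topic}_{dist}.ckpt"

m = ETM("zhdd", 20)
print(m.ckpt_name())
```

The timestamp suffix from the original string is omitted here for brevity; the point is only that self.dist must exist before train() builds the path.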

Inference from Checkpoints

Hi and thank you very much for your really helpful code!
I am trying to test my trained model and am having trouble with the inference.py file.
I specified a checkpoint stored in the ckpt folder, but I get a "KeyError: 'param'".
Could you please elaborate on how to use the --model_path flag? (More generally, a quick overview of how to use inference.py would be helpful.)
Thank you very much in advance and best regards.
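Until the scripts agree, a workaround is to re-save the checkpoint in the layout inference.py indexes: a dict with a 'param' entry holding the model weights. A sketch of that contract (key names other than 'param' are assumptions; on disk these dicts would go through torch.save/torch.load):

```python
def make_checkpoint(state_dict, taskname, n_topic):
    """Wrap raw model weights in the dict layout inference.py expects."""
    return {"param": state_dict, "taskname": taskname, "n_topic": n_topic}

def read_checkpoint(ckpt):
    """Mirror of inference.py's access pattern: ckpt['param'] must exist."""
    if "param" not in ckpt:
        raise KeyError("param")  # the error reported in this issue
    return ckpt["param"]
```

If your training run saved only the raw state_dict, loading it and re-saving it wrapped as above may unblock inference.py.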

Hello, questions about the BATM model

I recently reproduced the model from the paper, following the approach in your code, and have a few questions:
1. Have you tried increasing the number of epochs? In my experiments the discriminator loss had already converged by around epoch 3000, but the encoder and generator losses still had not converged after 20k epochs.
2. The topic-word distributions were poor at first, with many topics sharing the same words, but they improved after roughly 10k epochs.
I would be glad to discuss this further. Thanks!

Inference not working!

I just checked your code and the training scripts run perfectly, but when I try inference it seems some 'param' was not saved during training, and as far as I can see there are no instructions for using inference.py. I hope you can help me solve this issue @zll17

Inference problem when trying to convert to json

When running inference on WTM, I get the following:

File "/home/dgolem/eclipse-workspace/Neural_Topic_Models/inference.py", line 91, in <module>
  main()
File "/home/dgolem/eclipse-workspace/Neural_Topic_Models/inference.py", line 87, in main
  json.dump(infer_topics, f)
File "/usr/lib64/python3.6/json/__init__.py", line 179, in dump
  for chunk in iterable:
File "/usr/lib64/python3.6/json/encoder.py", line 428, in _iterencode
  yield from _iterencode_list(o, _current_indent_level)
File "/usr/lib64/python3.6/json/encoder.py", line 325, in _iterencode_list
  yield from chunks
File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
  o = _default(o)
File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
  o.__class__.__name__)
TypeError: Object of type 'ndarray' is not JSON serializable
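The failure is json.dump receiving NumPy arrays. A common workaround is a small json.JSONEncoder subclass that converts NumPy types to plain Python before serializing (a general-purpose sketch, not code from this repo):

```python
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """Turn NumPy arrays and scalars into JSON-serializable Python types."""
    def default(self, o):
        if isinstance(o, np.ndarray):
            return o.tolist()
        if isinstance(o, np.generic):  # np.float32, np.int64, ...
            return o.item()
        return super().default(o)

infer_topics = [np.array([0.1, 0.9])]  # stand-in for the real output
serialized = json.dumps(infer_topics, cls=NumpyEncoder)
```

In inference.py this would mean `json.dump(infer_topics, f, cls=NumpyEncoder)`, or equivalently converting each array with `.tolist()` before dumping.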

Is the embedding that NVDM-GSM saves at the end the document's topic vector? How do I recover human-readable topic words from it?

[0.011269581504166126 0.00033260477357544005 0.3443009555339813
0.0049138059839606285 0.007035833317786455 0.0002668765955604613
0.0021645957604050636 0.04201849177479744 0.0041013904847204685
0.005380461923778057 0.005701055750250816 0.30710265040397644
0.12966400384902954 0.06940549612045288 0.021206317469477654
0.0028165027033537626 0.0014157032128423452 0.00024422683054581285
0.0011101358104497194 0.039549320936203]:['2006', 'Pangandaran', 'earthquake', 'tsunami', 'occur', 'July', '17', 'subduction', 'zone', 'coast', 'west', 'central', 'Java', 'large', 'densely', 'populated', 'island', 'indonesian', 'archipelago', 'shock', 'moment', 'magnitude', '7.7', 'maximum', 'perceive', 'intensity', 'IV', 'Light', 'Jakarta', 'capital', 'large', 'city', 'Indonesia', 'direct', 'effect', 'earthquake', 'shake', 'low', 'intensity', 'large', 'loss', 'life', 'event', 'result', 'tsunami', 'inundate', 'portion', 'Java', 'coast', 'unaffected', 'early', '2004', 'Indian', 'Ocean', 'earthquake', 'tsunami', 'coast', 'Sumatra', 'July', '2006', 'earthquake', 'center', 'Indian', 'Ocean', 'coast', 'Java', 'duration', 'minute', 'abnormally', 'slow', 'rupture', 'Sunda', 'Trench', 'tsunami', 'unusually', 'strong', 'relative', 'size', 'earthquake', 'factor', 'lead', 'categorize', 'tsunami', 'earthquake', 'thousand', 'kilometer', 'southeast', 'surge', 'meter', 'observe', 'northwestern', 'Australia', 'Java', 'tsunami', 'runup', 'height', 'normal', 'sea', 'level', 'typically', 'result', 'death', '600', 'people', 'factor', 'contribute', 'exceptionally', 'high', 'peak', 'runup', 'small', 'uninhabited', 'island', 'Nusa', 'Kambangan', 'east', 'resort', 'town', 'Pangandaran', 'damage', 'heavy', 'large', 'loss', 'life', 'occur', 'shock', 'feel', 'moderate', 'intensity', 'inland', 'shore', 'surge', 'arrive', 'little', 'warning', 'factor', 'contribute', 'tsunami', 'largely', 'undetected', 'late', 'tsunami', 'watch', 'post', 'american', 'tsunami', 'warning', 'center', 'japanese', 'meteorological', 'center', 'information', 'deliver', 'people', 'coast']
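The saved vector looks like a per-document topic distribution theta (here over 20 topics), not something readable directly. To recover text, take the largest entries of theta and look up the top words of those topics in the topic-word matrix (often called beta) through the dictionary. A toy sketch with made-up values and a 2-topic, 4-word vocabulary (names are mine, not the repo's):

```python
import numpy as np

def topic_top_words(beta_row, id2token, top_n=10):
    """Most probable words of one topic, given its row of the beta matrix."""
    return [id2token[int(i)] for i in np.argsort(beta_row)[::-1][:top_n]]

beta = np.array([[0.50, 0.05, 0.15, 0.30],   # topic 0's word distribution
                 [0.10, 0.20, 0.30, 0.40]])  # topic 1's word distribution
id2token = {0: "earthquake", 1: "tsunami", 2: "java", 3: "coast"}

theta = np.array([0.8, 0.2])        # the saved per-document vector
best_topic = int(np.argmax(theta))  # dominant topic for this document
print(topic_top_words(beta[best_topic], id2token, top_n=2))
```

In this repo the models also expose show_topic_words, which performs essentially this lookup for every topic at once.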

Error on using Spacy Tokenizer

I edited tokenization.py and in main called

tokenizer = SpacyTokenize()

to use the spaCy tokenizer for English text. However, I always end up getting a

tcmalloc: large alloc

memory error when running on Google Colab.

Any thoughts on how I can use the English tokenizer on my dataset? Also, for the English dataset dailydialoguttr_lines.txt, how do you run the GSM model? @zll17
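The tcmalloc warning usually means the whole corpus (or a full spaCy pipeline) is held in memory at once. Streaming the file and feeding nlp.pipe small batches with a blank pipeline is typically enough on Colab. A sketch (the file path and helper names are mine, not the repo's):

```python
def stream_lines(path, limit=None):
    """Yield documents one line at a time instead of reading the whole file."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield line.strip()

def tokenize_corpus(lines, nlp, batch_size=256):
    """Stream token lists; nlp.pipe processes documents in small batches."""
    for doc in nlp.pipe(lines, batch_size=batch_size):
        yield [tok.text for tok in doc if not tok.is_space]

def main(path="dailydialoguttr_lines.txt"):
    import spacy
    # spacy.blank("en") only tokenizes, far lighter than en_core_web_sm
    nlp = spacy.blank("en")
    for tokens in tokenize_corpus(stream_lines(path), nlp):
        pass  # feed each token list into the dictionary / DocDataset build
```

If you do need a full model, passing disable=["parser", "ner"] when loading it cuts memory further.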

Is there a problem with the GSM implementation?

Hello, and thank you for sharing your code. I ran GSM on cnews10 with the default parameters and have one concern: the KL divergence vanishes (it stays close to 0) and the discovered topics are poor. Is this an implementation issue?

[screenshot of training metrics]

topic diversity: 0.03866666666666667
c_v: 0.7579875287637481, c_w2v: None, c_uci: -18.122450398623315, c_npmi: -0.6600369278689214
mimno topic coherence: -326.14847513585073

The TD and NPMI scores suggest something is wrong with the model.
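A near-zero KL under default settings usually points to posterior collapse rather than a metrics bug; a common mitigation is to anneal the KL term's weight from 0 to 1 during training. A minimal sketch (the warmup length is a hypothetical hyperparameter, not a value from this repo):

```python
def kl_weight(step, warmup_steps=1000):
    """Linearly ramp the KL coefficient from 0 to 1 over warmup_steps updates."""
    return min(1.0, step / warmup_steps)

# inside the training loop (rec_loss and kl_div come from the model):
#   loss = rec_loss + kl_weight(global_step) * kl_div
```

Lower learning rates or a smaller encoder can have a similar effect; which works here would need experimentation.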

How was stopwords.txt produced?

Could you clarify whether stopwords.txt under data/ is an existing stopword list or one you built yourself? If it is an existing list, please cite the source; if you built it, please describe the construction rules and anything to watch out for.

Inference for BATM

How do I conduct inference using the BATM model? I see the "show_topic_words" method that returns the top words per topic but am unsure of how to get the topic distribution for documents.
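With the GAN-style BATM, the component that maps documents to topics is the encoder: feed it each document's normalized bag-of-words vector and read off theta. A sketch of building that input (method and attribute names are assumptions; check models/BATM.py for the real ones):

```python
import numpy as np

def bow_vector(doc_tokens, token2id, vocab_size):
    """Normalized bag-of-words representation of one tokenized document."""
    v = np.zeros(vocab_size, dtype=np.float32)
    for tok in doc_tokens:
        if tok in token2id:            # silently drop out-of-vocabulary tokens
            v[token2id[tok]] += 1.0
    total = v.sum()
    return v / total if total > 0 else v

# then, roughly:  theta = batm.encoder(torch.from_numpy(bow).unsqueeze(0))
```

The resulting theta row is the document's topic distribution, which pairs with the topics that show_topic_words prints.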

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes

This might be a typo in tokenization.py.

vimos@vimos-Z270MX-Gaming5 (base) ➜  Neural_Topic_Models git:(master) python3 ETM_run.py --taskname zhdd --n_topic 20 --num_epochs 1000 --no_below 5 --auto_adj --emb_dim 300
Traceback (most recent call last):
  File "ETM_run.py", line 22, in <module>
    from dataset import DocDataset
  File "/home/vimos/git/Neural_Topic_Models/dataset.py", line 13, in <module>
    from tokenization import Tokenizer
  File "/home/vimos/git/Neural_Topic_Models/tokenization.py", line 23
    '''
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 318-320: truncated \uXXXX escape
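The parser is choking on a backslash sequence inside a string literal (the docstring at tokenization.py line 23), which it reads as a truncated \uXXXX escape. Prefixing the literal with r makes it a raw string and disables escape processing. An illustration (these literals are examples, not the repo's actual docstring):

```python
# A plain string like "C:\users\..." fails to even parse, because \u starts
# a unicode escape. Raw strings keep backslashes verbatim:
path = r"C:\users\data"
assert path[2] == "\\" and len(path) == 13

# The same applies to docstrings: write r'''...''' when the text contains
# backslash sequences such as \u, \x, or regex/LaTeX fragments.
```

So the likely one-character fix is turning the offending '''...''' docstring into r'''...'''.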
