zll17 / Neural_Topic_Models
Implementation of topic models based on neural network approaches.
Isn't the GSM implementation different from what the paper describes? The paper also has a beta matrix representing the topic-word distribution, but I don't see it in the code.
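For reference, in the GSM paper (Miao et al., 2017) the topic-word distribution beta is constructed from topic and word embeddings rather than stored as a free parameter. A minimal sketch of that construction, with variable names and sizes chosen for illustration rather than taken from this repo:

```python
import torch

n_topic, vocab_size, emb_dim = 20, 10000, 300   # illustrative sizes
topic_emb = torch.randn(n_topic, emb_dim)       # topic vectors t_k (learnable in practice)
word_emb = torch.randn(vocab_size, emb_dim)     # word vectors v_i (learnable in practice)

# beta[k] is a distribution over the vocabulary: softmax of <t_k, v_i>
beta = torch.softmax(topic_emb @ word_emb.t(), dim=1)   # (n_topic, vocab_size)
assert torch.allclose(beta.sum(dim=1), torch.ones(n_topic))
```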
Hi, I have been running BATM with my own dataset and the topic output looks fine. In my case, I set n_topic=5 and get the 5 topics and all the numbers associated with them.
Here is what I have trouble wrapping my head around. My dataset is a bunch of lines, each of which is supposed to represent one document. When I have done topic modelling in the past, the task was to label each document with one topic (traditional clustering) or possibly several (soft clustering). So my dataset is a collection of n documents (n lines in my text editor), and I want to apply the generated topics to each document. How would I accomplish this? Am I talking about the same problem?
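It is the same problem: neural topic models assign each document a topic distribution through their inference network, and a hard label is just its argmax. A minimal sketch, assuming you can call the trained model's encoder on a bag-of-words vector (the callable's name and signature are assumptions, not this repo's confirmed API):

```python
import numpy as np

def label_documents(encode_fn, doc_bows):
    """Assign each document its most probable topic.

    encode_fn: hypothetical callable mapping one bag-of-words vector to an
               (n_topic,) topic distribution (e.g. BATM's trained encoder).
    doc_bows:  iterable of BoW vectors, one per line/document.
    """
    labels = []
    for bow in doc_bows:
        theta = np.asarray(encode_fn(bow))   # soft assignment over topics
        labels.append(int(theta.argmax()))   # hard label; keep theta for soft clustering
    return labels
```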
Traceback (most recent call last):
File "ETM_run.py", line 97, in <module>
main()
File "ETM_run.py", line 69, in main
model.train(train_data=docSet,batch_size=batch_size,test_data=docSet,num_epochs=num_epochs,log_every=10,beta=1.0,criterion=criterion)
File "/home/vimos/git/Neural_Topic_Models/models/ETM.py", line 122, in train
save_name = f'./ckpt/WTM_{self.taskname}_tp{self.n_topic}_{self.dist}_{time.strftime("%Y-%m-%d-%H-%M", time.localtime())}.ckpt'
AttributeError: 'ETM' object has no attribute 'dist'
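Judging from the traceback alone (a guess, not a confirmed patch): the checkpoint name in models/ETM.py interpolates self.dist, which the ETM class apparently never sets. Either define self.dist in __init__ or drop it from the filename, e.g.:

```python
# models/ETM.py, sketch of a workaround: remove the undefined attribute
# (and fix the copy-pasted 'WTM' prefix while at it).
save_name = f'./ckpt/ETM_{self.taskname}_tp{self.n_topic}_{time.strftime("%Y-%m-%d-%H-%M", time.localtime())}.ckpt'
```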
pyhanlp does not support Python 3.9 or later.
Hi and thank you very much for your really helpful code!
I am trying to test my trained model and have problems with the inference.py file.
I specified a checkpoint stored in the ckpt folder, but I get a "KeyError: 'param'".
Could you please elaborate on how to use the --model_path flag? (And in general, it would be useful to have a quick overview on how to use the inference.py file.)
Thank you very much in advance and best regards.
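While waiting for documentation, one way to debug the KeyError is to inspect what the training script actually saved into the checkpoint (assuming it is a regular torch.save dictionary; the path below is a placeholder):

```python
import torch

# Placeholder path: substitute your own file from the ckpt/ folder.
ckpt = torch.load('ckpt/your_checkpoint.ckpt', map_location='cpu')
print(list(ckpt.keys()))  # shows which keys exist, e.g. whether 'param' was ever saved
```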
Recently, following the implementation approach of your code, I reproduced the model from the paper, and I have a few questions:
1. Have you tried increasing the number of epochs? In my experiments the discriminator loss had already converged around epoch 3000, but the encoder and generator losses still had not converged after 20K epochs.
2. The topic-word distributions were not very good at the start, with many topics sharing the same words, but they improved after roughly 10K epochs.
I hope to discuss this with you further. Thanks!
I just checked your code and the training scripts run perfectly, but when I try inference it seems some 'param' was not saved during training, and as far as I can tell there are no instructions for using inference.py. Hope you can help me solve this issue @zll17
When running inference on WTM, I get the following:
File "/home/dgolem/eclipse-workspace/Neural_Topic_Models/inference.py", line 91, in
main()
File "/home/dgolem/eclipse-workspace/Neural_Topic_Models/inference.py", line 87, in main
json.dump(infer_topics,f)
File "/usr/lib64/python3.6/json/init.py", line 179, in dump
for chunk in iterable:
File "/usr/lib64/python3.6/json/encoder.py", line 428, in _iterencode
yield from _iterencode_list(o, _current_indent_level)
File "/usr/lib64/python3.6/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
o = _default(o)
File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
o.class.name)
TypeError: Object of type 'ndarray' is not JSON serializable
[0.011269581504166126, 0.00033260477357544005, 0.3443009555339813, 0.0049138059839606285, 0.007035833317786455, 0.0002668765955604613, 0.0021645957604050636, 0.04201849177479744, 0.0041013904847204685, 0.005380461923778057, 0.005701055750250816, 0.30710265040397644, 0.12966400384902954, 0.06940549612045288, 0.021206317469477654, 0.0028165027033537626, 0.0014157032128423452, 0.00024422683054581285, 0.0011101358104497194, 0.039549320936203]: ['2006', 'Pangandaran', 'earthquake', 'tsunami', 'occur', 'July', '17', 'subduction', 'zone', 'coast', 'west', 'central', 'Java', 'large', 'densely', 'populated', 'island', 'indonesian', 'archipelago', 'shock', 'moment', 'magnitude', '7.7', 'maximum', 'perceive', 'intensity', 'IV', 'Light', 'Jakarta', 'capital', 'large', 'city', 'Indonesia', 'direct', 'effect', 'earthquake', 'shake', 'low', 'intensity', 'large', 'loss', 'life', 'event', 'result', 'tsunami', 'inundate', 'portion', 'Java', 'coast', 'unaffected', 'early', '2004', 'Indian', 'Ocean', 'earthquake', 'tsunami', 'coast', 'Sumatra', 'July', '2006', 'earthquake', 'center', 'Indian', 'Ocean', 'coast', 'Java', 'duration', 'minute', 'abnormally', 'slow', 'rupture', 'Sunda', 'Trench', 'tsunami', 'unusually', 'strong', 'relative', 'size', 'earthquake', 'factor', 'lead', 'categorize', 'tsunami', 'earthquake', 'thousand', 'kilometer', 'southeast', 'surge', 'meter', 'observe', 'northwestern', 'Australia', 'Java', 'tsunami', 'runup', 'height', 'normal', 'sea', 'level', 'typically', 'result', 'death', '600', 'people', 'factor', 'contribute', 'exceptionally', 'high', 'peak', 'runup', 'small', 'uninhabited', 'island', 'Nusa', 'Kambangan', 'east', 'resort', 'town', 'Pangandaran', 'damage', 'heavy', 'large', 'loss', 'life', 'occur', 'shock', 'feel', 'moderate', 'intensity', 'inland', 'shore', 'surge', 'arrive', 'little', 'warning', 'factor', 'contribute', 'tsunami', 'largely', 'undetected', 'late', 'tsunami', 'watch', 'post', 'american', 'tsunami', 'warning', 'center', 'japanese', 'meteorological', 'center', 'information', 'deliver', 'people', 'coast']
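A straightforward fix for the serialization error is to convert the numpy arrays before json.dump, for example with a small custom encoder (a sketch; infer_topics is the variable from the traceback, and the output filename is a placeholder):

```python
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """Turn numpy arrays and scalars into plain Python types for json.dump."""
    def default(self, o):
        if isinstance(o, np.ndarray):
            return o.tolist()
        if isinstance(o, np.generic):
            return o.item()
        return super().default(o)

# In inference.py, pass the encoder to the existing dump call:
with open('infer_topics.json', 'w') as f:   # filename is a placeholder
    json.dump(infer_topics, f, cls=NumpyEncoder)
```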
I edited tokenization.py and in main called tokenizer=SpacyTokenize() to use the spaCy tokenizer for English text, though I always end up getting a "tcmalloc large alloc" memory error when running on Google Colab. Any thoughts on how I can use the English tokenizer for my dataset? Or, for the English dataset dailydialoguttr_lines.txt, how do you run the code for the GSM model? @zll17
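One common way to keep spaCy's memory footprint small on Colab is to use a tokenizer-only pipeline and stream the corpus instead of holding every Doc in memory. A sketch (the file path is a placeholder):

```python
import spacy

# spacy.blank('en') builds a tokenizer-only pipeline, so no statistical
# model is loaded into memory.
nlp = spacy.blank('en')

def tokenize_file(path):
    with open(path, encoding='utf-8') as f:
        lines = [line.strip() for line in f]
    # nlp.pipe streams documents in batches rather than all at once
    for doc in nlp.pipe(lines, batch_size=256):
        yield [tok.text.lower() for tok in doc if not tok.is_space]

# Usage: tokens = list(tokenize_file('data/dailydialoguttr_lines.txt'))
```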
How should the trained model be used? For example, suppose I want to determine which topic a given document (or word) belongs to.
A question for the author: is data/stopwords.txt an existing stopword list or one you built yourself? If it is an existing list, please give its source; if self-built, please explain the construction rules and the points to watch out for.
How do I conduct inference using the BATM model? I see the "show_topic_words" method that returns the top words per topic but am unsure of how to get the topic distribution for documents.
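A hedged sketch of one plausible route, assuming BATM's trained encoder network maps a normalized bag-of-words vector to a topic distribution (the encoder attribute name is a guess, not confirmed from the code):

```python
import torch

# Hypothetical: model.encoder is BATM's trained inference network,
# and doc_bow is a (vocab_size,) torch.FloatTensor for one document.
bow = doc_bow / doc_bow.sum()                            # normalized bag-of-words
with torch.no_grad():
    theta = model.encoder(bow.unsqueeze(0)).squeeze(0)   # (n_topic,) distribution
topic_id = int(theta.argmax())                           # most probable topic
```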
This might be a typo in tokenization.py.
vimos@vimos-Z270MX-Gaming5 (base) ➜ Neural_Topic_Models git:(master) python3 ETM_run.py --taskname zhdd --n_topic 20 --num_epochs 1000 --no_below 5 --auto_adj --emb_dim 300
Traceback (most recent call last):
File "ETM_run.py", line 22, in <module>
from dataset import DocDataset
File "/home/vimos/git/Neural_Topic_Models/dataset.py", line 13, in <module>
from tokenization import Tokenizer
File "/home/vimos/git/Neural_Topic_Models/tokenization.py", line 23
'''
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 318-320: truncated \uXXXX escape
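This error usually means a string or docstring in tokenization.py contains a backslash sequence such as \u, which Python parses as a truncated \uXXXX escape. A sketch of the fix (the literal's contents here are an assumption; only the raw-string prefix matters):

```python
# Before (would raise SyntaxError): a docstring containing something like C:\users\...
# After: mark the literal as a raw string so backslashes stay literal.
example = r'''a docstring mentioning a Windows path such as C:\users\example'''
```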