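Hello, and thanks for contributing this code. I ran the GSM code on cnews10 and have a question: the KL divergence vanishes (it is essentially 0) and the discovered topics are very poor. Is this a problem with the implementation? I used the default parameters.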


Is there a problem with the GSM implementation? about neural_topic_models HOT 7 OPEN

zll17 commented on August 14, 2024

Is there a problem with the GSM implementation?

from neural_topic_models.

Comments (7)

A11en0 commented on August 14, 2024

Several of the other models show essentially the same problem.


zll17 commented on August 14, 2024

Thanks for the feedback. I'll look into it and reply shortly.


A11en0 commented on August 14, 2024

Thank you very much!


zll17 commented on August 14, 2024

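Hi, I tested GSM, WLDA, and WTM in a freshly configured environment, cloned straight from the repository. I ran GSM three times and WLDA and WTM-GMM once each. The results look broadly normal, but GSM does turn out to be unstable: two runs with identical parameters differed in TD by about 0.2 (0.423 and 0.626). My commands and results are below: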

| exp_id | Main parameter differences | TD |
| --- | --- | --- |
| gsm_exp0_manual | --no_above 0.0134 --no_below 5 | 0.423 |
| gsm_exp1_autoadj | --autoadj | 0.466 |
| gsm_exp2_manual | --no_above 0.0134 --no_below 5 | 0.627 |
| wlda_exp0_autoadj | --autoadj --dist dirichlet | 0.953 |
| wtmgmm_exp0_autoadj | --autoadj --dist gmm-std | 0.99 |
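For readers unfamiliar with the TD column: topic diversity is commonly defined as the fraction of unique words among the top-k words of all topics, so a low value means the topics share most of their top words. The sketch below assumes that common definition; the repository may compute TD differently, and the function name and toy topics are made up for illustration.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition (a common one); not taken from the repository.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics, invented for illustration (not taken from the logs).
distinct = [["stock", "market", "fund"], ["match", "team", "coach"]]
overlapping = [["stock", "market", "fund"], ["stock", "market", "team"]]

td_distinct = topic_diversity(distinct, top_k=3)
td_overlapping = topic_diversity(overlapping, top_k=3)
print(td_distinct, td_overlapping)
```

Under this definition, a TD near 1 means the topics barely overlap, while the homogenized case scores lower because the same words are counted once but appear in several topics.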

python GSM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --no_above 0.0134 --no_below 5 --rebuild
Experiment log:
gsm_exp0_manual.log

python GSM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --auto_adj --rebuild
Experiment log:
gsm_exp1_autoadj.log

python GSM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --no_above 0.0134 --no_below 5 --rebuild
Experiment log:
gsm_exp2_manual.log

python WTM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --no_above 0.013 --dist dirichlet --rebuild
Experiment log:
wlda_exp0_autoadj.log

python WTM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --auto_adj --dist gmm-std --rebuild
Experiment log:
wtmgmm_exp0_autoadj.log

My dependencies are:
reqs.txt
The CUDA version is 11.3.

The topic coherence of WLDA and WTM-GMM looks to be within an acceptable range (?).

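GSM's results in exp0 and exp1 are poor, with heavily homogenized topics, while exp2 is acceptable. I suspect this instability is related to the overlapping distributions inherent to VAEs. I will check the later changes against the original paper.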

I can't tell whether your results come from different dependency versions. I'd suggest running it again, perhaps with my configuration.


A11en0 commented on August 14, 2024

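Thanks for running these experiments! I read through all the logs and repeated the experiments myself. Besides GSM (most people actually call it NVDM) on cnews10k, I also ran an English dataset, and I still think something is off.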

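The logs only include records from after epoch 400, by which point the KL is around 2.x. I'd guess the KL at the start is very small, as in my screenshot, so the KL vanishing problem is still there.
With TD this low, the coherence metrics don't tell us much; there is probably still a problem.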
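To make the KL point concrete, here is a minimal sketch of the closed-form KL term of a standard VAE, assuming a diagonal Gaussian posterior and a standard normal prior. Whether the repository's loss has exactly this form is an assumption, and the numbers are invented. A KL of essentially 0 means the posterior equals the prior for every document, so the latent code carries no document-specific information.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one document."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


# Posterior identical to the prior: the "vanished" case.
collapsed = gaussian_kl([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Posterior that moved away from the prior (made-up values).
informative = gaussian_kl([1.0, -2.0, 0.5], [-1.0, -0.5, 0.0])

print(collapsed, informative)
```

Logging this term from the first epoch, rather than only after epoch 400, would show whether it starts near zero and recovers later or stays collapsed.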
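As for my own experiments, I ran the code straight from a fresh clone. I didn't pay much attention to the versions of the other libraries, and I personally doubt this is a library-version issue. At most it is related to the tokenizer version, which I'll set aside for now. Here are the results: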

English: GoogleNews
topic diversity:0.36533333333333334
c_v:0.560458509945799, c_w2v:None, c_uci:-12.310158051451925, c_npmi:-0.4185083982965566
mimno topic coherence:-352.54628749569434

The results are clearly abnormal: TD is still very small, and CV is implausibly high, already above what many recent models report.
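As a reference point for reading the NPMI number above, here is a small sketch of NPMI-style coherence computed from document-level co-occurrence. This is a simplification and an assumption on my part: as far as I know, gensim's `c_npmi` estimates probabilities from sliding windows over the texts rather than from whole documents, so the values are not directly comparable. The function name and the toy corpus are invented.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Mean NPMI over all word pairs, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n_docs
        p2 = sum(w2 in d for d in doc_sets) / n_docs
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n_docs
        if p12 == 0.0:
            scores.append(-1.0)  # never co-occur: lower bound by convention
        elif p12 == 1.0:
            scores.append(1.0)   # always co-occur: avoid dividing by log(1)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus, invented for illustration.
docs = [
    ["stock", "market", "fund"],
    ["stock", "market"],
    ["team", "coach"],
    ["team", "coach", "match"],
]
coherent = npmi_coherence(["stock", "market"], docs)
incoherent = npmi_coherence(["stock", "team"], docs)
print(coherent, incoherent)
```

NPMI lies between -1 and 1, so a clearly negative average suggests that many top-word pairs rarely appear together in the reference texts.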

Chinese: Cnews10k


zll17 commented on August 14, 2024

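CV and the other metrics are indeed still in doubt. They are computed with gensim, but because its interface doesn't match my implementation, I chose the second calling convention gensim offers (gensim.models.coherencemodel). Its correctness still needs to be verified, and I may be using it wrong; I'll look into this part further.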

Another way of using this feature is through providing tokenized topics such as:
.. sourcecode:: pycon
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.coherencemodel import CoherenceModel
>>> topics = [
... ['human', 'computer', 'system', 'interface'],
... ['graph', 'minors', 'trees', 'eps']
... ]
>>>
>>> cm = CoherenceModel(topics=topics, corpus=common_corpus, dictionary=common_dictionary, coherence='u_mass')
>>> coherence = cm.get_coherence() # get coherence value
"""

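I did find a problem with the English tokenizer: the preprocessing is insufficient, and spacy's tokenizer hit a bug where lemmatization fails when there is too much data.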

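On the naming:
NVDM was proposed in: Y Miao (2015). Neural Variational Inference for Text Processing,
GSM was proposed in: Y Miao (2017). Discovering Discrete Latent Topics with Neural Variational Inference.
One difference is that NVDM places no normalization constraint on the latent variable h, whereas GSM does (so GSM works with a distribution $\theta$). Depending on how the normalization function g(x) is constructed, the GSM paper gives three models: GSM, GSB, and RSB. Table 2 of that paper distinguishes GSM from NVDM, and the authors' NVDM implementation likewise does not normalize h. Since the three models are very similar in the paper and perform about the same, only GSM is implemented here.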
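The distinction can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the repository's or the authors' code: as I read the GSM paper, a linear projection is applied to the Gaussian sample before the softmax, and that projection is omitted here. All values and helper names are invented.

```python
import math
import random


def sample_gaussian(mu, logvar, rng):
    """Reparameterized sample: mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

# NVDM-style latent: an unconstrained real vector h.
h = sample_gaussian([0.2, -1.0, 0.7], [-1.0, -1.0, -1.0], rng)

# GSM-style latent: h pushed through softmax, giving topic proportions.
theta = softmax(h)

print(h, theta)
```

Here `h` can take any real values, while `theta` is non-negative and sums to 1, which is what lets it be read as a per-document topic distribution.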

Could you share a link to the GoogleNews dataset? I could only find the pretrained word2vec weights, and I'd like to try it.


A11en0 commented on August 14, 2024

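For calling gensim, you can refer to the CTM implementation at https://github.com/MilaNLProc/contextualized-topic-models; they also provide a good preprocessing method that is worth consulting.
The code for the survey https://arxiv.org/abs/1904.07695 includes the dataset: https://github.com/qiang2100/STTM?utm_source=catalyzex.com.
I hadn't read the papers on the older models carefully, so thanks for the correction! Still, I'd suggest making the same change in your README.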
