insanelife / dssm

DSSM and Multi-View DSSM

Python 100.00%

dssm

dssm's Issues

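When training dssm_rnn, the training loss decreases, so why does the test loss increase instead of decreasing?

As shown in the figure, the training and test sets each contain 5000 samples, randomly sampled from oppo_round1_train_20180929.txt.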

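Should [SEP] be added?

_transform_2seq2bert_id
This function appears to concatenate the two sentences into the BERT input. Shouldn't [SEP] be inserted between the two sentences?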
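
For reference, the standard BERT sentence-pair layout is `[CLS] A [SEP] B [SEP]`, with segment id 0 for the first sentence and 1 for the second. The following is only a minimal sketch of that layout; `build_pair_input` and `toy_vocab` are hypothetical names with made-up ids, and this is not the repository's `_transform_2seq2bert_id`.

```python
from typing import Dict, List, Tuple


def build_pair_input(
    tokens_a: List[str], tokens_b: List[str], vocab: Dict[str, int]
) -> Tuple[List[int], List[int]]:
    """Build BERT-style ids for a sentence pair: [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return input_ids, segment_ids


# Hypothetical toy vocabulary; a real BERT vocab file uses different ids.
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "今": 4, "天": 5, "气": 6, "温": 7, "度": 8,
}

ids, segs = build_pair_input(list("今天天气"), list("今天温度"), toy_vocab)
```

If the two sentences are joined without the middle `[SEP]`, the model receives no explicit boundary marker between them, which is the concern raised in this issue.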

ValueError: Cannot feed value of shape (0,) for Tensor 'input/query_batch:0', which has shape '(?, ?)'

Hello, with the latest code I still get this data error. Any advice would be appreciated, thanks.

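Data format

Hello, could you tell me the format of the training samples (how the positive and negative samples are organized; is the input one query with 1 positive sample and 4 negative samples)? Also, does your Chinese feature extraction use only unigrams? Could you leave an email address or other contact details? Thanks. (By the way, I am in Chengdu too, haha.)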

The dataset link cannot be retrieved. Can you send me a new version?

Why use the label from the data? The label means "clicked or not", not similarity

Thanks! Is that right?
@InsaneLife
What is this task for?

ModuleNotFoundError: No module named 'multi_view_data_input'

I could not find multi_view_data_input.py. Could you send me a copy?

Could you describe the shapes of the inputs query_in, positive_in, and negative_in?

Can you tell me the dataset format or show a screenshot?
In the following, you use data_sets.query_test_data, data_sets.doc_test_positive, data_sets.doc_test_negative, so I don't quite understand the format.
Thanks!

Can it be trained on Mac or Linux?

Hello, can this be trained on Mac or Linux?

Are the training data and test data behind the loss metrics in the README the same?

Could you send me a copy of data_input?

My email is [email protected]. Thanks.

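Low test accuracy

Using the "siamese_bert" model on a company dataset of 800,000 samples with a 1 (positive) : 4 (negative) ratio, the results ranked by cosine similarity in descending order look completely unreliable, which is worrying.
auc: 0.64
accuracy: 0.75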

Why is the dssm loss computed with reduce_sum?

In dssm.py, the code that computes the loss is:
with tf.name_scope('Loss'):
    # Train Loss
    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=doc_label_batch, logits=cos_sim)
    losses = tf.reduce_sum(cross_entropy)
    tf.summary.scalar('loss', losses)
    pass
Is something wrong here? Why reduce_sum rather than reduce_mean?
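
To make the difference concrete: for a fixed batch size, the summed and averaged losses differ only by a constant factor (the batch size), so the gradient direction is the same while its scale, and therefore the effective learning rate, grows with the batch. The sketch below assumes made-up labels and logits and re-implements the numerically stable sigmoid cross-entropy formula in plain Python rather than calling TensorFlow.

```python
import math


def sigmoid_cross_entropy(label: float, logit: float) -> float:
    """Stable form of -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Made-up batch of 4 examples (illustrative values only).
labels = [1.0, 0.0, 0.0, 1.0]
logits = [0.8, 0.3, -0.2, 0.1]

per_example = [sigmoid_cross_entropy(z, x) for z, x in zip(labels, logits)]
loss_sum = sum(per_example)               # analogous to reduce_sum
loss_mean = loss_sum / len(per_example)   # analogous to reduce_mean
```

With reduce_mean the logged loss stays comparable across different batch sizes; with reduce_sum the logged value scales with the number of examples in the batch.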

pull batch uses the validation-set data

The data in pull batch seems to be hard-coded to vali_data.
Try running it with the datasets kept separate.

Why do the softmax values become NaN after 4 or 5 epochs of training?

After 5 epochs of training the loss is still decreasing, but the output softmax values have all become NaN. As a result, the AUC has dropped to 0.5.
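
The cause is not established in this thread. Two common sources of NaN in cosine-plus-softmax models are a zero-norm vector in the cosine denominator and overflow inside `exp`. The sketch below shows generic guards for both; `safe_cosine` and `stable_softmax` are hypothetical helper names, not functions from this repository.

```python
import math
from typing import List


def safe_cosine(u: List[float], v: List[float], eps: float = 1e-8) -> float:
    """Cosine similarity with a floor on the denominator to avoid 0/0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def stable_softmax(scores: List[float]) -> List[float]:
    """Softmax that subtracts the max score first so exp never overflows."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Large scores would overflow a naive exp(); the shifted version is fine.
probs = stable_softmax([1000.0, 999.0, 998.0])
# An all-zero embedding yields similarity 0.0 instead of NaN.
zero_sim = safe_cosine([0.0, 0.0], [1.0, 2.0])
```

In plain Python, `math.exp(1000.0)` raises OverflowError; in float32 tensor code the same computation typically produces inf, and inf divided by inf gives NaN.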

What does data_input.get_search_data return?

Does the SiamenseBert family in train.py have no negative examples, making it inconsistent with the task in the DSSM paper?

I want to predict a query pair such as "今天天气怎么样" and "今天温度怎么样". What should I do?

data_input

import data_input fails with ModuleNotFoundError: No module named 'data_input'
Where is data_input?

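How to predict

You need to train the model first and then run prediction. The training entry point is train.py.
Training (uses the LCQMC dataset by default):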

python train.py --mode=train

Prediction:

python train.py --mode=train --file=$predict_file$

Test file format: q1\tq2, for example:

今天天气怎么样	今天温度怎么样

Originally posted by @InsaneLife in #25 (comment)
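
The q1\tq2 test file format quoted above can be written and read back with a few lines of Python. This is only a sketch of the file layout; `write_pairs`, `read_pairs`, and the file name `predict.tsv` are hypothetical and do not come from the repository.

```python
import os
import tempfile
from typing import List, Tuple


def write_pairs(path: str, pairs: List[Tuple[str, str]]) -> None:
    """Write one query pair per line, separated by a single tab."""
    with open(path, "w", encoding="utf-8") as f:
        for q1, q2 in pairs:
            f.write(f"{q1}\t{q2}\n")


def read_pairs(path: str) -> List[Tuple[str, str]]:
    """Read the file back, skipping blank lines."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            q1, q2 = line.split("\t")
            pairs.append((q1, q2))
    return pairs


tmp_dir = tempfile.mkdtemp()
predict_file = os.path.join(tmp_dir, "predict.tsv")
write_pairs(predict_file, [("今天天气怎么样", "今天温度怎么样")])
loaded = read_pairs(predict_file)
```

The resulting path is what would be passed as the `--file` argument in the command quoted above.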

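Training loss decreases normally, validation loss decreases slowly

The training set is oppo_round1_train_20180929.txt
The validation set is oppo_round1_vali_20180929.txt
Has anyone run into a similar situation?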

More details on the dataset format?

🚨 Potential Deserialization of Untrusted Data

👋 Hello, @InsaneLife - a potential high severity Deserialization of Untrusted Data vulnerability in your repository has been disclosed to us.

Next Steps

1️⃣ Visit https://huntr.dev/bounties/1-other-InsaneLife/dssm for more advisory information.

2️⃣ Sign-up to validate or speak to the researcher for more assistance.

3️⃣ Propose a patch or outsource it to our community - whoever fixes it gets paid.

Confused or need more help?

Join us on our Discord and a member of our team will be happy to help! 🤗
Speak to a member of our team: @JamieSlome

This issue was automatically generated by huntr.dev - a bug bounty board for securing open source code.

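Does the loss function definition involve only positive samples?

Hello, I looked at the part of the code that defines the loss function. It first computes the cosine between the query and the out_embedding of the positive and negative samples, then applies a softmax, and uses only the probability of the positive sample. Why not also negate the probabilities of the negative samples and add them in?

With your loss definition, the negative-sample inputs could be dropped entirely.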
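
One way to check this claim is with a small numeric sketch. It assumes the loss is the negative log of the positive sample's softmax probability over the positive and negative scores, as described in the question; `positive_softmax_loss` is a hypothetical name and the scores are made up.

```python
import math
from typing import List


def positive_softmax_loss(pos_score: float, neg_scores: List[float]) -> float:
    """-log(softmax probability of the positive) over [positive] + negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    # log-sum-exp of all scores, shifted by the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


# Same positive score, different negatives (illustrative values only).
loss_easy = positive_softmax_loss(0.9, [-0.5, -0.4, -0.6, -0.3])
loss_hard = positive_softmax_loss(0.9, [0.8, 0.7, 0.85, 0.6])
loss_no_neg = positive_softmax_loss(0.9, [])
```

Under that assumption the negative scores enter through the softmax denominator: higher negative scores raise the loss, and with no negatives at all the loss is identically zero, so the negatives are not redundant.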

Is there a bug in the code?

dssm/dssm_rnn.py

Line 78 in eefe42e

stacked_gru_bw = tf.contrib.rnn.MultiRNNCell([cell_fw], state_is_tuple=True)

This line appears to be wrong; the bw cell is never used.

Why my train loss equal to nan?

Hi,
When I run dssm_rnn.py, the train loss always shows nan. Changing the learning rate makes no difference.
I printed out the variables in the model, and the embedding variable in word_embeddings_layer shows nan for the first time.
How should I deal with it? Thanks!

How well does it perform?

Thanks!

How do I run prediction?

Author, how do I run prediction with this?