zhanlaoban / eda_nlp_for_chinese
An implementation of the EDA paper for Chinese corpora: an EDA data augmentation tool for Chinese text. NLP data augmentation. Paper reading notes.
No such file or directory: 'stopwords/HIT_stop_words.txt'. Do I have to download it myself?
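One possible cause, offered as a guess rather than a diagnosis: if the script opens the stop-word list with a path relative to the current working directory, running it from a different folder makes 'stopwords/HIT_stop_words.txt' unresolvable even when the file exists in the repository. The sketch below resolves the file against an explicit base directory instead. `load_stopwords` and its empty-set fallback are illustrative names and choices, not the repository's code.

```python
from pathlib import Path

def load_stopwords(base_dir, relative_path="stopwords/HIT_stop_words.txt"):
    """Read one stop word per line; return an empty set if the file is missing."""
    path = Path(base_dir) / relative_path
    if not path.is_file():
        # Hypothetical fallback: continue without stop words instead of crashing.
        print(f"warning: {path} not found, continuing with no stop words")
        return set()
    with path.open(encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}
```

Inside a script, `base_dir` could be derived from `Path(__file__).resolve()` so that the lookup no longer depends on where the command is run from.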
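"The experimental result is that the latent-space representations of the augmented sentences cluster tightly around those of the original sentences. The authors conclude that if several words in a sentence are changed, the sentence's original class label may no longer be valid." If the representations sit tightly around the original representations, shouldn't the sentences be semantically close? In that case the original class label should still be valid, shouldn't it?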
Why does the random seed have no effect, and how can I get the same output on every run?
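One common reason a seed appears to do nothing, offered as a possibility and not as a confirmed diagnosis of this repository: `random.seed` fixes the random number stream, but if candidates are drawn from a `set` of strings, their iteration order can change between interpreter runs because string hashing is randomized unless `PYTHONHASHSEED` is set. A minimal sketch of a reproducible pick, with `pick_words` as a hypothetical helper:

```python
import random

def pick_words(words, n, seed=0):
    """Pick up to n distinct words reproducibly for a given seed."""
    rng = random.Random(seed)        # private generator, unaffected by other code
    candidates = sorted(set(words))  # sorting removes the dependence on set order
    rng.shuffle(candidates)
    return candidates[:n]
```

If synonyms come from an external library, its result order may also need to be sorted or otherwise fixed before sampling.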
I see that your script, python code/augment.py --input=train.txt --output=train_augmented.txt --num_aug=16 --alpha=0.05, uses a single alpha value shared by all operations.
But I want to set a different alpha for each operation. What should I do?
alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, alpha_rd=alpha
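A minimal sketch of one way to do this, assuming the augmentation function accepts the four keyword arguments shown in the line above. The per-operation flags (`--alpha_sr`, `--alpha_ri`, `--alpha_rs`, `--alpha_rd`), the default of 0.1, and the `parse_alphas` helper are hypothetical; they are not options the script is known to provide.

```python
import argparse

OPS = ("sr", "ri", "rs", "rd")

def parse_alphas(argv=None):
    """Return one alpha per operation, falling back to the shared --alpha."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--alpha", type=float, default=0.1)  # hypothetical default
    for op in OPS:
        # Hypothetical per-operation override; None means "use --alpha".
        parser.add_argument(f"--alpha_{op}", type=float, default=None)
    args = parser.parse_args(argv)
    alphas = {}
    for op in OPS:
        override = getattr(args, f"alpha_{op}")
        alphas[f"alpha_{op}"] = args.alpha if override is None else override
    return alphas
```

The resulting dictionary could then be unpacked into the call, for example `eda(sentence, num_aug=num_aug, **alphas)`, provided the function's signature matches.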
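I'd like to ask everyone: after using this method, did you see any improvement on your task? Thanks~
Doesn't the repo owner find that the generated sentences are all disfluent?
I see that each line of data starts with a 0/1 label. I only want the augmented data, for machine translation. Can I leave the labels out, or just use sequence numbers such as 0, 1, 2, 3, 4, 5, 6?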
正在使用EDA生成增强语句... (Generating augmented sentences with EDA...)
Traceback (most recent call last):
  File "C:\Users\HP-OMEN\Desktop\project\code\EDA_NLP_for_Chinese-master\EDA_NLP_for_Chinese-master\code\augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "C:\Users\HP-OMEN\Desktop\project\code\EDA_NLP_for_Chinese-master\EDA_NLP_for_Chinese-master\code\augment.py", line 44, in gen_eda
    sentence = parts[1]
IndexError: list index out of range
The generated output.txt file contains:
0 今天天气 很棒 哦 。
0 今天天气 不错 哦 。
0 哟 不错 哦 。
0 喔 不错 哦 。
0 今天天气 哈哈哈 不错 哦 。
0 今天天气 不错 吧 哦 。
0 今天天气 不错 哦
0 今天天气 不错 哦
0 今天天气 不错 。 哦
0 yoi 不错 哦 。
0 今天天气 不错 哦 呵呵 。
0 今儿个 今天天气 不错 哦 。
0 今天天气 很棒 哦 。
0 。 不错 哦 今天天气
0 今天天气 不错 哦 。
0 今天天气 不错 。 哦
0 今天天气 不错 哦 。
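The IndexError in the traceback above is what `parts[1]` raises when a line yields fewer than two fields, for example a blank line or a line with no label. Below is a minimal sketch of a more tolerant parser, assuming each input line is meant to hold a label and a sentence separated by a tab. `split_line` and the placeholder label "0" are hypothetical, not part of the repository.

```python
def split_line(line, default_label="0"):
    """Return (label, sentence), or None for a blank line."""
    line = line.rstrip("\n")
    if not line.strip():
        return None                    # skip blank lines
    parts = line.split("\t", 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return default_label, parts[0]     # no tab: treat the whole line as the sentence
```

The same fallback would let unlabeled input, such as a machine translation corpus, pass through with a placeholder label that can be dropped afterwards.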
n_sr = max(1, int(alpha_sr * num_words))
n_ri = max(1, int(alpha_ri * num_words))
n_rs = max(1, int(alpha_rs * num_words))
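Why is the value compared with 1? Doesn't that mean at most one word can be replaced, deleted, or inserted? (I'm not sure I've understood this correctly, so please correct me!)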
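For reference, `max(1, ...)` sets a lower bound, not an upper bound: when `int(alpha * num_words)` rounds down to 0 for a short sentence, one change is still made, and longer sentences get proportionally more. A quick check with illustrative values (`n_changes` is just a name for the expression quoted above):

```python
def n_changes(alpha, num_words):
    # Same expression as in the quoted lines: at least one change per sentence.
    return max(1, int(alpha * num_words))

for num_words in (3, 8, 40):
    print(num_words, n_changes(0.25, num_words))
# prints:
# 3 1
# 8 2
# 40 10
```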
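I just skimmed the code, and it looks almost entirely the same as the original paper authors' code. Did you swap in the HIT (Harbin Institute of Technology) synonym thesaurus? Are there any other novel contributions? Thanks; I haven't read it carefully.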
@zhanlaoban What is the license type of this repo? Could other projects use it? thx
raise Exception("SYNONYMS_DL_LICENSE is not in Environment variables, check out Installation Guide on https://github.com/chatopera/Synonyms")
Exception: SYNONYMS_DL_LICENSE is not in Environment variables, check out Installation Guide on https://github.com/chatopera/Synonyms
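Which Chinese datasets did the author use for testing?
Hello, I have also read through the 9 issues and the EDA code in the repo.
But as far as I can see, the label is only read in and written back out; I couldn't find any specific role it plays in rs or the other operations.
If I have misunderstood, I hope you can let me know.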
I only need noun replacement. Which part of the code should I change?
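A minimal sketch of one way to restrict replacement to nouns, assuming the words have already been tagged with jieba-style part-of-speech flags, where noun flags begin with "n" (for example, the word and flag pairs that `jieba.posseg` produces). `replace_nouns` and the `synonyms_for` callback are hypothetical and are not functions from this repository.

```python
import random

def replace_nouns(tagged, synonyms_for, n, rng=None):
    """Replace up to n nouns in a list of (word, flag) pairs; return the words."""
    rng = rng or random.Random()
    words = [word for word, _ in tagged]
    # Only positions whose POS flag starts with "n" are eligible.
    noun_positions = [i for i, (_, flag) in enumerate(tagged) if flag.startswith("n")]
    rng.shuffle(noun_positions)
    replaced = 0
    for i in noun_positions:
        if replaced >= n:
            break
        candidates = synonyms_for(words[i])
        if candidates:                 # skip nouns with no known synonym
            words[i] = rng.choice(candidates)
            replaced += 1
    return words
```

In the existing code, the equivalent change would be wherever the candidate words for synonym replacement are collected: filter that list by part of speech before sampling from it.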