Emotional Analysis of Internet News (互联网新闻情感分析)
With the rise of social platforms, users generate ever more content online, producing large volumes of text such as news, Weibo posts, and blogs. This vast body of emotionally expressive text holds potential value that can be mined to serve people. As a result, sentiment analysis has drawn close attention from researchers in computational linguistics in recent years and has become a fundamental, actively studied research task.
The goal of this task is to accurately determine the sentiment polarity of texts in a large dataset, where sentiment falls into three classes: positive, neutral, and negative. Given the vast sea of news, accurately identifying the sentiment it carries is important for effectively monitoring, giving early warning of, and guiding public opinion, and for the healthy development of the public-opinion ecosystem.
Participants classify the sentiment polarity of the news data we provide: positive is labeled 0, neutral 1, and negative 2. Using the training data we provide, your algorithm or model should predict the sentiment polarity of the news in the test set.
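1. Adapted from this open-source code.
2. The model cuts the text into k segments, feeds each segment into the language model separately, and joins the outputs with a GRU on top. The advantage is that a smaller max_length with a larger k reduces GPU memory usage, since memory grows quadratically with sequence length but only linearly with k.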
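The memory argument in point 2 can be illustrated with a small sketch. It assumes the text is cut into consecutive chunks and that self-attention memory scales as segment count times segment length squared. The helper names `split_tokens` and `attention_cost` are hypothetical and not taken from the repository, whose actual splitting logic may differ.

```python
def split_tokens(tokens, max_seq_length, split_num):
    """Cut a token list into split_num consecutive chunks of at most max_seq_length.

    Hypothetical helper: the repository's own splitting rule may differ.
    """
    chunks = []
    for i in range(split_num):
        start = i * max_seq_length
        chunks.append(tokens[start:start + max_seq_length])
    return chunks


def attention_cost(max_seq_length, split_num):
    """Relative self-attention memory: quadratic in length, linear in segment count."""
    return split_num * max_seq_length ** 2


tokens = list(range(512))  # stand-in for 512 token ids
chunks = split_tokens(tokens, max_seq_length=128, split_num=4)
print([len(c) for c in chunks])  # [128, 128, 128, 128]

# Same total length (512 tokens), two configurations:
# one segment of 512 versus four segments of 128.
print(attention_cost(512, 1) / attention_cost(128, 4))  # 4.0
```

Under this simplified cost model, covering the same 512 tokens with four segments of 128 needs about a quarter of the attention memory of a single 512-token segment.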
Model | Online F1 |
---|---|
Bert-base | 80.3 |
Bert-wwm-ext | 80.5 |
XLNet-base | 79.25 |
XLNet-mid | 79.6 |
XLNet-large | -- |
Roberta-mid | 80.5 |
Roberta-large (max_seq_length=512, split_num=1) | 81.25 |
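Notes:
1) Effective length = max_seq_length * split_num
2) Effective batch size = per_gpu_train_batch_size * number of GPUs
3) The results above were obtained on 4 GPUs, so the batch size is 4. With only 1 GPU, set per_gpu_train_batch_size to 4 and use a smaller max_length.
4) If GPU memory is too small, set the gradient_accumulation_steps parameter. For example, with gradient_accumulation_steps=2 and batch size=4, each update runs 2 passes with batch size 2 and accumulates the gradients before updating. This is equivalent to batch size=4 but takes about twice as long. The number of iterations must also be doubled accordingly, i.e. set train_steps to 10000.
The actual batch size can be checked in the run-time log, for example: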
09/06/2019 21:03:41 - INFO - main - ***** Running training *****
09/06/2019 21:03:41 - INFO - main - Num examples = 5872
09/06/2019 21:03:41 - INFO - main - Batch size = 4
09/06/2019 21:03:41 - INFO - main - Num steps = 5000
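The batch-size arithmetic in notes 2) to 4) can be checked with a toy example. This is a minimal sketch in plain Python, not the training script itself: the one-parameter least-squares model, the data, and the helper names `effective_batch_size` and `mean_gradient` are all assumptions made for illustration.

```python
def effective_batch_size(per_gpu_train_batch_size, n_gpu):
    """Note 2: effective batch size = per-GPU batch size * number of GPUs."""
    return per_gpu_train_batch_size * n_gpu


def mean_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over the batch, with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


# 4 GPUs with 1 example each, or 1 GPU with 4 examples: same effective batch size.
print(effective_batch_size(1, 4), effective_batch_size(4, 1))  # 4 4

# Toy data (hypothetical) for one batch of size 4.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.5
full = mean_gradient(w, batch)

# Gradient accumulation: 2 passes with batch size 2, each scaled by 1/accum_steps.
accum_steps = 2
micro = len(batch) // accum_steps
accumulated = 0.0
for i in range(accum_steps):
    micro_batch = batch[i * micro:(i + 1) * micro]
    accumulated += mean_gradient(w, micro_batch) / accum_steps

print(full, accumulated)  # -11.5 -11.5
```

The accumulated gradient matches the full-batch gradient, which is why the update is equivalent to batch size 4 even though each pass only holds 2 examples in memory.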
See this website for details of the competition task.
Download the dataset from this website and extract it into the ./data directory.
cd data
python preprocess.py
cd ..
bash run_bert.sh
# average over the 5 folds
python combine.py --model_prefix ./model_bert --out_path ./sub.csv
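The fold-averaging step can be sketched as follows. The internals of combine.py are not shown here, so this is only an assumption about what averaging over 5 folds means: per-class probabilities from each fold are averaged and the highest-scoring class is taken. The function names and the numbers are hypothetical.

```python
def average_folds(fold_probs):
    """Average class probabilities across folds.

    fold_probs[f][i] holds the class probabilities for example i from fold f.
    """
    n_folds = len(fold_probs)
    averaged = []
    for per_example in zip(*fold_probs):  # group the folds' outputs per example
        n_classes = len(per_example[0])
        averaged.append(
            [sum(p[c] for p in per_example) / n_folds for c in range(n_classes)]
        )
    return averaged


def predict_labels(avg_probs):
    """Pick the class index with the highest averaged probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in avg_probs]


# Hypothetical outputs of 5 folds for 2 test examples
# (classes: 0 positive, 1 neutral, 2 negative).
fold_probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.4, 0.5, 0.1], [0.2, 0.3, 0.5]],
    [[0.7, 0.2, 0.1], [0.1, 0.4, 0.5]],
    [[0.3, 0.6, 0.1], [0.3, 0.3, 0.4]],
]
labels = predict_labels(average_folds(fold_probs))
print(labels)  # [0, 2]
```

Averaging smooths out disagreement between folds: individual folds vote for different classes on both examples, but the averaged scores give a single label each.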
Download the PyTorch weights from this website and extract them into the chinese_wwm_ex_bert directory: https://github.com/ymcui/Chinese-BERT-wwm
bash run_bert_wwm_ext.sh
python combine.py --model_prefix ./model_bert_wwm_ext --out_path ./sub.csv
Download the PyTorch weights from this website and extract them into the ./chinese_xlnet_mid/ directory: https://github.com/ymcui/Chinese-PreTrained-XLNet
bash run_xlnet.sh
python combine.py --model_prefix ./model_xlnet --out_path ./sub.csv
Download the TensorFlow weights from this website and extract them into the ./chinese_roberta/ directory: https://github.com/brightmart/roberta_zh
mv chinese_roberta/bert_config_middle.json chinese_roberta/config.json
python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch --tf_checkpoint_path chinese_roberta/ --bert_config_file chinese_roberta/config.json --pytorch_dump_path chinese_roberta/pytorch_model.bin
bash run_roberta.sh
python combine.py --model_prefix ./model_roberta --out_path ./sub.csv