
sentiment-analysis-of-internet-news's Introduction

Competition Title

Sentiment Analysis of Internet News

Background

With the rise of social platforms, users generate a growing volume of text online, including news, Weibo posts, and blogs. This large body of emotionally expressive text has potential value that is worth mining. As a result, sentiment analysis has drawn close attention from researchers in computational linguistics in recent years and has become a fundamental, widely studied task.

The goal of this competition is to accurately classify the sentiment polarity of text in a large dataset, where sentiment falls into three classes: positive, neutral, and negative. Given the vast amount of news, accurately identifying the sentiment it carries matters for effective monitoring, early warning, and guidance of public opinion, and for the healthy development of the public opinion ecosystem.

Task

Participants classify the sentiment polarity of the news data we provide: positive sentiment corresponds to 0, neutral to 1, and negative to 2. Using the training data we provide, your algorithm or model should predict the sentiment polarity of the news in the test set.
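The label encoding above can be written down directly. The sketch below also scores predictions with a macro-averaged F1; the README only reports "F1" and does not say which averaging the leaderboard uses, so macro averaging is an assumption here, and `LABELS` and `macro_f1` are illustrative names rather than part of the repository.

```python
# Label ids as stated in the task description.
LABELS = {"positive": 0, "neutral": 1, "negative": 2}


def macro_f1(y_true, y_pred, num_classes=3):
    """Unweighted mean of per-class F1 (assumed metric, not confirmed by the source)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted examples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy example: one negative article is misclassified as neutral.
y_true = [LABELS["positive"], LABELS["neutral"], LABELS["negative"], LABELS["neutral"]]
y_pred = [LABELS["positive"], LABELS["neutral"], LABELS["neutral"], LABELS["neutral"]]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Because every class counts equally under macro averaging, missing the only negative example pulls the score down sharply even though three of four predictions are correct.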

Solution

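1. Adapted from this open-source code.

2. The model cuts the text into k segments, feeds each segment into the language model separately, and then joins the segment outputs with a GRU on top. This makes it possible to lower GPU memory usage by choosing a small max_length and a larger k, because memory usage grows quadratically with sequence length but only linearly with k.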
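The memory argument can be illustrated with a small sketch. It assumes contiguous, non-overlapping segments (the repository may split differently) and uses `split_num * max_seq_length ** 2` as a rough proxy for self-attention memory; both helper functions are hypothetical names, and the GRU that joins the segment outputs is not shown.

```python
def split_into_segments(tokens, max_seq_length, split_num):
    """Cut a token list into split_num consecutive segments.

    Each segment holds at most max_seq_length tokens; anything beyond
    max_seq_length * split_num is dropped (an assumption of this sketch).
    """
    tokens = tokens[: max_seq_length * split_num]
    return [
        tokens[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(split_num)
    ]


def relative_attention_cost(max_seq_length, split_num):
    """Proxy for attention memory: quadratic in length, linear in split_num."""
    return split_num * max_seq_length ** 2


tokens = list(range(1000))  # stand-in for token ids of a long article
segments = split_into_segments(tokens, max_seq_length=128, split_num=4)

# Both settings cover 512 tokens in total.
one_long = relative_attention_cost(max_seq_length=512, split_num=1)
four_short = relative_attention_cost(max_seq_length=128, split_num=4)
ratio = one_long / four_short
print(len(segments), one_long, four_short, ratio)
```

Under this proxy, four segments of 128 tokens cost a quarter of one 512-token segment while covering the same amount of text.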

| Model | Online F1 |
| Bert-base | 80.3 |
| Bert-wwm-ext | 80.5 |
| XLNet-base | 79.25 |
| XLNet-mid | 79.6 |
| XLNet-large | -- |
| Roberta-mid | 80.5 |
| Roberta-large (max_seq_length=512, split_num=1) | 81.25 |

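Notes:

1) Effective length = max_seq_length * split_num

2) Effective batch size = per_gpu_train_batch_size * number of GPUs

3) The results above were obtained on 4 GPUs, so the batch size was 4. With only 1 GPU, set per_gpu_train_batch_size to 4 and use a smaller max_length.

4) If GPU memory is too small, set the gradient_accumulation_steps parameter. For example, with gradient_accumulation_steps=2 and batch size=4, each update runs 2 passes with a batch size of 2 and applies the accumulated gradients, which is equivalent to batch size=4 but about twice as slow. The number of iterations must also be doubled, i.e. set train_steps to 10000.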
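The arithmetic in these notes can be sketched as a small helper. `effective_settings` is a hypothetical function, not part of the training scripts; the parameter names mirror the ones used in the notes, the default of 5000 base steps is taken from the doubling to 10000 in note 4, and the `max_seq_length=256, split_num=2` values in the example are made up for illustration.

```python
def effective_settings(max_seq_length, split_num, per_gpu_train_batch_size,
                       n_gpu, gradient_accumulation_steps=1,
                       base_train_steps=5000):
    """Derive the effective values described in notes 1, 2 and 4."""
    batch_size = per_gpu_train_batch_size * max(1, n_gpu)
    if batch_size % gradient_accumulation_steps != 0:
        raise ValueError("batch size must be divisible by gradient_accumulation_steps")
    return {
        # Note 1: total number of tokens the model sees per example.
        "effective_length": max_seq_length * split_num,
        # Note 2: batch size summed over all GPUs.
        "batch_size": batch_size,
        # Note 4: examples processed in each forward/backward pass.
        "micro_batch_size": batch_size // gradient_accumulation_steps,
        # Note 4: iterations scale with the accumulation factor.
        "train_steps": base_train_steps * gradient_accumulation_steps,
    }


# Single-GPU setup with gradient accumulation, as in note 4.
single_gpu = effective_settings(
    max_seq_length=256, split_num=2,
    per_gpu_train_batch_size=4, n_gpu=1,
    gradient_accumulation_steps=2,
)
print(single_gpu)
```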

The actual batch size can be checked in the runtime log, for example:

09/06/2019 21:03:41 - INFO - __main__ - ***** Running training *****
09/06/2019 21:03:41 - INFO - __main__ - Num examples = 5872
09/06/2019 21:03:41 - INFO - __main__ - Batch size = 4
09/06/2019 21:03:41 - INFO - __main__ - Num steps = 5000
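The equivalence claimed for gradient accumulation can be checked numerically. The sketch below is a toy illustration, not the repository's training loop: it uses a one-parameter model `y = w * x` with mean squared error, and all data values and names are made up. It compares the gradient of one batch of 4 with the gradient accumulated over 2 micro-batches of 2, where each micro-batch gradient is scaled by `1 / accumulation_steps`.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5
accumulation_steps = 2
micro = len(xs) // accumulation_steps  # micro-batch size of 2

# One pass over the full batch of 4.
full_grad = grad_mse(w, xs, ys)

# Two passes over micro-batches of 2, accumulating scaled gradients.
accumulated = 0.0
for i in range(accumulation_steps):
    chunk_x = xs[i * micro:(i + 1) * micro]
    chunk_y = ys[i * micro:(i + 1) * micro]
    accumulated += grad_mse(w, chunk_x, chunk_y) / accumulation_steps

print(math.isclose(full_grad, accumulated))
```

The two gradients agree because the micro-batches are equally sized, so the mean of their means equals the mean over the full batch. The cost is one extra forward and backward pass per update.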

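Competition Description

See this website for details about the competition.

Download the Dataset

Download the dataset from this website and extract it into the ./data directory.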

Data Preprocessing

cd data
python preprocess.py
cd ..

Bert-base Model

bash run_bert.sh
# Average the predictions of the 5 folds
python combine.py --model_prefix ./model_bert --out_path ./sub.csv
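The internals of combine.py are not shown here, so the following is only a sketch of what averaging fold predictions typically looks like: average the per-class probabilities from each fold's model, then take the most probable class. `average_folds`, `predict_labels`, and the toy probabilities are all hypothetical, and three folds are used instead of five to keep the example short.

```python
def average_folds(fold_probs):
    """Average class probabilities across folds.

    fold_probs[f][i][c] is the probability of class c for example i
    predicted by the model trained on fold f.
    """
    n_folds = len(fold_probs)
    n_examples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    return [
        [sum(fold[i][c] for fold in fold_probs) / n_folds for c in range(n_classes)]
        for i in range(n_examples)
    ]


def predict_labels(avg_probs):
    """Pick the class with the highest averaged probability per example."""
    return [max(range(len(row)), key=row.__getitem__) for row in avg_probs]


# Toy predictions for 2 examples over 3 classes (0, 1, 2).
fold_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
fold_b = [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
fold_c = [[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]

averaged = average_folds([fold_a, fold_b, fold_c])
labels = predict_labels(averaged)
print(labels)
```

Averaging probabilities rather than voting on hard labels lets a confident fold outweigh uncertain ones, which is one common reason to prefer it.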

Bert Whole Word Masking Model

Download the PyTorch weights from this website and extract them into the chinese_wwm_ex_bert directory: https://github.com/ymcui/Chinese-BERT-wwm

bash run_bert_wwm_ext.sh
python combine.py --model_prefix ./model_bert_wwm_ext --out_path ./sub.csv

XLNet-mid Model

Download the PyTorch weights from this website and extract them into the ./chinese_xlnet_mid/ directory: https://github.com/ymcui/Chinese-PreTrained-XLNet

bash run_xlnet.sh
python combine.py --model_prefix ./model_xlnet --out_path ./sub.csv

Roberta-mid Model

Download the TensorFlow weights from this website and extract them into the ./chinese_roberta/ directory: https://github.com/brightmart/roberta_zh

mv chinese_roberta/bert_config_middle.json chinese_roberta/config.json
python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch --tf_checkpoint_path chinese_roberta/ --bert_config_file chinese_roberta/config.json --pytorch_dump_path chinese_roberta/pytorch_model.bin
bash run_roberta.sh
python combine.py --model_prefix ./model_roberta --out_path ./sub.csv

sentiment-analysis-of-internet-news's People

Contributors

ryan0v0
