CENMS-Dataset

CENMS: the first Chinese Emergency News Multi-document Summarization Dataset

Introduction

Multi-document summarization (MDS) is a fundamental natural language processing application. However, MDS suffers from a lack of datasets with credible references for multiple topic-related documents, and models trained in one domain do not generalize to others.

The main causes of this problem are the high cost of constructing human-written references, which requires analyzing multiple documents with domain knowledge, and the difficulty of collecting related documents with low redundancy. To this end, our research proposes an automatic method for MDS dataset construction with domain-aware strategies and builds the CENMS dataset, taking emergency news as an example.

The result is CENMS, the first Chinese Emergency News Multi-document Summarization dataset.

Properties

CENMS contains more than 20K summary clusters and covers four major categories: natural disaster, accident, public health, and social security. The four major categories are divided into 39 sub-topics, from COVID-19 to earthquake.

Number of clusters per sub-topic:
Blizzard 30
Other Disasters 7
Earthquake 1560
Drought 20
Flood 673
Fog 149
Forest Fire 16
Ice storm 35
Landslide 549
Mudslide 367
Rainstorm 51
Sand Storm 78
Thunderstroke 120
Tornado 129
Tsunami 15
Typhoon 1764
Air Crash 125
Collapse 121
Explosion 1374
Fire Crash 2768
Gas-leak 6
Nuclear Leak 3
Shipwreck 9
Traffic Accident 274
COVID-19 8209
Dengue 33
Avian Influenza 127
Ebola 107
MERS 55
African Swine fever 224
HIV 46
Food-poisoning 10
Zika Virus 20
Pandemic 864
Arson 34
Other Crimes 156
Terrorist 100
Drugs 859
Fraud 59
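
As a quick sanity check on the figures above, the following Python sketch tallies the listed counts and reports the largest sub-topics. It assumes that each number is a count of summary clusters for that sub-topic (the list itself does not say so explicitly); the dictionary name and the `summarize` helper are illustrative and not part of the dataset.

```python
# Counts copied verbatim from the sub-topic list above.
# Assumption: each number is the number of summary clusters in that sub-topic.
SUBTOPIC_COUNTS = {
    "Blizzard": 30, "Other Disasters": 7, "Earthquake": 1560, "Drought": 20,
    "Flood": 673, "Fog": 149, "Forest Fire": 16, "Ice storm": 35,
    "Landslide": 549, "Mudslide": 367, "Rainstorm": 51, "Sand Storm": 78,
    "Thunderstroke": 120, "Tornado": 129, "Tsunami": 15, "Typhoon": 1764,
    "Air Crash": 125, "Collapse": 121, "Explosion": 1374, "Fire Crash": 2768,
    "Gas-leak": 6, "Nuclear Leak": 3, "Shipwreck": 9, "Traffic Accident": 274,
    "COVID-19": 8209, "Dengue": 33, "Avian Influenza": 127, "Ebola": 107,
    "MERS": 55, "African Swine fever": 224, "HIV": 46, "Food-poisoning": 10,
    "Zika Virus": 20, "Pandemic": 864, "Arson": 34, "Other Crimes": 156,
    "Terrorist": 100, "Drugs": 859, "Fraud": 59,
}


def summarize(counts, top_n=5):
    """Return the total count and the top_n largest (name, count) pairs."""
    total = sum(counts.values())
    largest = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return total, largest


if __name__ == "__main__":
    total, largest = summarize(SUBTOPIC_COUNTS)
    print(f"{len(SUBTOPIC_COUNTS)} sub-topics, {total} entries in total")
    for name, n in largest:
        # Show each sub-topic's share of the total alongside its raw count.
        print(f"{name:<20} {n:>6} {n / total:6.1%}")
```

The listed counts cover 39 sub-topics and sum to 21,146, which agrees with the "more than 20K" figure. The distribution is heavily skewed: COVID-19 alone accounts for a bit under 40% of the entries, which may be worth keeping in mind when sampling or evaluating per sub-topic.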

Samples

A selection of samples from the dataset is available in samples.csv.
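
The column layout of samples.csv is not described here, so the sketch below only peeks at the header row and the first few records without assuming any column names. The `peek_csv` helper is hypothetical, and the `utf-8-sig` encoding is an assumption (it reads UTF-8 files with or without a byte-order mark); adjust it if the file uses a different encoding.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=3, encoding="utf-8-sig"):
    """Return (header, first n_rows data rows) of a CSV file."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])            # empty list if the file is empty
        rows = list(islice(reader, n_rows))  # read at most n_rows records
    return header, rows


if __name__ == "__main__":
    path = Path("samples.csv")  # file name taken from the README
    if path.exists():
        header, rows = peek_csv(path)
        print("columns:", header)
        for row in rows:
            # Truncate long cells (documents, summaries) for display.
            print([cell[:40] for cell in row])
    else:
        print(f"{path} not found; download it from the repository first")
```

Inspecting the header first makes it easier to decide which columns hold the source documents and which hold the reference summary before loading the file into a larger pipeline.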

Download

The corpus is split into three parts: training, validation, and test sets. If you need CENMS for further research, please send a request by e-mail to [email protected].

Contributors

adamlau90
