CENMS-Dataset CENMS: the first Chinese Emergency News Multi-document Summarization Dataset
Introduction Multi-document summarization (MDS) is a fundamental natural language processing application. However, MDS suffers from a lack of datasets with credible reference summaries for multiple topic-related documents, and models trained in one domain do not generalize well to others.
The main causes are the high cost of having humans write reference summaries, which requires analyzing multiple documents with domain knowledge, and the difficulty of collecting related documents with low redundancy. To this end, we propose an automatic method for MDS dataset construction with domain-aware strategies and build the CENMS dataset, taking emergency news as an example.
We introduce CENMS, the first Chinese Emergency News Multi-document Summarization dataset.
Properties CENMS contains more than 20K summary clusters and covers four major categories: natural disasters, accidents, public health, and social security. The four major categories are divided into 39 sub-topics, ranging from COVID-19 to earthquakes.
Sub-topic | Number of clusters |
---|---|
Blizzard | 30 |
Other Disasters | 7 |
Earthquake | 1560 |
Drought | 20 |
Flood | 673 |
Fog | 149 |
Forest Fire | 16 |
Ice storm | 35 |
Landslide | 549 |
Mudslide | 367 |
Rainstorm | 51 |
Sand Storm | 78 |
Thunderstroke | 120 |
Tornado | 129 |
Tsunami | 15 |
Typhoon | 1764 |
Air Crash | 125 |
Collapse | 121 |
Explosion | 1374 |
Fire Crash | 2768 |
Gas-leak | 6 |
Nuclear Leak | 3 |
Shipwreck | 9 |
Traffic Accident | 274 |
COVID-19 | 8209 |
Dengue | 33 |
Avian Influenza | 127 |
Ebola | 107 |
MERS | 55 |
African Swine fever | 224 |
HIV | 46 |
Food-poisoning | 10 |
Zika Virus | 20 |
Pandemic | 864 |
Arson | 34 |
Other Crimes | 156 |
Terrorist | 100 |
Drugs | 859 |
Fraud | 59 |
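The counts in the table can be tallied to check the overall size and to see how unevenly the sub-topics are distributed. The following is a minimal sketch, assuming each number is the count of summary clusters for that sub-topic; the names `COUNTS` and `top_subtopics` are illustrative and not part of the dataset.

```python
# Sub-topic counts copied verbatim from the table above.
# Assumption: each value is the number of summary clusters in that sub-topic.
COUNTS = {
    "Blizzard": 30, "Other Disasters": 7, "Earthquake": 1560, "Drought": 20,
    "Flood": 673, "Fog": 149, "Forest Fire": 16, "Ice storm": 35,
    "Landslide": 549, "Mudslide": 367, "Rainstorm": 51, "Sand Storm": 78,
    "Thunderstroke": 120, "Tornado": 129, "Tsunami": 15, "Typhoon": 1764,
    "Air Crash": 125, "Collapse": 121, "Explosion": 1374, "Fire Crash": 2768,
    "Gas-leak": 6, "Nuclear Leak": 3, "Shipwreck": 9, "Traffic Accident": 274,
    "COVID-19": 8209, "Dengue": 33, "Avian Influenza": 127, "Ebola": 107,
    "MERS": 55, "African Swine fever": 224, "HIV": 46, "Food-poisoning": 10,
    "Zika Virus": 20, "Pandemic": 864, "Arson": 34, "Other Crimes": 156,
    "Terrorist": 100, "Drugs": 859, "Fraud": 59,
}

total = sum(COUNTS.values())


def top_subtopics(counts, k=5):
    """Return the k largest sub-topics as (name, count, share of total)."""
    grand_total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, n, n / grand_total) for name, n in ranked[:k]]


print(f"{len(COUNTS)} sub-topics, {total} entries in total")
for name, n, share in top_subtopics(COUNTS):
    print(f"{name:<20} {n:>6} {share:6.1%}")
```

The table sums to 21,146 across 39 sub-topics, which is consistent with the "more than 20K" figure above. COVID-19 alone accounts for well over a third of the entries, so per-sub-topic evaluation may be worth considering alongside aggregate scores.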
Samples A few samples selected from the dataset are provided in samples.csv.
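To inspect the sample file, a short script along the following lines may help. It is only a sketch: the column layout of samples.csv is not documented here, so the code reads whatever header the file has instead of assuming field names. The helper name `preview_csv` and the UTF-8 encoding are assumptions.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=3, encoding="utf-8-sig"):
    """Return (column names, first n_rows rows as dicts) from a CSV file.

    Assumption: the file is UTF-8 encoded (a leading BOM is tolerated)
    and its first line is a header row.
    """
    with Path(path).open(newline="", encoding=encoding) as f:
        reader = csv.DictReader(f)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows


if __name__ == "__main__":
    sample_path = Path("samples.csv")
    if sample_path.exists():
        columns, rows = preview_csv(sample_path)
        print("columns:", columns)
        for row in rows:
            # Truncate long fields so each row stays readable on one screen.
            print({key: (value or "")[:40] for key, value in row.items()})
    else:
        print("samples.csv not found in the current directory")
```

If the file turns out to use a different encoding (for example GB18030, which is common for Chinese text), pass it through the `encoding` argument.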
Download The corpus is split into training, validation, and test sets. If you need CENMS for further research, please send a request by e-mail to [email protected].