juand-r / entity-recognition-datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

License: MIT License

Python 99.16% Shell 0.84%
entity-extraction named-entity-recognition ner datasets entity-recognition nlp-resources nlp corpora natural-language-processing annotations

Datasets for Entity Recognition

This repository contains datasets from several domains annotated with a variety of entity types, useful for entity recognition and named entity recognition (NER) tasks.

NOTE: I am no longer actively adding datasets to this list -- there are likely more NER datasets that have appeared since 2020. However, I am happy to add more datasets via issues or pull requests.

Datasets for NER in English

The table below lists datasets for English-language entity recognition; NER corpora in other languages are listed further down. The data directory explains where to obtain the datasets that could not be shared here due to licensing restrictions, and contains code to convert them (where necessary) to the CoNLL 2003 format.

| Dataset | Domain | License | Reference | Availability |
|---|---|---|---|---|
| CONLL 2003 | News | DUA | Sang and Meulder, 2003 | Easy to find |
| NIST-IEER | News | None | NIST 1999 IE-ER | NLTK data |
| MUC-6 | News | LDC | Grishman and Sundheim, 1996 | LDC 2003T13 |
| OntoNotes 5 | Various | LDC | Weischedel et al., 2013 | LDC 2013T19 |
| BBN | Various | LDC | Weischedel and Brunstein, 2005 | LDC 2005T33 |
| GMB-1.0.0 | Various | None | Bos et al., 2017 | http://gmb.let.rug.nl/data.php |
| GUM-3.1.0 | Wiki | Several (*2) | Zeldes, 2017 | ✔ Included here |
| wikigold | Wikipedia | CC-BY 4.0 | Balasuriya et al., 2009 | ✔ Included here |
| Ritter | Twitter | None | Ritter et al., 2011 | No split, Train/test/dev split |
| BTC | Twitter | CC-BY 4.0 | Derczynski et al., 2016 | ✔ Included here |
| WNUT17 | Social media | CC-BY 4.0 | Derczynski et al., 2017 | ✔ Included here |
| i2b2-2006 | Medical | DUA | Uzuner et al., 2007 | http://www.i2b2.org |
| i2b2-2014 | Medical | DUA | Stubbs et al., 2015 | http://www.i2b2.org |
| CADEC | Medical | CSIRO | Karimi et al., 2015 | http://data.csiro.au/ |
| AnEM | Anatomical | CC-BY-SA 3.0 | Ohta et al., 2012 | ✔ Included here |
| MITRestaurant | Queries | None | Liu et al., 2013a | http://groups.csail.mit.edu/sls/ |
| MITMovie | Queries | None | Liu et al., 2013b | http://groups.csail.mit.edu/sls/ |
| MalwareTextDB | Malware | None | Lim et al., 2017 | http://www.statnlp.org/ |
| re3d | Defense | Several (*1) | DSTL, 2017 | ✔ Included here |
| SEC-filings | Finance | CC-BY 3.0 | Alvarado et al., 2015 | ✔ Included here |
| Assembly | Robotics | X | Costa et al., 2017 | X |
| WikiNEuRal | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2021 | https://github.com/Babelscape/wikineural |
| MultiNERD | Wikipedia | CC BY-SA-NC 4.0 | Tedeschi et al., 2022 | https://github.com/Babelscape/multinerd |
| HIPE-2022 | Historical | CC BY-SA-NC 4.0 | Ehrmann et al., 2022 | https://github.com/hipe-eval/HIPE-2022-data |
| Music-NER | Music | MIT | Epure and Hennequin, 2023 | https://github.com/deezer/music-ner-eacl2023 |
| WIESP2022-NER | Astrophysics | CC BY-SA-NC 4.0 | Grezes et al., 2022 | https://huggingface.co/datasets/adsabs/WIESP2022-NER |
| NNE | News | CC 4.0 / LDC | Ringland et al., 2019 | https://github.com/nickyringland/nested_named_entities |
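
Since the conversion scripts target the CoNLL 2003 format, a small reader can be handy for checking converted files. The following is a minimal sketch, not code from this repository. It assumes one token per line with whitespace-separated columns, the token in the first column and the NER tag in the last, blank lines between sentences, and optional `-DOCSTART-` marker lines. The function name `read_conll` and the sample lines are illustrative only, and the tagging scheme (IOB1 vs. IOB2) varies between distributions.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL 2003-style lines (layout assumed, see above)."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:  # file may not end with a blank line
        yield sentence


# Illustrative sample, not copied from any dataset file.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = list(read_conll(sample))
print(len(sentences))   # 2
print(sentences[0][0])  # ('EU', 'B-ORG')
```

To read a real file, pass an open file handle instead of `sample`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` is whichever converted file you want to inspect.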

Licenses

Notes on licenses:

(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains several datasets, with different licenses. These are:

  • CC-BY-SA 3.0 (Wikipedia dataset)
  • CC BY-NC 3.0 (BBC_Online dataset)
  • CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
  • public domain (US_State_Department dataset, CENTCOM dataset)
  • UK Open Government Licence v3.0 (UK_Government dataset)
  • Delegation_of_the_European_Union_to_Syria: see https://eeas.europa.eu/delegations/syria/8157/legal-notice_en

(2) GUM 3.1.0 comprises three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.

More detailed license information for each dataset can be found in the corresponding subdirectory.

Later ...

  • Tabassum et al., Code and Named Entity Recognition in StackOverflow: https://cocoxu.github.io/publications/ACL2020_stackoverflow_NER.pdf
  • LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities, NAACL 2019)
  • NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019: https://github.com/nickyringland/nested_named_entities
  • Mars Target Encyclopedia - LPSC abstracts labeled data set: https://zenodo.org/record/1048419#.W5a2CBwnZhE
  • Best Buy queries: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home
  • Resume entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home
  • FEW-NERD: A Few-shot Named Entity Recognition Dataset: https://aclanthology.org/2021.acl-long.248/

Datasets for NER in other languages

Lexical Named Entity resources

Code-Switching

German

Dutch

Afrikaans

Spanish

Catalan

Galician

Basque

Portuguese

French

Italian

Romanian

Greek

Hungarian

Czech

Polish

Croatian

Slovak

Slovene

Ukrainian

Serbian

Bulgarian

  • BulTreeBank (BTB)

Icelandic

  • MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson. "Towards High Accuracy Named Entity Recognition for Icelandic." Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019): http://www.malfong.is/index.php?pg=mim_gold_ner

Danish

Norwegian

Swedish

Finnish

Estonian

Latvian and Lithuanian

Turkish

Kazakh

Uyghur

  • Uyghur Named Entity Relation corpus: https://github.com/kaharjan/UyNeRel (Abiderexiti et al., Annotation Schemes for Constructing Uyghur Named Entity Relation Corpus. IALP 2016)

Armenian

Coptic

Amharic

Arabic

Persian

Sindhi

Urdu

Indic

Hindi

Bengali

Telugu

Maithili

Nepali

Marathi

Punjabi

Tamil

Malayalam

Oriya/Odia

Sinhala/Sinhalese

  • LORELEI (LDC2018E57)

Thai

Indonesian

Vietnamese

Japanese

Korean

Chinese

Russian

Yoruba

Swahili

Igbo

isiNdebele

Xhosa

Zulu

Sepedi

Sesotho

Setswana

Siswati

Venda

Xitsonga

Latin

A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html

References

[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor, and Timothy Baldwin. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84-90. 2015. Accessed: August 2018.

[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. Named entity recognition in Wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pp. 10-18. Association for Computational Linguistics, 2009.

[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank. In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.

[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1169-1179. 2016. Available at: https://github.com/GateNLP/broad_twitter_corpus Accessed: August 2018.

[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp, Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy, User-generated Text. Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html

[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d. Accessed: January 2018.

[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.

[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au Accessed: November 2017.

[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and Chen Hui Ong. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.

[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8386-8390. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/restaurant/ Accessed: January 2018

[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 72-77. IEEE. Available at https://groups.csail.mit.edu/sls/downloads/movie/ We used the trivia10k13 portion. Accessed: January 2018

[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm. The newswire development test data only (included in the NLTK package).

[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36. Available at: http://www.nactem.ac.uk/anatomy/ and https://github.com/openbiocorpora/anem Accessed: November 2017.

[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Accessed January 2018.

[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58:S20-S29. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.

[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550-563. Available at https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.

[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia.

[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).

[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581-612. Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/ Accessed: November 2017.

entity-recognition-datasets's People

Contributors

abhipec, angledluffa, hvingelby, juand-r, leondz, mdredze, roman-janik, sted97, toshihikosakai, tutubalinaev

entity-recognition-datasets's Issues

print() is a function in Python 3

flake8 testing of https://github.com/juand-r/entity-recognition-datasets on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./data/NIST_IEER/CONLL-format/utils/quick_comma_fix.py:41:37: E999 SyntaxError: invalid syntax
                    print annotations
                                    ^
./data/NIST_IEER/CONLL-format/utils/makeconll.py:29:30: E999 SyntaxError: invalid syntax
                print category
                             ^
./data/GUM/CONLL-format/utils/webAnnotsv_to_conll.py:31:32: E999 SyntaxError: invalid syntax
        print 'Made directory: ', newdir
                               ^
./data/re3d/CONLL-format/utils/re3d_to_bratann.py:83:15: E999 SyntaxError: invalid syntax
        print i, e['value']
              ^
./data/i2b2_2006/CONLL-format/utils/i2b2toconll.py:23:15: E999 SyntaxError: invalid syntax
	print filename
              ^
./data/BBN/CONLL-format/utils/bbn2conll.py:35:18: E999 SyntaxError: invalid syntax
    print filename
                 ^
6     E999 SyntaxError: invalid syntax
6
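
The six flagged lines all use the Python 2 `print` statement, which is a syntax error in Python 3. A minimal sketch of the equivalent function calls follows; the variable values are hypothetical stand-ins, not data from the scripts, and output is captured in a buffer only so the result can be inspected.

```python
import io

# Hypothetical stand-ins for the variables used in the flagged scripts.
filename = "APW_19980314"
newdir = "CONLL-format/data"
entities = [{"value": "London"}, {"value": "NIST"}]

out = io.StringIO()  # capture output so the result can be inspected

# Python 2: print filename
print(filename, file=out)

# Python 2: print 'Made directory: ', newdir
print('Made directory: ', newdir, file=out)

# Python 2: print i, e['value']
for i, e in enumerate(entities):
    print(i, e['value'], file=out)

print(out.getvalue(), end="")
```

In the scripts themselves the `file=out` argument is not needed; wrapping the arguments in parentheses is enough. Adding `from __future__ import print_function` as the first statement of each script keeps the same code working on Python 2.6 and later.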

Information of the datasets

Hello,

Thanks for sharing these datasets!
I am trying to find some more specific information about them; for instance, how many tweets/comments/news articles are in WNUT17 and in CONLL 2003?

Thanks,
Cheers,
Camille
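
One way to get such counts locally is to compute them from the CoNLL-format files. The sketch below is an illustration under assumptions, not a tool from this repository: it assumes one token per line with the NER tag in the last column, blank lines between sentences, and IOB2 tagging in which every entity mention starts with a `B-` tag. The function name `conll_stats` and the sample lines are made up for the example. It counts sentences, tokens, and mentions rather than documents.

```python
from collections import Counter
from typing import Iterable


def conll_stats(lines: Iterable[str]) -> Counter:
    """Count sentences, tokens and entity mentions in CoNLL-style lines."""
    stats: Counter = Counter()
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers separate sentences.
        if not line or line.startswith("-DOCSTART-"):
            in_sentence = False
            continue
        if not in_sentence:
            stats["sentences"] += 1
            in_sentence = True
        stats["tokens"] += 1
        # Under IOB2, each mention begins with exactly one B- tag.
        if line.split()[-1].startswith("B-"):
            stats["mentions"] += 1
    return stats


# Illustrative sample, not taken from any dataset.
sample = [
    "She O",
    "visited O",
    "New B-location",
    "York I-location",
    "",
    "Thanks O",
]
stats = conll_stats(sample)
print(dict(stats))  # {'sentences': 2, 'tokens': 5, 'mentions': 1}
```

For files that use IOB1 tagging (where a mention can start with an `I-` tag), the mention count would need a different rule, for example counting transitions from `O` or from a different label.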

A Knowledge Graph resource of NER datasets

Dear authors, this repository is such a great resource! Many thanks for creating it. I would like to suggest that the Open Research Knowledge Graph (https://orkg.org/) could be used to list such resources for persistence and knowledge sharing. Below are some resources I created that relate to the information in this repository.

Named Entity Recognition Tasks in the MUC series

https://orkg.org/comparison/R162797/

NER in the Automatic Content Extraction (ACE) Series

https://orkg.org/comparison/R162851/

Named Entity Recognition in the CoNLL Series and the OntoNotes corpus as a related resource

https://orkg.org/comparison/R166315/

Named Entity Recognition Based on Wikipedia

https://orkg.org/comparison/R166240/

A comparison of the annotated resources of software mentions in scholarly articles

https://orkg.org/comparison/R166560/

NLP Datasets for Named Entity Recognition and Relation Extraction from Biomedicine Scholarly Articles

https://orkg.org/comparison/R163265/

Comparisons and Visualizations of the CrossNER Benchmark Corpus for its Source and Target Domains

https://orkg.org/comparison/R163843/

Surveying BioNLP Shared Tasks Corpora for Named Entity Recognition

https://orkg.org/comparison/R165702/

Surveying BioCreAtIvE Shared Tasks Corpora for Named Entity Recognition

https://orkg.org/comparison/R172155/


The benefit of such machine-encoded data is that reviews can be generated from it automatically.

Surveying the BioCreAtIvE Shared Task Series

https://orkg.org/review/R172166

Surveying the BioNLP Shared Task Series

https://orkg.org/review/R165924

I would be very happy to offer support in this direction. :)

Tree too deeply nested; IEER dataset

I am trying to convert the NIST-IEER to CoNLL format and see the following error:
It looks like it gets through the first 6 files fine but only gets partway through NYT-19980407

Done with  APW_19980314
Done with  APW_19980424
Done with  APW_19980429
Done with  NYT_19980315
Done with  NYT_19980403
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "makeconll.py", line 109, in write_all_to_conll
    write_conll(filename)
  File "makeconll.py", line 100, in write_conll
    sentences = parse_doc(filename, index)
  File "makeconll.py", line 67, in parse_doc
    tags = tree2conll_without_postags(dt)
  File "makeconll.py", line 35, in tree2conll_without_postags
    raise ValueError("Tree is too deeply nested to be printed in CoNLL format")
ValueError: Tree is too deeply nested to be printed in CoNLL format

I am also wondering if there is a way to reconstruct the original text of the articles.
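
The error message suggests that the document contains an entity nested inside another entity, which a single tag-per-token CoNLL column cannot represent. One possible workaround, sketched below, is to keep only the outermost entity label. This is not the repository's `makeconll.py` logic: the tree here is a hypothetical minimal structure (a node is a `(label, children)` tuple, a leaf is a token string), and the function names are made up for the example. Joining the leaves gives the token sequence back, which only approximates the original text because the original whitespace is not stored in the tree.

```python
from typing import List, Tuple, Union

# Hypothetical minimal tree: a node is (label, children); a leaf is a token.
Node = Union[str, Tuple[str, list]]


def leaves(node: Node) -> List[str]:
    """Return all tokens under a node, in order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    tokens: List[str] = []
    for child in children:
        tokens.extend(leaves(child))
    return tokens


def tree_to_iob(children: List[Node]) -> List[Tuple[str, str]]:
    """Tag tokens with IOB2 labels, keeping only the outermost entity."""
    tagged: List[Tuple[str, str]] = []
    for child in children:
        if isinstance(child, str):
            tagged.append((child, "O"))
            continue
        label, _ = child
        # Nested entities are flattened: every token gets the outer label.
        for position, token in enumerate(leaves(child)):
            prefix = "B" if position == 0 else "I"
            tagged.append((token, f"{prefix}-{label}"))
    return tagged


# Illustrative sentence with a LOCATION nested inside an ORGANIZATION.
doc: List[Node] = [
    "The",
    ("ORGANIZATION", [("LOCATION", ["New", "York"]), "Times"]),
    "reported",
    ".",
]

tagged = tree_to_iob(doc)
print(tagged[1:4])
# [('New', 'B-ORGANIZATION'), ('York', 'I-ORGANIZATION'), ('Times', 'I-ORGANIZATION')]

text = " ".join(leaves(("S", doc)))
print(text)  # The New York Times reported .
```

Keeping the outermost label discards the inner `LOCATION` annotation; keeping the innermost label instead is an equally valid choice, depending on what the downstream task needs.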

GUM version

I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but it seems to be a rather old version. We now have much more data, including four more genres and bringing up the total word count to about 130,000 tokens annotated for nested, (non-)named entities. Would you like to update the data to include the latest version?
