jeniyat / wnut_2020_ner


This repository will contain the data and code for the WNUT 2020 NER task.

License: MIT License

Python 98.32% TeX 1.68%


wnut_2020_ner's Issues

Discrepancy between the standoff and the CoNLL data.

Hi! I was looking at the dataset and found differences in some of the counts between the standoff and the CoNLL data. The number of sentences in the standoff training data matches the number in your readme.

$ cat data/train_data/Standoff_Format/*.txt | wc -l
8436
$ cat data/dev_data/Standoff_Format/*.txt | wc -l
2838
$ cat data/test_data/Standoff_Format/*.txt | wc -l
2809

(Note that the number of sentences in the standoff-format test data doesn't seem to match your readme either, unless the sentences in the readme aren't equivalent to steps in the protocol?)

However, when I looked at the CoNLL-formatted data, first using my custom CoNLL reading code and then using the conllu package from PyPI, I saw more sentences in the training and development data (8444 and 2839 respectively) and fewer sentences in the test data (2803).

Here is the code I used to get the counts.

from glob import glob

from conllu import parse

docs = 0
sentences = 0
for file in glob("data/train_data/Conll_Format/*_conll.txt"):
    docs += 1
    with open(file) as f:
        # Treat each file as two-column CoNLL: surface token and NER tag.
        sents = parse(f.read(), fields=["surface", "NER"])
        sentences += len(sents)
print(f"# of documents: {docs}")
print(f"# of sentences: {sentences}")

# of documents: 370
# of sentences: 8444

Any idea what is going on?
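
One way to narrow this down is to compare the counts file by file instead of in aggregate. The sketch below is only an illustration, not part of the repository: the function names are hypothetical, it assumes each line of a standoff .txt file is one sentence (as the wc -l commands above do), and it assumes a CoNLL sentence is a run of non-blank lines. Pairing each standoff file with its CoNLL counterpart is left to the caller.

```python
def count_standoff_sentences(text: str) -> int:
    """Count sentences the way `wc -l` does: one per newline character."""
    return text.count("\n")


def count_conll_sentences(text: str) -> int:
    """Count runs of consecutive non-blank lines; each run is one sentence."""
    count = 0
    in_sentence = False
    for line in text.split("\n"):
        if line.strip():
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


def find_mismatches(pairs: dict) -> list:
    """Return (name, standoff_count, conll_count) for every pair that disagrees.

    `pairs` maps a document name to a (standoff_text, conll_text) tuple.
    """
    mismatches = []
    for name, (standoff_text, conll_text) in sorted(pairs.items()):
        n_standoff = count_standoff_sentences(standoff_text)
        n_conll = count_conll_sentences(conll_text)
        if n_standoff != n_conll:
            mismatches.append((name, n_standoff, n_conll))
    return mismatches
```

Feeding this the contents of each .txt file and its matching *_conll.txt file would list the documents where the two formats disagree. Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline is undercounted by one; that may or may not explain part of the difference here.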

Is relation extraction still a part of this Shared Task?

Hi! I was looking at the released data and code and noticed some things that led me to wonder: Is relation extraction still part of this Shared Task?

There are inconsistencies between the Standoff and the CoNLL data, making combining the data difficult. (A similar issue was already raised in #3.) These are related to tokenization. Two randomly chosen examples:
- In protocol_7.ann, line 268, the token untransformed actually contains a space at the end, which messes up token boundaries. The character appears to be a non-breaking space, and the same problem occurs multiple times in the data.
- protocol_102_conll.txt, line 149, contains the token FLowCamMake, labeled as B-Device. In contrast, protocol_102.ann correctly identifies that there is an entity boundary within this sequence of characters, splitting it into FLowCam and Make sure.
Unless I am missing something, there does not seem to be an official evaluation script for relations/events. The Standoff-to-CoNLL conversion script also seems to completely ignore relations. The baseline system also seems to only predict entities.
The Readme.md is titled "WNUT 2020: Named Entity Extraction", not mentioning relations.

So my question is: Are relations still part of the Shared Task? If so, will an official evaluation script be released for them?
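
The non-breaking-space case from the first example above can be checked mechanically. The following is a minimal sketch, assuming the .ann files follow the usual brat standoff layout, in which a text-bound annotation line has three tab-separated fields (ID, "Type start end", text). The function name is hypothetical and nothing here comes from the repository's own scripts.

```python
def find_padded_entities(ann_text: str) -> list:
    """Return (id, text) for text-bound annotations whose text starts or
    ends with whitespace, including the non-breaking space U+00A0."""
    hits = []
    for line in ann_text.split("\n"):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if not line.startswith("T"):
            continue  # skip relations, events, and other annotation types
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed or incomplete line
        ann_id, surface = parts[0], parts[2]
        # str.strip() with no arguments also removes U+00A0.
        if surface != surface.strip():
            hits.append((ann_id, surface))
    return hits
```

Printing each hit with repr() makes the otherwise invisible character show up as '\xa0', which should make it easier to list every affected annotation.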

transformation between conll and standoff

Predictions are converted to CoNLL format after they are made. In lines 224 to 231 of code/baseline_CRF/crf_ner_wlp.py, the output is converted into standoff and then converted back into CoNLL. Could you please explain why this double conversion is needed?
