ufal / conll2017
CoNLL 2017 Shared Task Proposal: UD End-to-End parsing
We need to define exactly what should be output (and scored) by the parsers. I assume that, in addition to dependencies, parsers should output part-of-speech tags and morphological features.
But what about lemmas? On the one hand, they are part of the UD representation. On the other hand, they are language-specific and not consistently defined across languages. I would be inclined to leave them out of the required output, but participants would of course be free to include them in their systems if they think it improves performance.
Another issue concerns language-specific tags, which are available for many languages and which often improve performance on those languages. Are these to be banned in the interest of universality or are they fair game for improving performance although not part of the required output?
I would strongly advocate using a UD-specific main evaluation metric that focuses on content word dependencies. Please see http://stp.lingfil.uu.se/~nivre/docs/udeval-cl.pdf for some of the arguments for such a metric and a concrete proposal.
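For concreteness, that kind of metric can be sketched as a labelled attachment score restricted to content words; the set of functional relations below is an illustrative assumption, not the exact definition from the linked proposal:

```python
# Sketch of a labelled attachment score computed over content words only.
# FUNCTIONAL_RELATIONS is an illustrative guess, not the exact set from
# the cited proposal.
FUNCTIONAL_RELATIONS = {"aux", "case", "cc", "clf", "cop", "det",
                        "mark", "punct"}

def content_word_las(gold, system):
    """gold, system: lists of (head_index, relation) per word, aligned 1:1.
    Score only words whose *gold* relation is a content-word relation."""
    scored = [(g, s) for g, s in zip(gold, system)
              if g[1] not in FUNCTIONAL_RELATIONS]
    if not scored:
        return 0.0
    correct = sum(1 for g, s in scored if g == s)
    return correct / len(scored)

gold   = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "cop"), (3, "punct")]
system = [(2, "det"), (3, "nsubj"), (0, "root"), (2, "aux"), (3, "punct")]
print(content_word_las(gold, system))   # only nsubj and root are scored -> 1.0
```

Note that the system's error on the copula does not affect the score here, which is exactly the point of a content-word-focused metric.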
At the Berlin meeting, it was agreed that the evaluation data will be "Google-supplied parallel data, about 500 sentences per language".
Will this be the only evaluation data, or will there be two test sets for some languages? (I suspect Google will provide only some languages, so for Latin, for example, the test set will need to be the current test set in any case.)
Google-supplied data has the advantage of being parallel (more comparable across languages, and it could support cross-lingual parsing tasks in the future) and newly created (translated), so unseen by any participants.
On the other hand, it will come from a very different domain than most UD treebanks, so the task will shift towards domain-adaptation techniques. Most probably it will also follow a different UD annotation style than the training treebanks.
For a proper domain-adaptation task it would be nice to provide a small part of the 500 sentences in advance for tuning (or at least some unlabeled sentences from the "Google" domain).
However, shifting to domain adaptation makes the task more difficult (there may be fewer participants willing to take part).
We could also provide two sets of results (Google-domain adaptation and the standard test set).
The original proposal computes the overall score as an arithmetic mean of the individual corpus scores. That would probably motivate participants to spend a lot of time tuning performance on small corpora (where even 10 words can amount to more than 1% of the test set) -- we were even worried that people would manually annotate more data in that language.
Currently, as discussed in #2, we propose to leave out corpora that are too small, which is partly motivated by this issue.
There are other ways we could compute the overall score -- for one, we could use the F1 score computed over words from all corpora (i.e., analogously to micro-averaged accuracy). However, in that case the overall score would be determined by performance on the 5-10 biggest corpora.
There are additional, more complex ways to compute the overall score, but to keep the proposal simple, we chose between the two possibilities described above (and selected the macro-accuracy analogue).
However, maybe we could use a more complex way of computing the overall score after all -- for example, a weighted arithmetic mean with the logarithms of corpus sizes as weights.
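To make the trade-offs concrete, here is a small sketch comparing the three aggregation schemes; the corpus names and counts are purely illustrative:

```python
# Sketch comparing three ways to aggregate per-corpus scores into one
# overall number. The corpora and (correct, total) word counts below
# are made-up illustrations, not real treebank figures.
import math

corpora = {
    "big_treebank":  (95000, 100000),
    "mid_treebank":  (8500, 10000),
    "tiny_treebank": (700, 1000),
}

def macro_average(corpora):
    """Arithmetic mean of per-corpus accuracies (the original proposal):
    a tiny corpus counts as much as a huge one."""
    scores = [c / t for c, t in corpora.values()]
    return sum(scores) / len(scores)

def micro_average(corpora):
    """Pool all words from all corpora: the result is dominated by the
    biggest corpora."""
    correct = sum(c for c, _ in corpora.values())
    total = sum(t for _, t in corpora.values())
    return correct / total

def log_weighted_average(corpora):
    """Weighted mean with log corpus sizes as weights -- the compromise
    mentioned above: big corpora count more, but only logarithmically."""
    weights = {k: math.log(t) for k, (_, t) in corpora.items()}
    total_w = sum(weights.values())
    return sum(weights[k] * c / t for k, (c, t) in corpora.items()) / total_w
```

On these invented numbers the log-weighted mean lands between the macro and micro averages, which is the intended effect.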
The current proposal says that for surprise languages there will be "no training/dev data; possibly 10-20 sentences annotated."
A: There will be really no training data, no parallel data in known languages, and "und" will be used instead of the language code. So this is a kind of unsupervised parsing where we know the output should follow UD labels and guidelines. Alternatively, one could try to detect the language automatically and use trained models of similar languages.
B: Same as A, except that instead of automatically detecting the language, its real ISO code will be provided during the (automatic) testing.
C: Same as B, plus 20 training sentences (provided X hours/weeks/months before the testing).
The problem with both A and B is that if someone knows (or guesses) what the surprise languages are, it would be a big advantage (it is tempting to adapt the unknown-language parser, e.g., for Slavic/Romance/... languages). Also, unsupervised parsing is quite a different task from usual UD parsing, yet we would require all participants to do it (they could produce some baseline output, but it would influence their final score).
The problem with C is that someone could spend time on last-minute tuning of the parser (annotating more sentences), which is probably not the goal.
I am not strongly against options A/B/C (or another option with raw sentences or translations provided for training). I am just missing a different surprise-language task:
No training data is provided in advance. The participants submit a trainable parser to the TIRA platform. The surprise training data will be provided just before testing, so no manual search for hyperparameters will be possible.
Hey all! I had a thought while waiting at the airport the other day about how we could deal with mismatches in tokenisation and have something that will work for both Turkic languages and Chinese (and for other languages too).
The basic idea is that instead of relying on matching segments of surface forms, we use the surface forms only to delimit the character ranges that syntactic words can be found in.
This isn't quite a concrete proposal yet (I'm going to try and do an implementation), but I thought I'd get it out there early to see if people think I might be on to something, or if it is not worth pursuing.
So, suppose you have the sentence: "Bu ev mavidi." 'This house blue was.' [1]
The gold standard annotation is:
1 Bu bu DET DET _ 2 det
2 ev ev NOUN NOUN Case=Nom 3 nsubj
3-4 mavidi _ _ _ _ _ _
3 mavi mavi ADJ ADJ _ 0 root
4 _ i VERB VERB Tense=Past 3 cop
5 . . PUNCT PUNCT _ 3 punct
But your tokeniser might produce (a):
1 Bu bu DET DET _ 2 det
2 ev ev NOUN NOUN Case=Nom 3 nsubj
3-4 mavidi _ _ _ _ _ _
3 mavi mavi ADJ ADJ _ 0 root
4 di i VERB VERB Tense=Past 3 cop
5 . . PUNCT PUNCT _ 3 punct
or even (b):
1 Bu bu DET DET _ 2 det
2 ev ev NOUN NOUN Case=Nom 3 nsubj
3 mavi mavi ADJ ADJ _ 0 root
4 di i VERB VERB Tense=Past 3 cop
5 . . PUNCT PUNCT _ 3 punct
What we are really interested in is the syntactic words and their relations, but we don't want to count any word twice. We can use the character ranges in the surface forms of the gold standard to delimit the character ranges in which the syntactic words should be found. So for example,
Gold standard "bu|ev|mavidi|."
0-2 [(bu, DET, 2, det)]
2-4 [(ev, NOUN, 3, nsubj)]
4-10 [(mavi, ADJ, 0, root), (i, VERB, 3, cop)]
10-11 [(., PUNCT, 3, punct)]
(b) "bu|ev|mavi|di|."
0-2 [(bu, DET, 2, det)]
2-4 [(ev, NOUN, 3, nsubj)]
4-8 [(mavi, ADJ, 0, root)]
8-10 [(i, VERB, 3, cop)]
10-11 [(., PUNCT, 3, punct)]
As 4-8 and 8-10 fall within the range 4-10, both of those syntactic words would match, without having to rely on substring matching of the surface form.
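A rough sketch of this matching procedure (the data structures here are invented for illustration; a real implementation would read surface tokens and syntactic words from the CoNLL-U ID column):

```python
# Minimal sketch of the character-range idea above: surface tokens only
# delimit character ranges, and syntactic words are matched inside those
# ranges regardless of how the tokeniser split the surface string.

def char_ranges(surface_tokens):
    """Map each surface token to its (start, end) offsets in the
    whitespace-free sentence string."""
    ranges, pos = [], 0
    for tok in surface_tokens:
        ranges.append((pos, pos + len(tok)))
        pos += len(tok)
    return ranges

def align(gold, system):
    """Count system syntactic words whose character span falls inside a
    gold span covering the same annotation, never counting any gold word
    twice.

    gold / system: list of (surface_token, [annotation, ...]) pairs,
    where each annotation stands for one syntactic word."""
    def spans(side):
        out = []
        for (start, end), (_, words) in zip(
                char_ranges([t for t, _ in side]), side):
            for w in words:
                out.append(((start, end), w))
        return out

    gold_spans, matched = spans(gold), 0
    for (s_start, s_end), word in spans(system):
        for i, ((g_start, g_end), g_word) in enumerate(gold_spans):
            if g_start <= s_start and s_end <= g_end and word == g_word:
                matched += 1
                del gold_spans[i]   # each gold word may match only once
                break
    return matched

# The Turkish example above: gold has "mavidi" as one surface token
# spanning two syntactic words; tokenisation (b) splits it into "mavi|di".
gold = [("Bu", ["det"]), ("ev", ["nsubj"]),
        ("mavidi", ["root", "cop"]), (".", ["punct"])]
system_b = [("Bu", ["det"]), ("ev", ["nsubj"]),
            ("mavi", ["root"]), ("di", ["cop"]), (".", ["punct"])]
print(align(gold, system_b))   # all 5 syntactic words match
```

The spans 4-8 ("mavi") and 8-10 ("di") both fall inside the gold range 4-10 ("mavidi"), so both syntactic words are credited without any substring matching of surface forms.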
Caveats:
Just to clarify (I am not going to put it in the proposal but we will have to decide it later):
Are we going to require that people do word segmentation in Chinese (and Japanese, Thai, etc., if these languages are added to UD)? It would be in line with our End-to-End philosophy, but it is obviously harder than learning that "aux" = "à les".
UDPipe is probably not going to help here, right, @foxik? I think there would be neither SpaceAfter=No nor multi-word tokens, so UDPipe would need an option to treat every sentence as one huge multi-word token. But even then I suspect the accuracy will not be great, unless it does something Chinese-specific.
I'll open an issue for this. Our suggestion here in Turku is to provide the participants with parallel data and/or dictionaries and/or multilingual embeddings to support the development of lexicalized transfer methods in the shared task. Our argument is that if we do not allow any such resources, we limit people to delexicalized methods, which is 1) not a realistic setting and 2) does not use UD to its full potential.
Opinions?
The TIRA platform, recommended for evaluation, is at http://www.tira.io/. We need to know how it works, and it would be nice to know even before we submit the proposal. Unfortunately, I don't see any documentation on the website, and I don't see a playground where one could create a toy task and explore. Instead, one has to write an e-mail to [email protected] and first ask them to host the task. Should we take that step now?
I don't know whether
I think it was left undecided what to do with languages that have multiple treebanks. My relatively strong opinion is to treat them separately, i.e., not to pool the test sets. Participants can pool the training data any way they like, naturally. My primary reason is that it is well known that the treebanks are not necessarily all that compatible, and that will not be fully remedied before the ST. If we pool the test sets, the systems will be punished for differences between the individual treebanks. This would have the effect of 1) underestimating the learning capability of the parsers and 2) underestimating the actual performance of the parsers by superimposing test-data noise over the results.
I think getting a good estimate of the real performance of state-of-the-art parsers on the data would be an important outcome of the task, and we should try to avoid external measurement errors. I think it would not be good if we have to write into the ST report paper that "numbers for language X seem to be quite low, but that is probably because treebanks X.1 and X.2 are in the end quite different in the details and so we do not really know how well we can parse language X".
(with some exceptions -- Japanese because of the license, very small corpora and corpora which cannot be reliably detokenized).
How would you define "very small corpora"? Would this mean that Kazakh and Buryat are excluded?
Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resources), but how about providing precomputed word2vec embeddings?
If we don't do it, we will end up with some artificially impoverished parsers...
I/We (=Google) can probably help generate these word embeddings if needed.
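If precomputed embeddings were distributed, the plain word2vec text format would be easy for every participant to consume; as a sketch, a stdlib-only reader for that format (a header line with vocabulary size and dimension, then one word and its vector per line) might look like this:

```python
# Minimal reader for the word2vec *text* format: a header line
# "<vocab_size> <dimension>" followed by one "<word> <v1> ... <vd>"
# line per word. Pure stdlib, for illustration only; real systems
# would likely use a dedicated library instead.
def load_word2vec_text(lines):
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"bad vector length for {word!r}"
        vectors[word] = values
    assert len(vectors) == vocab_size
    return vectors

# Tiny made-up embedding file with 2 words of dimension 3:
example = ["2 3", "house 0.1 0.2 0.3", "blue -0.5 0.0 0.25"]
emb = load_word2vec_text(example)
print(emb["blue"])   # [-0.5, 0.0, 0.25]
```

Agreeing on one fixed format (and one fixed set of vectors per language) would also keep the playing field level across participants.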
Does the 10,000-token count for the evaluation corpus include punctuation tokens? How should it be calculated?