ufal / conll2017
CoNLL 2017 Shared Task Proposal: UD End-to-End parsing
We need to define exactly what should be output (and scored) by the parsers. I assume that, in addition to dependencies, parsers should output part-of-speech tags and morphological features.
But what about lemmas? On the one hand, they are part of the UD representation. On the other hand, they are language-specific and not consistently defined across languages. I would be inclined to leave them out of the required output, but participants would of course be free to include them in their systems if they think it improves performance.
Another issue concerns language-specific tags, which are available for many languages and which often improve performance on those languages. Are these to be banned in the interest of universality or are they fair game for improving performance although not part of the required output?
I would strongly advocate using a UD-specific main evaluation metric that focuses on content word dependencies. Please see http://stp.lingfil.uu.se/~nivre/docs/udeval-cl.pdf for some of the arguments for such a metric and a concrete proposal.
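For concreteness, that kind of metric can be sketched as a labelled attachment score restricted to content words; the set of functional relations below is an illustrative assumption, not the exact definition from the linked proposal:

```python
# Sketch of a labelled attachment score computed over content words only.
# FUNCTIONAL_RELATIONS is an illustrative guess, not the exact set from
# the cited proposal.
FUNCTIONAL_RELATIONS = {"aux", "case", "cc", "clf", "cop", "det",
                        "mark", "punct"}

def content_word_las(gold, system):
    """gold, system: lists of (head_index, relation) per word, aligned 1:1.
    Score only words whose *gold* relation is a content-word relation."""
    scored = [(g, s) for g, s in zip(gold, system)
              if g[1] not in FUNCTIONAL_RELATIONS]
    if not scored:
        return 0.0
    correct = sum(1 for g, s in scored if g == s)
    return correct / len(scored)

gold   = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "cop"), (3, "punct")]
system = [(2, "det"), (3, "nsubj"), (0, "root"), (2, "aux"), (3, "punct")]
print(content_word_las(gold, system))   # only nsubj and root are scored -> 1.0
```

Note that the system's error on the copula does not affect the score here, which is exactly the point of a content-word-focused metric.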
At the Berlin meeting, it was agreed that the evaluation data will be "Google-supplied parallel data, about 500 sentences per language".
Will this be the only evaluation data, or will there be two test sets for some languages? (I suspect Google will provide only some languages, so for Latin, for example, the test set will need to be the current test set in any case.)
Google-supplied data has the advantage of being parallel (more comparable across languages, and it could support cross-lingual parsing tasks in the future) and newly created (translated), so unseen by any participants.
On the other hand, it will come from a very different domain than most UD treebanks, so the task will shift towards domain-adaptation techniques. Most probably it will also follow a different UD annotation style than the training treebanks.
For a proper domain-adaptation task it would be nice to provide a small part of the 500 sentences in advance for tuning (or at least some unlabeled sentences from the "Google" domain).
However, shifting to domain adaptation makes the task more difficult (there may be fewer participants willing to take part).
We could also provide two sets of results (Google-domain adaptation and the standard test set).
The original proposal computes the overall score as an arithmetic mean of the individual corpus scores. That would probably motivate participants to spend a lot of time tuning performance on small corpora (where even 10 words can amount to more than 1% of the test set) -- we were even worried that people would manually annotate more data in that language.
Currently, as discussed in #2, we propose to leave out corpora that are too small, which is partly motivated by this issue.
There are other ways we could compute the overall score -- for one, we could use the F1 score computed over words from all corpora (i.e., analogously to micro-averaged accuracy). However, in that case the overall score would be determined by performance on the 5-10 biggest corpora.
There are additional, more complex ways to compute the overall score, but to keep the proposal simple, we chose between the two possibilities described above (and selected the macro-accuracy analogue).
However, maybe we could use a more complex way of computing the overall score after all -- for example, a weighted arithmetic mean with the logarithms of corpus sizes as weights.
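To make the trade-offs concrete, here is a small sketch comparing the three aggregation schemes; the corpus names and counts are purely illustrative:

```python
# Sketch comparing three ways to aggregate per-corpus scores into one
# overall number. The corpora and (correct, total) word counts below
# are made-up illustrations, not real treebank figures.
import math

corpora = {
    "big_treebank":  (95000, 100000),
    "mid_treebank":  (8500, 10000),
    "tiny_treebank": (700, 1000),
}

def macro_average(corpora):
    """Arithmetic mean of per-corpus accuracies (the original proposal):
    a tiny corpus counts as much as a huge one."""
    scores = [c / t for c, t in corpora.values()]
    return sum(scores) / len(scores)

def micro_average(corpora):
    """Pool all words from all corpora: the result is dominated by the
    biggest corpora."""
    correct = sum(c for c, _ in corpora.values())
    total = sum(t for _, t in corpora.values())
    return correct / total

def log_weighted_average(corpora):
    """Weighted mean with log corpus sizes as weights -- the compromise
    mentioned above: big corpora count more, but only logarithmically."""
    weights = {k: math.log(t) for k, (_, t) in corpora.items()}
    total_w = sum(weights.values())
    return sum(weights[k] * c / t for k, (c, t) in corpora.items()) / total_w
```

On these invented numbers the log-weighted mean lands between the macro and micro averages, which is the intended effect.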
The current proposal says that for surprise languages there will be "no training/dev data; possibly 10-20 sentences annotated."
A: There will be really no training data, no parallel data in known languages, and "und" will be used instead of the language code. So this is a kind of unsupervised parsing where we know the output should follow UD labels and guidelines. Alternatively, one could try to detect the language automatically and use trained models of similar languages.
B: Same as A, except that instead of automatically detecting the language, its real ISO code will be provided during the (automatic) testing.
C: Same as B, plus 20 training sentences (provided X hours/weeks/months before the testing).
The problem with both A and B is that if someone knows (or guesses) what the surprise languages are, it would be a big advantage (it is tempting to adapt the unknown-language parser, e.g., for Slavic/Romance/... languages). Also, unsupervised parsing is quite a different task from usual UD parsing, yet we would require all participants to do it (they could produce some baseline output, but it would influence their final score).
The problem with C is that someone could spend time on last-minute tuning of the parser (annotating more sentences), which is probably not the goal.
I am not strongly against options A/B/C (or another option with raw sentences or translations provided for training). I am just missing a different surprise-language task:
No training data is provided in advance. The participants submit a trainable parser to the TIRA platform. The surprise training data will be provided just before testing, so no manual search for hyperparameters will be possible.
Hey all! I had a thought while waiting at the airport the other day about how we could deal with mismatches in tokenisation and have something that will work for both Turkic languages and Chinese (and for other languages too).
The basic idea is that instead of relying on matching segments of surface forms, we use the surface forms only to delimit the character ranges that syntactic words can be found in.
This isn't quite a concrete proposal yet (I'm going to try and do an implementation), but I thought I'd get it out there early to see if people think I might be on to something, or if it is not worth pursuing.
So, suppose you have the sentence: "Bu ev mavidi." 'This house blue was.' [1]
The gold standard annotation is:
1 Bu bu DET DET _ 2 det
2 ev ev NOUN NOUN Case=Nom 3 nsubj
3-4 mavidi _ _ _ _ _ _
3 mavi mavi ADJ ADJ _ 0 root
4 _ i VERB VERB Tense=Past 3 cop
5 . . PUNCT PUNCT _ 3 punct
But your tokeniser might produce (a):
1 Bu bu DET DET _ 2 det
2 ev ev NOUN NOUN Case=Nom 3 nsubj
3-4 mavidi _ _ _ _ _ _
3 mavi mavi ADJ ADJ _ 0 root
4 di i VERB VERB Tense=Past 3 cop
5 . . PUNCT PUNCT _ 3 punct
or even (b):
1 Bu bu DET DET _ 2 det
2 ev ev NOUN NOUN Case=Nom 3 nsubj
3 mavi mavi ADJ ADJ _ 0 root
4 di i VERB VERB Tense=Past 3 cop
5 . . PUNCT PUNCT _ 3 punct
What we are really interested in is the syntactic words and their relations, but we don't want to count any word twice. We can use the character ranges in the surface forms of the gold standard to delimit the character ranges in which the syntactic words should be found. So for example,
Gold standard "bu|ev|mavidi|."
0-2 [(bu, DET, 2, det)]
2-4 [(ev, NOUN, 3, nsubj)]
4-10 [(mavi, ADJ, 0, root), (i, VERB, 3, cop)]
10-11 [(., PUNCT, 3, punct)]
(b) "bu|ev|mavi|di|."
0-2 [(bu, DET, 2, det)]
2-4 [(ev, NOUN, 3, nsubj)]
4-8 [(mavi, ADJ, 0, root)]
8-10 [(i, VERB, 3, cop)]
10-11 [(., PUNCT, 3, punct)]
As 4-8 and 8-10 fall within the range 4-10, both of those syntactic words would match, without having to rely on substring matching of the surface form.
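A rough sketch of this matching procedure (the data structures here are invented for illustration; a real implementation would read surface tokens and syntactic words from the CoNLL-U ID column):

```python
# Minimal sketch of the character-range idea above: surface tokens only
# delimit character ranges, and syntactic words are matched inside those
# ranges regardless of how the tokeniser split the surface string.

def char_ranges(surface_tokens):
    """Map each surface token to its (start, end) offsets in the
    whitespace-free sentence string."""
    ranges, pos = [], 0
    for tok in surface_tokens:
        ranges.append((pos, pos + len(tok)))
        pos += len(tok)
    return ranges

def align(gold, system):
    """Count system syntactic words whose character span falls inside a
    gold span covering the same annotation, never counting any gold word
    twice.

    gold / system: list of (surface_token, [annotation, ...]) pairs,
    where each annotation stands for one syntactic word."""
    def spans(side):
        out = []
        for (start, end), (_, words) in zip(
                char_ranges([t for t, _ in side]), side):
            for w in words:
                out.append(((start, end), w))
        return out

    gold_spans, matched = spans(gold), 0
    for (s_start, s_end), word in spans(system):
        for i, ((g_start, g_end), g_word) in enumerate(gold_spans):
            if g_start <= s_start and s_end <= g_end and word == g_word:
                matched += 1
                del gold_spans[i]   # each gold word may match only once
                break
    return matched

# The Turkish example above: gold has "mavidi" as one surface token
# spanning two syntactic words; tokenisation (b) splits it into "mavi|di".
gold = [("Bu", ["det"]), ("ev", ["nsubj"]),
        ("mavidi", ["root", "cop"]), (".", ["punct"])]
system_b = [("Bu", ["det"]), ("ev", ["nsubj"]),
            ("mavi", ["root"]), ("di", ["cop"]), (".", ["punct"])]
print(align(gold, system_b))   # all 5 syntactic words match
```

The spans 4-8 ("mavi") and 8-10 ("di") both fall inside the gold range 4-10 ("mavidi"), so both syntactic words are credited without any substring matching of surface forms.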
Caveats:
Just to clarify (I am not going to put it in the proposal but we will have to decide it later):
Are we going to require that people do word segmentation in Chinese (and Japanese, Thai, etc., if these languages are added to UD)? It would be in line with our End-to-End philosophy, but it is obviously harder than learning that "aux" = "à les".
UDPipe is probably not going to help here, right, @foxik? I think there would be neither SpaceAfter=No nor multi-word tokens, so UDPipe would need an option to treat every sentence as one huge multi-word token. But even then I suspect the accuracy will not be great, unless it does something Chinese-specific.
I'll open an issue for this. Our suggestion here in Turku is to provide the participants with parallel data and/or dictionaries and/or multilingual embeddings to support the development of lexicalized transfer methods in the shared task. Our argument is that if we do not allow any such resources, we limit people to delexicalized methods, which is 1) not a realistic setting and 2) does not use UD to its full potential.
Opinions?
The TIRA platform, recommended for evaluation, is at http://www.tira.io/. We need to know how it works, and it would be nice to know even before we submit the proposal. Unfortunately, I don't see any documentation on the website, and I don't see a playground where one could create a toy task and explore. Instead, one has to write an e-mail to [email protected] and first ask them to host the task. Should we take that step now?
I don't know whether
I think it was left undecided what to do with languages that have multiple treebanks. My relatively strong opinion is to treat them separately, i.e., not to pool the test sets. Participants can pool the training data any way they like, naturally. My primary reason is that it is well known that the treebanks are not necessarily all that compatible, and that will not be fully remedied before the ST. If we pool the test sets, the systems will be punished for differences between the individual treebanks. This would have the effect of 1) underestimating the learning capability of the parsers and 2) underestimating the actual performance of the parsers by superimposing test-data noise over the results.
I think getting a good estimate of the real performance of state-of-the-art parsers on the data would be an important outcome of the task, and we should try to avoid external measurement errors. I think it would not be good if we have to write into the ST report paper that "numbers for language X seem to be quite low, but that is probably because treebanks X.1 and X.2 are in the end quite different in the details and so we do not really know how well we can parse language X".
(with some exceptions -- Japanese because of the license, very small corpora and corpora which cannot be reliably detokenized).
How would you define "very small corpora"? Would this mean that Kazakh and Buryat are excluded?
Word embeddings are increasingly popular and provide nice accuracy gains in most parsers nowadays. I agree that we want to keep things simple (and hence there is a cost to allowing additional resources), but how about providing precomputed word2vec embeddings?
If we don't do it, we will end up with some artificially impoverished parsers...
I/We (=Google) can probably help generate these word embeddings if needed.
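If precomputed embeddings were distributed, the plain word2vec text format would be easy for every participant to consume; as a sketch, a stdlib-only reader for that format (a header line with vocabulary size and dimension, then one word and its vector per line) might look like this:

```python
# Minimal reader for the word2vec *text* format: a header line
# "<vocab_size> <dimension>" followed by one "<word> <v1> ... <vd>"
# line per word. Pure stdlib, for illustration only; real systems
# would likely use a dedicated library instead.
def load_word2vec_text(lines):
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"bad vector length for {word!r}"
        vectors[word] = values
    assert len(vectors) == vocab_size
    return vectors

# Tiny made-up embedding file with 2 words of dimension 3:
example = ["2 3", "house 0.1 0.2 0.3", "blue -0.5 0.0 0.25"]
emb = load_word2vec_text(example)
print(emb["blue"])   # [-0.5, 0.0, 0.25]
```

Agreeing on one fixed format (and one fixed set of vectors per language) would also keep the playing field level across participants.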
Does the 10,000-token count for the evaluation corpus include punctuation tokens? How should it be calculated?