Comments (11)
My understanding was that the Google test data will be used in addition to the test data coming directly from UD treebanks, and that is also what I would advocate for.
from conll2017.
OK. Will we concatenate the two test sets or will we report two scores?
I would definitely report two scores, at least in the additional evaluation we provide; it could be interesting to compare them. A somewhat related issue is whether the evaluated system will know the test set (treebank) id in addition to the language id (i.e. if it is parsing Finnish FTB, it may want to avoid a model trained on TurkuDT, and vice versa; for the Google data it may want to use both, or whatever).
I currently have no strong position on whether the two (or more) scores of one language should all contribute separately to the macro-averaged global ranking. For that purpose we may want to concatenate, so that people do not complain that Finnish has three votes in the final score while Irish has only one.
(We could as well say that we treat Turku-Finnish and FTB-Finnish as two separate languages. After all, they are probably farther apart than Ancora Spanish from Ancora Catalan. But I am not very excited about going this way, and I also do not see how Google data would fit in the picture – they would effectively become surprise "languages".)
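The two aggregation options above (each test set votes separately vs. one pooled vote per language) can be sketched as follows; the treebank ids, scores, and sizes are purely illustrative assumptions, not real results.

```python
from collections import defaultdict

# (treebank_id, LAS, test-set size in tokens) -- hypothetical numbers
scores = [
    ("fi",     0.80, 20000),
    ("fi_ftb", 0.75, 15000),
    ("ga",     0.70, 10000),
]

# Option A: plain macro-average, every test set is one vote
# (Finnish effectively gets two votes here, Irish one)
option_a = sum(las for _, las, _ in scores) / len(scores)

# Option B: pool each language's test sets (token-weighted),
# then macro-average over languages, so each language is one vote
per_lang = defaultdict(lambda: [0.0, 0])
for tb_id, las, size in scores:
    lang = tb_id.split("_")[0]
    per_lang[lang][0] += las * size
    per_lang[lang][1] += size
option_b = sum(weighted / tokens for weighted, tokens in per_lang.values()) / len(per_lang)

print(round(option_a, 4))  # 0.75
print(round(option_b, 4))  # 0.7393 -- Finnish counts once
```

The gap between the two numbers is exactly the "three votes for Finnish" effect: option A rewards languages with many treebanks, option B does not.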
If we provide two scores per language (Google test set plus standard test set), which one will be the official one?
Another option is to concatenate the two test sets for the final results (as you suggested for the multi-treebank languages). The Google test set will be 500 sentences; the standard test set will be >10,000 words. Even though the standard test set is much bigger than 500 sentences, the final score will still be influenced by the Google test set, so we should acknowledge that it is indeed a domain-adaptation task.
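A rough sketch of how much the smaller set would weigh in a concatenated (token-pooled) score; all figures here are illustrative assumptions, including the ~15 tokens/sentence guess for the 500 Google sentences.

```python
# Hypothetical in-domain score on the standard UD test set
std_tokens, std_las = 12000, 0.82
# Hypothetical out-of-domain score on the Google set: ~500 sentences
# at an assumed ~15 tokens/sentence
goog_tokens, goog_las = 7500, 0.70

# Token-weighted pooled score over the concatenated test data
pooled = (std_tokens * std_las + goog_tokens * goog_las) / (std_tokens + goog_tokens)
print(round(pooled, 4))  # 0.7738
```

Under these assumptions the Google portion drags the pooled score several points below the in-domain one, which is why the concatenated metric really is a domain-adaptation metric.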
I would treat the Google test set (whatever it turns out to be) the same way we treat multiple treebanks per language: either we concatenate them, or we count each separately in the macro average. But neither of them is more official than the others.
If I understood this correctly, the Google data is by no means guaranteed to become available, so I would not stress it too much in the proposal. I myself would simply pool the Google data with the test set of each language for the primary measure, and then report yet separately results on the Google data, where applicable.
As discussed in #12, we will have multiple test sets for certain languages, even without Google, and we will count each of them separately in the macro-average. If we have the additional Google data (I am now writing in the proposal that "we hope to be able to obtain..."), it seems natural to me to make them yet another piece in the macro-average, i.e. not concatenating them with any of the pre-existing test sets.
BTW, the existing test sets may also require domain adaptation if e.g. you train on UD_Finnish and parse UD_Finnish-FTB. We will tell the systems that the given test set's language code is "fi". Question: are we also going to provide a treebank identifier, so that the system knows whether it is Turku DT, FTB, or the additional Google data? I lean towards answering yes, we are. Then the system can decide whether to do domain adaptation (and have more training data) for the first two, but it cannot avoid it for the Google data.
I would also give the full UD treebank identifier (fi_ftb) -- otherwise, people would have to do some kind of domain adaptation in any case (because if you should parse fi, you do not know whether it is fi or fi_ftb; you could have the same model for both cases, or you could try recognizing which version of fi this is and use the corresponding model, etc.). If we give the full identifier, participants can decide whether they want to perform domain adaptation or not.
Note that if we provide the full identifier, the Google data should get a unique one (maybe the same prefix for languages, something like cs_par or cs_gpar etc.).
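The model-selection logic a participant could build on top of the full identifier might look like this minimal sketch; the identifiers, model names, and the delexicalized fallback are all hypothetical.

```python
# Hypothetical mapping from treebank/language identifiers to trained models
models = {"fi": "model-turku", "fi_ftb": "model-ftb", "cs": "model-pdt"}

def pick_model(treebank_id: str) -> str:
    """Prefer an exact treebank match; otherwise fall back to the
    language code, e.g. for unseen sets like the Google data (cs_gpar)."""
    if treebank_id in models:
        return models[treebank_id]
    lang = treebank_id.split("_")[0]
    return models.get(lang, "delexicalized-fallback")

print(pick_model("fi_ftb"))   # model-ftb
print(pick_model("cs_gpar"))  # falls back to language code: model-pdt
```

Without the full identifier, the first branch is impossible and every test set has to go through the fallback path, which is exactly the forced domain adaptation described above.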
- I would provide the full identifier
- I think counting the (hypothetical) Google test data into the macro-average would be problematic unless the data size is in the vicinity of 10K words and above. In the Berlin discussion, the main reasoning behind the 10K-word test-set threshold was that scores on smaller test sets are unreliable (because the words are not independent). So if we add smaller test sets into the macro-average, we will magnify the noise a lot, partly masking the much larger test sets from the original treebanks.
- I do realize the problems of pooling the test sets with the Google data (effectively micro-averaging), but I think it is the lesser evil here
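A back-of-envelope illustration of the noise argument: if we (optimistically) treat token-level attachment decisions as independent, an accuracy-style score has standard error sqrt(p(1-p)/n). Real variance is higher because words within a sentence are correlated, so these are lower bounds.

```python
import math

def score_stderr(p: float, n_tokens: int) -> float:
    # Standard error of an accuracy-style score over n independent tokens;
    # an optimistic (lower-bound) model, since tokens in a sentence correlate
    return math.sqrt(p * (1 - p) / n_tokens)

print(round(score_stderr(0.80, 10000), 4))  # 0.004  -- at the 10K threshold
print(round(score_stderr(0.80, 1000), 4))   # 0.0126 -- a 1K-token set
```

A test set ten times smaller contributes roughly sqrt(10) ≈ 3x more noise per macro-average vote, which is the masking effect described above.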
I was hoping that if the Google data is available, it will be large enough. If that turns out not to be the case, then I agree we should do something else. So let's wait until we have the data, and let's not delve into details in the proposal.
Update: the data should be available and it should be 1000 sentences, so probably large enough (although it depends on the token-per-sentence ratio).
We now promise that the data will exist and we explicitly admit that it will not be the same domain as the UD training data. I am tentatively closing this issue.
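The "probably large enough" claim is quick arithmetic against the 10K-word threshold; the tokens-per-sentence figures below are illustrative assumptions, since the actual ratio varies by language.

```python
# 1000 sentences against the 10K-token threshold, for a few assumed
# tokens-per-sentence ratios
sentences = 1000
for tokens_per_sentence in (8, 12, 20):
    total = sentences * tokens_per_sentence
    print(tokens_per_sentence, total, total >= 10000)
```

So the set clears the threshold for any ratio above ~10 tokens/sentence, which most UD languages exceed, but a very short-sentence sample could still fall under it.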
Related Issues (14)
- The TIRA platform
- Parallel data
- Languages with multiple treebanks
- Chinese word segmentation
- Suggestion for dealing with tokenisation mismatches
- Definition of "very small corpora"
- UDError: Cannot parse HEAD '_'
- How to compute the overall score
- Additional Resources?
- Task definition
- Evaluation metrics
- Token count for evaluation corpus
- Surprise languages vs. trainable parser task