Comments (11)
My understanding was that the Google test data will be used in addition to the test data coming directly from UD treebanks, and that is also what I would advocate for.
from conll2017.
OK. Will we concatenate the two test sets or will we report two scores?
I would definitely report two scores, at least in the additional evaluation we provide; it could be interesting to compare them. A somewhat related issue is whether the evaluated system will know the test set (treebank) id in addition to the language id (i.e. if it is parsing Finnish FTB, it may want to avoid a model trained on TurkuDT, and vice versa; for the Google data it may want to use both, or whatever).
I currently have no strong position on whether the two (or more) scores of one language should all contribute separately to the macro-averaged global ranking. For that purpose we may want to concatenate, so that people do not complain that Finnish has three votes in the final score while Irish has only one.
(We could as well say that we treat Turku-Finnish and FTB-Finnish as two separate languages. After all, they are probably farther apart than Ancora Spanish from Ancora Catalan. But I am not very excited about going this way, and I also do not see how Google data would fit in the picture – they would effectively become surprise "languages".)
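The two aggregation options above (each test set votes separately vs. one pooled vote per language) can be sketched as follows; the treebank ids, scores, and sizes are purely illustrative assumptions, not real results.

```python
from collections import defaultdict

# (treebank_id, LAS, test-set size in tokens) -- hypothetical numbers
scores = [
    ("fi",     0.80, 20000),
    ("fi_ftb", 0.75, 15000),
    ("ga",     0.70, 10000),
]

# Option A: plain macro-average, every test set is one vote
# (Finnish effectively gets two votes here, Irish one)
option_a = sum(las for _, las, _ in scores) / len(scores)

# Option B: pool each language's test sets (token-weighted),
# then macro-average over languages, so each language is one vote
per_lang = defaultdict(lambda: [0.0, 0])
for tb_id, las, size in scores:
    lang = tb_id.split("_")[0]
    per_lang[lang][0] += las * size
    per_lang[lang][1] += size
option_b = sum(weighted / tokens for weighted, tokens in per_lang.values()) / len(per_lang)

print(round(option_a, 4))  # 0.75
print(round(option_b, 4))  # 0.7393 -- Finnish counts once
```

The gap between the two numbers is exactly the "three votes for Finnish" effect: option A rewards languages with many treebanks, option B does not.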
If we provide two scores per language (Google test set plus standard test set), which one will be the official one?
Another option is to concatenate the two test sets for the final results (as you suggested for the multi-treebank languages). The Google test set will be 500 sentences; the standard test set will be >10,000 words. Even though the standard test set is much bigger than 500 sentences, the final score will still be influenced by the Google test set, so we should acknowledge that it is indeed a domain-adaptation task.
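A rough sketch of how much the smaller set would weigh in a concatenated (token-pooled) score; all figures here are illustrative assumptions, including the ~15 tokens/sentence guess for the 500 Google sentences.

```python
# Hypothetical in-domain score on the standard UD test set
std_tokens, std_las = 12000, 0.82
# Hypothetical out-of-domain score on the Google set: ~500 sentences
# at an assumed ~15 tokens/sentence
goog_tokens, goog_las = 7500, 0.70

# Token-weighted pooled score over the concatenated test data
pooled = (std_tokens * std_las + goog_tokens * goog_las) / (std_tokens + goog_tokens)
print(round(pooled, 4))  # 0.7738
```

Under these assumptions the Google portion drags the pooled score several points below the in-domain one, which is why the concatenated metric really is a domain-adaptation metric.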
I would treat the Google test set (whatever it turns out to be) the same way we treat multiple treebanks per language: either we concatenate them, or we count each separately in the macro average. But neither of them is more official than the others.
If I understood this correctly, the Google data is by no means guaranteed to become available, so I would not stress it too much in the proposal. I myself would simply pool the Google data with the test set of each language for the primary measure, and then report yet separately results on the Google data, where applicable.
As discussed in #12, we will have multiple test sets for certain languages, even without Google, and we will count each of them separately in the macro-average. If we have the additional Google data (I am now writing in the proposal that "we hope to be able to obtain..."), it seems natural to me to make them yet another piece in the macro-average, i.e. not concatenating them with any of the pre-existing test sets.
BTW, the existing test sets may also require domain adaptation if e.g. you train on UD_Finnish and parse UD_Finnish-FTB. We will tell the systems that the given test set's language code is "fi". Question: are we also going to provide a treebank identifier, so that the system knows whether it is Turku DT, FTB, or the additional Google data? I lean towards answering yes, we are. Then the system can decide whether to do domain adaptation (and have more training data) for the first two, but it cannot avoid it for the Google data.
I would also give the full UD treebank identifier (fi_ftb) -- otherwise, people would have to do some kind of domain adaptation in any case (because if you should parse fi, you do not know whether it is fi or fi_ftb; you could have the same model for both cases, or you could try recognizing which version of fi this is and use the corresponding model, etc.). If we give the full identifier, participants can decide whether they want to perform domain adaptation or not.
Note that if we provide the full identifier, the Google data should get a unique one (maybe the same prefix for languages, something like cs_par or cs_gpar etc.).
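The model-selection logic a participant could build on top of the full identifier might look like this minimal sketch; the identifiers, model names, and the delexicalized fallback are all hypothetical.

```python
# Hypothetical mapping from treebank/language identifiers to trained models
models = {"fi": "model-turku", "fi_ftb": "model-ftb", "cs": "model-pdt"}

def pick_model(treebank_id: str) -> str:
    """Prefer an exact treebank match; otherwise fall back to the
    language code, e.g. for unseen sets like the Google data (cs_gpar)."""
    if treebank_id in models:
        return models[treebank_id]
    lang = treebank_id.split("_")[0]
    return models.get(lang, "delexicalized-fallback")

print(pick_model("fi_ftb"))   # model-ftb
print(pick_model("cs_gpar"))  # falls back to language code: model-pdt
```

Without the full identifier, the first branch is impossible and every test set has to go through the fallback path, which is exactly the forced domain adaptation described above.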
- I would provide the full identifier
- I think counting the (hypothetical) Google test data into the macro-average would be problematic unless the data size is in the vicinity of 10K words and above. In the Berlin discussion, the main reasoning behind the 10K-word test-set threshold was that scores on smaller test sets are unreliable (because the words are not independent). So if we add smaller test sets into the macro-average, we will magnify the noise a lot, partly masking the much larger test sets from the original treebanks.
- I do realize the problems of pooling the test sets with the Google data (effectively micro-averaging), but I think it is the lesser evil here
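A back-of-envelope illustration of the noise argument: if we (optimistically) treat token-level attachment decisions as independent, an accuracy-style score has standard error sqrt(p(1-p)/n). Real variance is higher because words within a sentence are correlated, so these are lower bounds.

```python
import math

def score_stderr(p: float, n_tokens: int) -> float:
    # Standard error of an accuracy-style score over n independent tokens;
    # an optimistic (lower-bound) model, since tokens in a sentence correlate
    return math.sqrt(p * (1 - p) / n_tokens)

print(round(score_stderr(0.80, 10000), 4))  # 0.004  -- at the 10K threshold
print(round(score_stderr(0.80, 1000), 4))   # 0.0126 -- a 1K-token set
```

A test set ten times smaller contributes roughly sqrt(10) ≈ 3x more noise per macro-average vote, which is the masking effect described above.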
I was hoping that if the Google data is available, it will be large enough. If that turns out not to be the case, then I agree we should do something else. So let's wait until we have the data, and let's not delve into details in the proposal.
Update: the data should be available and it should be 1000 sentences, so probably large enough (although it depends on the token-per-sentence ratio).
We now promise that the data will exist and we explicitly admit that it will not be the same domain as the UD training data. I am tentatively closing this issue.
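The "probably large enough" claim is quick arithmetic against the 10K-word threshold; the tokens-per-sentence figures below are illustrative assumptions, since the actual ratio varies by language.

```python
# 1000 sentences against the 10K-token threshold, for a few assumed
# tokens-per-sentence ratios
sentences = 1000
for tokens_per_sentence in (8, 12, 20):
    total = sentences * tokens_per_sentence
    print(tokens_per_sentence, total, total >= 10000)
```

So the set clears the threshold for any ratio above ~10 tokens/sentence, which most UD languages exceed, but a very short-sentence sample could still fall under it.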
Related Issues (14)
- The TIRA platform
- Parallel data
- Languages with multiple treebanks
- Chinese word segmentation
- Suggestion for dealing with tokenisation mismatches
- Definition of "very small corpora"
- UDError: Cannot parse HEAD '_'
- How to compute the overall score
- Additional Resources?
- Task definition
- Evaluation metrics
- Token count for evaluation corpus
- Surprise languages vs. trainable parser task