
Comments (11)

dan-zeman commented on June 3, 2024

My understanding was that the Google test data will be used in addition to the test data coming directly from UD treebanks, and that is also what I would advocate for.

from conll2017.

martinpopel commented on June 3, 2024

OK. Will we concatenate the two test sets or will we report two scores?

from conll2017.

dan-zeman commented on June 3, 2024

I would definitely report two scores, at least in the additional evaluation we provide. It could be interesting to compare. A somewhat related issue is whether the evaluated system will know the test set (treebank) id in addition to the language id (i.e. if it is parsing Finnish FTB, it may want to avoid a model trained on the Turku DT, and vice versa; for the Google data it may want to use both, or whatever).

I currently have no strong position on whether the two (or more) scores of one language should all contribute separately to the macro-averaged global ranking. For this purpose we may want to concatenate, so that people do not complain that Finnish has three votes in the final score and Irish only one.

(We could just as well say that we treat Turku-Finnish and FTB-Finnish as two separate languages. After all, they are probably farther apart than Ancora Spanish is from Ancora Catalan. But I am not very excited about going that way, and I also do not see how the Google data would fit into the picture – they would effectively become surprise "languages".)

from conll2017.
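To make the trade-off above concrete, here is a minimal sketch (Python, with made-up LAS scores and word counts, not real CoNLL 2017 results) contrasting the two options: counting every test set separately in the macro-average versus concatenating a language's test sets before averaging over languages.

```python
# Illustrative sketch only: hypothetical LAS scores and test-set sizes.
scores = {
    "fi":     (80.0, 15000),  # UD_Finnish (Turku DT): LAS, word count
    "fi_ftb": (78.0, 14000),  # UD_Finnish-FTB
    "ga":     (70.0, 10000),  # UD_Irish
}

# Option A: every test set is a separate item in the macro-average,
# so Finnish effectively gets two "votes".
macro_per_testset = sum(las for las, _ in scores.values()) / len(scores)

# Option B: concatenate a language's test sets first (a word-weighted
# average within the language), then macro-average over languages.
by_language = {}
for tb_id, (las, words) in scores.items():
    lang = tb_id.split("_")[0]
    correct, total = by_language.get(lang, (0.0, 0))
    by_language[lang] = (correct + las * words / 100.0, total + words)
macro_per_language = sum(100.0 * c / t for c, t in by_language.values()) / len(by_language)

print(f"per test set:  {macro_per_testset:.2f}")    # 76.00
print(f"per language:  {macro_per_language:.2f}")   # ~74.52
```

Under option A Finnish contributes two of the three terms in the average; under option B it contributes one of two, which is exactly the "three votes for Finnish, one for Irish" concern.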

martinpopel commented on June 3, 2024

If we provide two scores per language (Google test set plus standard test set), which one will be the official one?

Another option is to concatenate the two test sets for the final results (as you suggested for the multi-treebank languages). The Google test set will be 500 sentences, while the standard test set will be >10,000 words. Even though the standard test set is much bigger than 500 sentences, the final score will still be influenced by the Google test set, so we should acknowledge that it is indeed a domain-adaptation task.

from conll2017.
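A back-of-the-envelope check of how much weight a 500-sentence set would carry under concatenation; the ~15 tokens per sentence and the 12,000-word standard set are assumptions for illustration, not known properties of the data.

```python
# Hypothetical numbers: tokens/sentence and standard test-set size are assumed.
google_sentences = 500
tokens_per_sentence = 15
google_tokens = google_sentences * tokens_per_sentence   # 7,500
standard_tokens = 12_000                                  # ">10,000 words"

share = google_tokens / (google_tokens + standard_tokens)
print(f"Google share of the concatenated test set: {share:.1%}")  # ~38.5%
```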

dan-zeman commented on June 3, 2024

I would treat the Google test set the same way (whatever it is) as we treat multiple treebanks per language. Either we concatenate them, or we count each separately in the macro average. But neither of them is more official than the others.

from conll2017.

fginter commented on June 3, 2024

If I understood this correctly, the Google data is by no means guaranteed to become available, so I would not stress it too much in the proposal. I myself would simply pool the Google data with the test set of each language for the primary measure, and then additionally report results on the Google data separately, where applicable.

from conll2017.

dan-zeman commented on June 3, 2024

As discussed in #12, we will have multiple test sets for certain languages, even without Google, and we will count each of them separately in the macro-average. If we have the additional Google data (I am now writing in the proposal that "we hope to be able to obtain..."), it seems natural to me to make them yet another piece in the macro-average, i.e. not concatenating them with any of the pre-existing test sets.

BTW, the existing test sets may also require domain adaptation if e.g. you train on UD_Finnish and parse UD_Finnish-FTB. We will tell the systems that the given test set's language code is "fi". Question: are we also going to provide a treebank identifier, so that the system knows whether it is the Turku DT, FTB, or the additional Google data? I lean towards answering yes, we are. Then the system can decide whether to do domain adaptation (and have more training data) for the first two, but it cannot avoid it for the Google data.

from conll2017.

foxik commented on June 3, 2024

I would also give the full UD treebank identifier (fi_ftb) -- otherwise, people would have to do some kind of domain adaptation in any case (because if you are asked to parse fi, you do not know whether it is fi or fi_ftb; you could use the same model for both cases, or you could try to recognize which version of fi it is and use the corresponding model, etc.). If we give the full identifier, participants can decide whether they want to perform domain adaptation or not.

Note that if we provide the full identifier, the Google data should get a unique one (maybe with the same language prefix, something like cs_par or cs_gpar, etc.).

from conll2017.
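A minimal sketch of what the full identifier would buy participants, assuming a hypothetical per-treebank model registry; the cs_gpar code is only the naming idea floated above, not a fixed convention.

```python
# Hypothetical model registry keyed by the full treebank identifier;
# falling back to the bare language code covers identifiers with no
# dedicated model (e.g. an out-of-domain Google set such as "cs_gpar").
models = {
    "fi":     "model-fi-turku",
    "fi_ftb": "model-fi-ftb",
    "cs":     "model-cs-pdt",
}

def pick_model(treebank_id: str) -> str:
    """Prefer a per-treebank model, else fall back to the language model."""
    if treebank_id in models:
        return models[treebank_id]
    return models[treebank_id.split("_")[0]]

print(pick_model("fi_ftb"))   # model-fi-ftb
print(pick_model("cs_gpar"))  # model-cs-pdt (domain adaptation unavoidable)
```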

fginter commented on June 3, 2024
  • I would provide the full identifier.
  • I think counting the (hypothetical) Google test data into the macro-average would be problematic unless the size of the data is in the vicinity of 10K words and above. In the Berlin discussion, the main reasoning behind the 10K-word test set threshold was that scores on smaller test sets are unreliable (because the words are not independent). So if we add smaller test sets into the macro-average, we will magnify the noise a lot, partly masking the much larger test sets from the original treebanks (a rough sketch of the size effect follows this comment).
  • I do realize the problems of pooling the test sets with the Google data (effectively micro-averaging), but I think it is the lesser evil here.

from conll2017.
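A rough way to quantify the size argument, assuming (optimistically) that each word is an independent Bernoulli trial; since words within a sentence are not independent, as noted above, real uncertainty would be even larger.

```python
import math

def half_width_95(p: float, n: int) -> float:
    """95% CI half-width for an accuracy-like score p measured on n words."""
    return 1.96 * math.sqrt(p * (1.0 - p) / n)

p = 0.80  # hypothetical LAS
for n in (2_000, 5_000, 10_000, 25_000):
    print(f"n={n:>6} words: ±{100 * half_width_95(p, n):.2f} LAS points")
# n=  2000 words: ±1.75
# n= 25000 words: ±0.50
```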

dan-zeman commented on June 3, 2024

I was hoping that if the Google data is available, it will be large enough. If that turns out not to be the case, then I agree we should do something else. So let's wait until we have the data, and let's not delve into details in the proposal.

from conll2017.

dan-zeman commented on June 3, 2024

Update: the data should be available and it should be 1000 sentences, so probably large enough (although it depends on the token-per-sentence ratio).

We now promise that the data will exist, and we explicitly admit that it will not be from the same domain as the UD training data. I am tentatively closing this issue.

from conll2017.
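A quick sanity check of "probably large enough": whether 1000 sentences clear the 10K-word threshold depends entirely on the tokens-per-sentence ratio; the ratios below are illustrative, not measured values.

```python
# 1000 sentences clear the 10K-word bar only if sentences average
# at least 10 tokens.
sentences = 1000
for tokens_per_sentence in (8, 12, 18):
    words = sentences * tokens_per_sentence
    status = "above" if words >= 10_000 else "below"
    print(f"{tokens_per_sentence} tokens/sentence -> {words} words ({status} the 10K threshold)")
```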
