
Comments (11)

unDocUMeantIt avatar unDocUMeantIt commented on August 20, 2024

thanks for reporting!

well, apart from the really strange tags in the error message (numbers as tags?), this is not how you should tag dutch texts. if you set the preset to "en" (which is "english") but use the dutch tagging script, TreeTagger will likely return incompatible tags.

there is a dutch language package: https://reaktanz.de/R/pckg/koRpus.lang.nl/

please let me know if you need assistance to get it working.

from korpus.

basbaccarne avatar basbaccarne commented on August 20, 2024

With the koRpus.lang.nl package, it is indeed possible to select nl as a language, but the output still produces the same error:

library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="nl", preset="nl")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))


        token   tag      lemma
1          Ze   300         ze
2        zien   254       zien
3         wat   400        wat
4          we   300         we
5      liever   154       lief
6   verborgen   219  verborgen
7      houden   210     houden
8           , PUNCT          ,
9     brengen   254    brengen
10       orde   000       orde
11         in   600         in
12       onze   333       onze
13      chaos   000      chaos
14         en   700         en
15     zetten   256     zetten
16        ons   303        ons
17         al   500         al
18       eens   500       eens
19        een   450        een
20    spiegel   000    spiegel
21       voor  6105       voor
23         We   300         we
24      geven   254      geven
25   weinigen   441     weinig
26       meer   454       meer
27     inkijk   000     inkijk
28        dan   720        dan
29         de   370         de
30 poetshulp*   010 poetshulp*
32         Ik   000         Ik
33       kijk   247     kijken
35        hun   330        hun
36       ogen   001        oog
38         ik   300         ik
39       weet   251      weten
40     meteen   500     meteen
41       welk   410       welk
42      vlees   000      vlees
46       kuip   000       kuip
47        heb   252     hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maintaner!


unDocUMeantIt avatar unDocUMeantIt commented on August 20, 2024

try this (i.e., omit the batch script):
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")

and also omit the "TT.options" in your treetag() call, they're already set by set.kRp.env.
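Put together, the suggestion above might look like the following sketch. It assumes TreeTagger lives under the path used in this thread and that both koRpus and koRpus.lang.nl are installed; `tt_path`, `sample_text` and `tagged` are names made up for illustration, and the whole thing is guarded so it does nothing when those assumptions do not hold.

```r
# Illustrative names only: tt_path, sample_text and tagged are not part of koRpus.
tt_path <- "c://treetagger"  # install location as used in this thread
sample_text <- "Ze zien wat we liever verborgen houden."  # fragment of the text above
tagged <- NULL

# Only run when both packages and the TreeTagger directory are available.
if (requireNamespace("koRpus", quietly = TRUE) &&
    requireNamespace("koRpus.lang.nl", quietly = TRUE) &&
    dir.exists(tt_path)) {
  library("koRpus")
  library("koRpus.lang.nl")

  # Configure TreeTagger once for the session (no batch script).
  set.kRp.env(
    TT.cmd = "manual",
    TT.options = list(path = tt_path, preset = "nl"),
    lang = "nl"
  )

  # No TT.options here: treetag() picks up the settings made above.
  tagged <- treetag(sample_text, format = "obj")
}
```

Keeping the options in one place avoids the two calls drifting apart (note the differing path spellings in the earlier snippet).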


basbaccarne avatar basbaccarne commented on August 20, 2024

This produces

Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file, :
object 'TT.call.file' not found


unDocUMeantIt avatar unDocUMeantIt commented on August 20, 2024

ouch, now you've discovered a genuine bug in treetag() :-D at least, in the windows version.

i hope i fixed it, see commit 98c4059. can you test the package from the "develop" branch as described in this section:
https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github
?

if that's not an option i could also build a windows package for testing, if you send me your e-mail address (mine's in the package description).


basbaccarne avatar basbaccarne commented on August 20, 2024

The object 'TT.call.file' is now found, but the output produces the first error again:

library(devtools)
install_github("unDocUMeantIt/koRpus", ref="develop", force = TRUE)
library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
output <- treetag(testobject, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))

        token   tag     lemma
1          Ze   300        ze
2        zien   254      zien
3         wat   400       wat
4          we   300        we
5      liever   154      lief
6   verborgen   219 verborgen
7      houden   210    houden
8           , PUNCT         ,
9     brengen   254   brengen
[...]
41       welk   410      welk
42      vlees   000     vlees
46       kuip   000 <unknown>
47        heb   252    hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
  This is probably due to a missing tag in kRp.POS.tags() and
  needs to be fixed. It would be nice if you could forward the
  above error dump as a bug report to the package maintaner!


unDocUMeantIt avatar unDocUMeantIt commented on August 20, 2024

thank you for testing, good to know we're one bug down.

now, that error is really odd. two things:

  1. can you send me the file you're tagging for debugging purposes? i would like to replicate the problem on my side.
  2. can you set debug=TRUE in your treetag() call and post the output (that is treetag(testobject, format="obj", debug=TRUE), you don't need to repeat TT.options)? it should include the full TreeTagger command that is being executed in the background. you should be able to copy&paste all of this command and run it in a windows cmd.exe shell -- if that returns numbers as tags already, then the problem could be on TreeTagger's side (i.e. your local TreeTagger configuration)


unDocUMeantIt avatar unDocUMeantIt commented on August 20, 2024

oh wait, i think i found the root of the problem -- are you using the dutch2 parameter set trained on the eindhoven corpus? i don't speak dutch, but it looks to me like that one uses a totally different tagset definition: http://tst-centrale.org/images/stories/producten/documentatie/ehc_handleiding_nl.pdf

to check this, can you temporarily replace your parameter file with the first alternative from the TreeTagger webpage: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-par-linux-3.2-utf8.bin.gz
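A quick way to spot this kind of tagset mismatch before swapping files is to look at the shape of the tags themselves. The sketch below uses only base R and the tags from the error message above; `reported_tags` and `numeric_like` are illustrative names, and the "mostly numeric means a different tagset" rule is a heuristic assumption, not something koRpus checks for you.

```r
# Tags copied verbatim from the "Invalid tag(s) found" error above.
reported_tags <- c(
  "300", "254", "400", "154", "219", "210", "PUNCT", "000", "600",
  "333", "700", "256", "303", "500", "450", "6105", "441", "454",
  "720", "370", "010", "247", "330", "001", "251", "410", "252"
)

# TRUE for tags that consist of digits only.
numeric_like <- grepl("^[0-9]+$", reported_tags)

# Heuristic: if nearly all rejected tags are purely numeric, the parameter
# file probably uses a numeric tagset that the language package does not define.
if (mean(numeric_like) > 0.9) {
  message(
    sum(numeric_like), " of ", length(reported_tags),
    " rejected tags are purely numeric; check which parameter file is installed."
  )
}
```

If only one or two tags were rejected, a missing definition in the language package would be the more likely explanation.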


basbaccarne avatar basbaccarne commented on August 20, 2024

That was it!
This version of dutch-utf8.par works like a charm.
What a relief, thank you for solving this problem and the quick responses.


unDocUMeantIt avatar unDocUMeantIt commented on August 20, 2024

ok, that's a relief :-) i'm closing the issue, then.

btw, if you feel up for it, you could contribute the missing tag definitions to the koRpus.lang.nl package, so users can use both parameter files. it's not really complicated -- let me know and i'll walk you through the process.


basbaccarne avatar basbaccarne commented on August 20, 2024
