Comments (11)
thanks for reporting!
well, apart from the really strange tags in the error message (numbers as tags?), this is not how you should tag dutch texts. if you set the preset to "en" (which is "english") but use the dutch tagging script, TreeTagger will likely return incompatible tags.
there is a dutch language package: https://reaktanz.de/R/pckg/koRpus.lang.nl/
please let me know if you need assistance to get it working.
from korpus.
With the koRpus.lang.nl package, it is indeed possible to select nl as a language, but the output still produces the same error:
library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="nl", preset="nl")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))
token tag lemma
1 Ze 300 ze
2 zien 254 zien
3 wat 400 wat
4 we 300 we
5 liever 154 lief
6 verborgen 219 verborgen
7 houden 210 houden
8 , PUNCT ,
9 brengen 254 brengen
10 orde 000 orde
11 in 600 in
12 onze 333 onze
13 chaos 000 chaos
14 en 700 en
15 zetten 256 zetten
16 ons 303 ons
17 al 500 al
18 eens 500 eens
19 een 450 een
20 spiegel 000 spiegel
21 voor 6105 voor
23 We 300 we
24 geven 254 geven
25 weinigen 441 weinig
26 meer 454 meer
27 inkijk 000 inkijk
28 dan 720 dan
29 de 370 de
30 poetshulp* 010 poetshulp*
32 Ik 000 Ik
33 kijk 247 kijken
35 hun 330 hun
36 ogen 001 oog
38 ik 300 ik
39 weet 251 weten
40 meteen 500 meteen
41 welk 410 welk
42 vlees 000 vlees
46 kuip 000 kuip
47 heb 252 hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maintaner!
from korpus.
try this (i.e., omit the batch script):
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
and also omit the "TT.options" in your treetag() call, they're already set by set.kRp.env.
from korpus.
This produces
Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file, :
object 'TT.call.file' not found
from korpus.
ouch, now you've discovered a genuine bug in treetag() :-D at least, in the windows version.
i hope i fixed it, see commit 98c4059. can you test the package from the "develop" branch as described in this section?
https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github
if that's not an option i could also build a windows package for testing, if you send me your e-mail address (mine's in the package description).
from korpus.
The object 'TT.call.file' is now found, but the output produces the first error again:
library(devtools)
install_github("unDocUMeantIt/koRpus", ref="develop", force = TRUE)
library("koRpus")
library("koRpus.lang.nl")
set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
output <- treetag(testobject, format="obj", TT.options=list(path="c://TreeTagger", preset="nl"))
token tag lemma
1 Ze 300 ze
2 zien 254 zien
3 wat 400 wat
4 we 300 we
5 liever 154 lief
6 verborgen 219 verborgen
7 houden 210 houden
8 , PUNCT ,
9 brengen 254 brengen
[...]
41 welk 410 welk
42 vlees 000 vlees
46 kuip 000 <unknown>
47 heb 252 hebben
Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and
needs to be fixed. It would be nice if you could forward the
above error dump as a bug report to the package maintaner!
from korpus.
thank you for testing, good to know we're one bug down.
now, that error is really odd. two things:
- can you send me the file you're tagging for debugging purposes? i would like to replicate the problem on my side.
- can you set debug=TRUE in your treetag() call and post the output (that is treetag(testobject, format="obj", debug=TRUE), you don't need to repeat TT.options)? it should include the full TreeTagger command that is being executed in the background. you should be able to copy&paste all of this command and run it in a windows cmd.exe shell -- if that returns numbers as tags already, then the problem could be on TreeTagger's side (i.e. your local TreeTagger configuration)
from korpus.
oh wait, i think i found the root of the problem -- are you using the dutch2 parameter set trained on the eindhoven corpus? i don't speak dutch, but it looks to me like that one uses a totally different tagset definition: http://tst-centrale.org/images/stories/producten/documentatie/ehc_handleiding_nl.pdf
to check this, can you temporarily replace your parameter file with the first alternative from the TreeTagger webpage: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-par-linux-3.2-utf8.bin.gz
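A quick way to spot this kind of tagset mismatch before swapping parameter files is to look at whether the returned tags are mostly plain numbers. The following is only a minimal base R sketch: `numeric_tags()` is a hypothetical helper, not part of koRpus, and the sample tags are copied by hand from the output posted earlier in this thread.

```r
# Hypothetical helper, not part of koRpus: return the distinct tags
# that consist of digits only, in order of first appearance.
numeric_tags <- function(tags) {
  unique(tags[grepl("^[0-9]+$", tags)])
}

# A few tags copied from the treetag() output earlier in this thread.
observed <- c("300", "254", "400", "300", "154", "PUNCT", "000")

suspicious <- numeric_tags(observed)
print(suspicious)

# Share of tokens carrying a digits-only tag; a value near 1 hints that
# the parameter file uses a numeric tagset.
print(mean(observed %in% suspicious))
```

If nearly all tags are digits-only, the parameter file is probably the one to check rather than the R code. A stricter check would compare the tags against the tagset koRpus actually knows for the language, which the error message says is defined in kRp.POS.tags().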
from korpus.
That was it!
This version of dutch-utf8.par works like a charm.
What a relief, thank you for solving this problem and the quick responses.
from korpus.
ok, that's a relief :-) i'm closing the issue, then.
btw, if you feel up for it, you could contribute the missing tag definitions to the koRpus.lang.nl package, so users can use both parameter files. it's not really complicated -- let me know and i'll walk you through the process.
from korpus.