Comments (22)
i have started working on a compatibility package: https://github.com/unDocUMeantIt/tm.plugin.koRpus/tree/develop
the actual migration between koRpus and tm objects is not well tested at the moment; i myself am using the package mostly to call koRpus methods on full corpora instead of single texts. but i think that package would be a good place to start. feel free to report issues and feature requests. i can't promise anything, especially in the near future, but i'll sure try. koRpus and tm have totally different philosophies with regard to text/object handling, and different technical solutions as well (S4 vs. S3), so getting them to communicate with each other is not a trivial task.
you will have to update koRpus to a more recent version (>= 0.07-1) to be able to use it. but i recommend that anyway, because there are tons of improvements (i haven't had the time to go through the CRAN release procedure yet, but you can find up-to-date releases in my own repository: https://reaktanz.de/R/ )
from korpus.
Thank you so much.
I'll definitely try it.
from korpus.
Hi, I don't understand how to import from a tm corpus and how to export into a tm corpus... Any suggestions on the syntax?
Thank you
from korpus.
there currently is a stub function called kRpSource() that was supposed to turn a koRpus text object into a tm Source object. however, kRpSource() seems to be defunct for the time being. but you can use the following to achieve something similar:
# both packages must be loaded for this to work
library(koRpus)
library(tm)
kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
# then use the function like this on a tagged text object:
tmCorpusObject <- kRp2VCorpus(koRpusTaggedTextObject)
for the other way around, you could try to use the text "content" of tm corpus objects, e.g. treetag(content(tmCorpusObject[["1"]]), format="obj"). in the mid term, i'm planning to write a wrapper that does this internally so you can use tm methods on koRpus objects intuitively.
from korpus.
Great! kRp2VCorpus works.
Thanks so much.
But now I have another problem. I tried:
tmCorpusObject1 <- treetag(content(tmCorpusObject0[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
but this is the answer:
Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file, :
  object 'TT.call.file' not found
and now my syntax doesn't work either when I use a csv file as input:
tagged.korpus <- treetag(".\\TotPOS16.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))
Are there any syntax changes with the 0.07-2 update of koRpus? (with version 0.06-5 it worked)
Thank You
from korpus.
...another thing. the function:
kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
tmCorpusObject.TotPOS16 <- kRp2VCorpus(tagged.TotPOS16results)
works very well, but it eliminates some blanks before or after punctuation or other characters like "-", which causes some problems for my analysis.
Do you have any solutions?
Thank You
from korpus.
right, there's a bug that was introduced with 0.07-1 only to the windows version of koRpus. it slipped through with changes needed to support portuguese, was discovered in january and is fixed in the develop branch.
i'll release a fixed version 0.10-1 as soon as i get roxygen2 running again (i have problems with roxygen2 6.0.1). see here how you can install the develop version directly from github:
https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github
i'm sorry for all the trouble -- i don't use windows, and since most windows users only run the CRAN versions of the package, OS specific bugs are sometimes hard to see.
from korpus.
the main problem with regards to kRp.text.paste() is this: when you give a text to TreeTagger, what you get back is a table with three columns, where the first column is the vector of all tokens in the original text. during this step, you lose information about spaces, paragraphs etc. -- have a look at taggedText(tagged.TotPOS16results).
kRp.text.paste() tries to recreate the original text from that vector of tokens, which of course can't be perfect because it doesn't know how many spaces there were. i have not yet found a better solution for this.
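to illustrate why the reconstruction can only be approximate, here is a minimal base R sketch. it is not the actual kRp.text.paste() code, and paste_tokens() is a made-up helper. it joins tokens with single spaces and then drops the space in front of common punctuation, so two inputs that differed only in spacing end up identical:

```r
# hypothetical helper, only for illustration; not part of koRpus
paste_tokens <- function(tokens){
  # join all tokens with exactly one space
  txt <- paste(tokens, collapse=" ")
  # remove whitespace directly before common punctuation marks
  gsub("\\s+([.,;:!?])", "\\1", txt)
}

tokens <- c("prodotto", "reale", ",", "a", "volte", ".")
rebuilt <- paste_tokens(tokens)
# the same tokens would result from "prodotto reale , a volte ." and
# from "prodotto reale, a volte.", so the original spacing cannot be recovered
```

if uniform spacing is all you need, paste(tokens, collapse=" ") alone puts one space between every token, punctuation included, and leaves hyphenated tokens untouched.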
from korpus.
Thank you, I understand.
Now kRp2VCorpus works!
I have just one problem due to my inexperience with R code. With
tmCorpusObject0 <- treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
only the first document is tagged.
How can I change (Corpus.TotPOS16[["1"]]) to have all the documents in the corpus tagged?
As for the problem with spaces, for me it would be enough to put a space between each lemma (also punctuation) and leave expressions such as "value-for-money" intact, without any break between the words.
But if that's impossible I will continue to do it with a text editor.
from korpus.
I noticed another problem with
treetag(content(Corpus.TotPOS16[["1"]])...
It does not seem to handle Italian accented characters (à, è, ...), which become something like "rapporto-qualit�-prezzo".
If I use
treetag(".\\TotPOS16.csv",...
the problem does not exist.
Thanks in advance
from korpus.
have you tried looping through the tm object with lapply()? that should get you a list of results, e.g.
myList <- lapply(Corpus.TotPOS16, function(x){
  return(treetag(content(x), format="obj"))
})
as for the encoding issue, you will have to try to find the exact step where special characters are being messed up.
from korpus.
Thank you for the tip, I'll give it a try.
Re the encoding issue, it happens when I use "treetag" command with a corpus object (not with a csv file).
from korpus.
it happens when I use "treetag" command with a corpus object (not with a csv file).
yes, but the question remains when exactly the character errors occur. e.g., are the characters already corrupted in the tm corpus? if so, what about the material used to make that object? and so on. at some point, things go wrong. we must find that specific point first, or we have little chance of fixing it.
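one way to narrow this down is to inspect the strings after each step. a minimal base R sketch (check_encoding() is a hypothetical helper, not a koRpus or tm function) that reports the declared encoding of each string and whether its bytes are valid UTF-8:

```r
# hypothetical helper for illustration; uses base R only
check_encoding <- function(x){
  data.frame(
    declared=Encoding(x),     # "unknown", "latin1", "UTF-8" or "bytes"
    valid_utf8=validUTF8(x),  # FALSE if the bytes are not valid UTF-8
    stringsAsFactors=FALSE
  )
}

ok <- "qualit\u00e0"  # marked as UTF-8 by R
# the same word as raw latin1 bytes, with no encoding declared
latin1_bytes <- rawToChar(as.raw(c(0x71, 0x75, 0x61, 0x6c, 0x69, 0x74, 0xe0)))
res <- check_encoding(c(ok, "prezzo", latin1_bytes))
```

the idea would be to call this on the document content, on the text right before it goes into treetag(), and on the lemma column afterwards, to see at which step the result changes.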
from korpus.
Ok, now I understand.
The characters in tm were ok. Indeed, the csv file was produced starting from the tm corpus object with:
# from VCorpus to data frame
dataframeD610P <- data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)
# from data frame to XLSX
# library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")
# open with Excel
# save as csv (UTF-8)
# import into koRpus and lemmatize with koRpus/TreeTagger
tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))
but if I use treetag directly with the tm object
tmCorpusObject0 <- treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
the problem occurs.
from korpus.
ok, i then suspect the internal workflow of treetag() to be the reason for the character glitches. a problem the function has to deal with is that TreeTagger can't use R character vectors directly. it needs a file to do the analysis. therefore what treetag(..., format="obj") does is first write the text to a temporary file, let TreeTagger analyse the file, and remove the temp file again. the "write text to file" part could be the problem here, if input and output encoding don't match.
does it change anything if you use enc2utf8(content(Corpus.TotPOS16[["1"]])) instead of just content(Corpus.TotPOS16[["1"]]), to force the text input into UTF-8?
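to make the "write text to file" step concrete, here is a minimal base R sketch of writing a character vector to a temporary file as UTF-8 regardless of the native encoding. this is not the code treetag() uses internally, and write_utf8_tempfile() is a made-up name:

```r
# hypothetical helper, not part of koRpus
write_utf8_tempfile <- function(txt){
  tmp <- tempfile(fileext=".txt")
  con <- file(tmp, open="w")
  on.exit(close(con))
  # convert to UTF-8 first, then write the bytes as they are,
  # so the native encoding of the session is bypassed
  writeLines(enc2utf8(txt), con, useBytes=TRUE)
  return(tmp)
}

txt <- c("rapporto-qualit\u00e0-prezzo", "perch\u00e9")
tmp <- write_utf8_tempfile(txt)
# declare the encoding when reading the file back in
back <- readLines(tmp, encoding="UTF-8")
unlink(tmp)
```

if the text survives this round trip but not the one through treetag(), the mismatch would have to be in how the temp file is written or how the tagger output is read back.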
from korpus.
no changes :( ...
[email protected]$lemma
[1] "qualit�" "scarso" "qualit�" "disinteressare" "pericoloso" "."
from korpus.
i've changed the way temp files are written a bit in the develop branch. could you please try the following:
- with your current installation, does it help to explicitly use treetag(..., encoding="UTF-8")? it shouldn't have that effect, but i want to make sure that is the case.
- install the current develop version: devtools::install_github("unDocUMeantIt/koRpus", ref="develop") (restart R afterwards to ensure you're using the new version)
- try with the new treetag(), both with encoding="UTF-8" and without.
does this at least change anything, if not fix it?
what i've tried here is to force writing the temporary files with UTF-8 encoding if no other encoding is set. so using encoding="UTF-8" shouldn't really have an effect (but should you see different results, i'll have to check the code again...).
you could then also set debug=TRUE, which prevents the tempfile from being deleted automatically, so you can inspect it -- is what you find in that file UTF-8?
from korpus.
with the standard version of koRpus, adding encoding="UTF-8":
tmCorpusObject0 <- treetag(content(Corpus.TotPOS16[["2355"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"), encoding="UTF-8",
  TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
doesn't work and results in this error:
Error in nchar(txt) : invalid multibyte string, element 1
with the dev version, adding encoding="UTF-8" works and it seems to recognize accented letters:
[email protected]$lemma
[1] "spesso" "alcuni" "del" "prodotto" "migliore" "non" "venire" "più" "riassortiti"
[10] "e" "si" "faticare" "a" "trovare" "di" "simile" "per" "colore"
[19] "e" "o" "qualità " "," "alcun" "colore" "vistare" "da" "catalogo"
[28] "differire" "dal" "prodotto" "reale" "," "a" "volta" "per" "la"
[37] "non" "curanza" "del" "imballaggio" "e" "o" "del" "corriere" "arrivare"
[46] "prodotto" "con" "la" "scatola" "rovinare" "e" "se" "essere" "regale"
[55] "per" "altro" "persona" "non" "essere" "molto" "presentabile" "," "parlare"
[64] "anche" "del" "prodotto" "mancare" "che" "a" "volta" "non" "arrivare"
[73] "perché" "esaurito" "o" "arrivare" "in" "un" "secondo" "momento" "perché"
[82] "al" "momento" "non" "disponbili" "in" "magazzino" "se" "servire" "con"
[91] "urgenza" "bisgona" "sempre" "preparare" "un" "piano" "b" "." "."
[100] "INTERRUPTw" "."
>
the accented letters are reprinted with combinations of characters but they should be right
ù = ù
à = à
é = é
(However, the result is the same even without adding encoding="UTF-8")
but now the function
kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
# then use the function like this on a tagged text object:
tmCorpusObject1 <- kRp2VCorpus(tmCorpusObject0)
does not return the lemmas but the tokens:
lapply(tmCorpusObject1[1], as.character)
$1
[1] "spesso alcuni dei prodotti migliori non vengono più riassortiti e si fatica a trovarne di simili per colore e o qualità , alcuni colori visti da catalogo differiscono dal prodotto reale, a volte per la non curanza degli imballaggi e o del corriere arrivano prodotti con le scatole rovinate e se sono regali per altre persone non è molto presentabile, parlando anche dei prodotti mancanti che a volte non arrivano perché esauriti o arrivano in un secondo momento perché al momento non disponbili in magazzino se servono con urgenza bisgona sempre prepararsi un piano b. . INTERRUPTw. "
from korpus.
the accented letters are reprinted with combinations of characters but they should be right
does this mean they look funny here on gitHub, or even in your R session? if R doesn't show them correctly, i'm afraid i'm not finished fixing this ;-) could be i've now fixed the output file, but that on windows, getting the tagged input back into koRpus is still broken.
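a pattern like "Ã" followed by a second character is what UTF-8 bytes typically look like when they are declared as latin1. that this is what happens here is only an assumption on my part, but the effect itself can be reproduced in a minimal base R sketch:

```r
original <- "qualit\u00e0"
# the accented letter is stored as two bytes (c3 a0) in UTF-8
utf8_bytes <- charToRaw(original)

# same bytes, but wrongly declared as latin1
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"
# each byte is now taken as one character, which gives the garbled form
garbled <- enc2utf8(misread)

# declaring the correct encoding repairs the string without changing any byte
repaired <- misread
Encoding(repaired) <- "UTF-8"
```

if that is the case, the data itself would be intact and only the encoding mark of the strings read back into R would be wrong.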
(However, the result is the same even if not added encoding="UTF-8")
yes, that's the way it should be.
but now the function [...] does not return the lemma but the token
hm, i suppose it always has. because kRp.text.paste() always returns tokens (and i haven't touched that function or any object classes). if you only want the lemmata back, you could replace kRp.text.paste() with something like taggedText(obj)[["lemma"]] or paste(taggedText(obj)[["lemma"]]).
from korpus.
the accented letters are reprinted with combinations of characters but they should be right
even in my R session..
but I think the accents have been kept because if I transform the kRp.tagged object into a txt file:
write.table([email protected]$lemma, ".\\tmCorpusObject.txt")
I get:
"x"
"1" "spesso"
"2" "alcuni"
"3" "del"
"4" "prodotto"
"5" "migliore"
"6" "non"
"7" "venire"
"8" "più"
"9" "riassortiti"
"10" "e"
"11" "si"
"12" "faticare"
"13" "a"
"14" "trovare"
"15" "di"
"16" "simile"
"17" "per"
"18" "colore"
"19" "e"
"20" "o"
"21" "qualità"
with the right accented letters
but now the function [...] does not return the lemma but the token
hm, i suppose it always has
You are right.
and now with:
kRp3VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      paste(taggedText(obj)[["lemma"]])
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
I have the lemmas..
but still with combinations of characters in place of accented letters, and every token is a document (Is it possible to separate the sentences, knowing that at the end of each sentence I added the word "INTERRUPTw"?)
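A possible reason why every token ends up as its own document: paste() without collapse returns a vector with one element per token, and VectorSource() treats each element of a vector as a document. A minimal base R sketch, assuming one string per text is wanted (the lemma vector is made-up sample data):

```r
# sample data standing in for taggedText(obj)[["lemma"]]
lemma <- c("spesso", "alcuni", "del", "prodotto", "migliore")

# without collapse: still one element per token
per_token <- paste(lemma)
# with collapse: a single string with one space between lemmas
one_text <- paste(lemma, collapse=" ")

n_per_token <- length(per_token)
n_one_text <- length(one_text)
```

In kRp3VCorpus() this would mean passing paste(taggedText(obj)[["lemma"]], collapse=" ") to VectorSource().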
from korpus.
sorry i didn't reply earlier!
when you're using tokenize() or treetag(), you shouldn't have to mark sentences manually. you can use the POS tags indicating sentence ending punctuation for that (try kRp.POS.tags("it", tags="sentc") or kRp.POS.tags("it", tags="sentc", list.tags=TRUE) to get the tags you need for this). adding your own token for that will probably only invalidate all statistics for the text, because it is counted as a word belonging to the next sentence.
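a minimal base R sketch of that idea, using a made-up data frame in place of taggedText(obj). the column names "lemma" and "tag", the helper split_sentences() and the tag "SENT" are assumptions for illustration; take the real tags from kRp.POS.tags():

```r
# made-up sample standing in for the tagged text table
tagged <- data.frame(
  lemma=c("prodotto", "arrivare", ".", "servire", "urgenza", "."),
  tag=c("NOM", "VER:pres", "SENT", "VER:infi", "NOM", "SENT"),
  stringsAsFactors=FALSE
)

# hypothetical helper: one string of lemmas per sentence
split_sentences <- function(tagged, sentc_tags="SENT"){
  is_end <- tagged$tag %in% sentc_tags
  # a token belongs to sentence n+1 if n sentence endings came before it
  sent_id <- cumsum(c(0, head(is_end, -1))) + 1
  vapply(split(tagged$lemma, sent_id), paste, character(1), collapse=" ")
}

sentences <- split_sentences(tagged)
```

this keeps the sentence ending punctuation with its own sentence and needs no extra marker token in the text.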
but this seems to be a different issue than the one this started off with. can we close this ticket?
from korpus.
Yes of course. Thanks
from korpus.