I have a problem moving from a tm object to a koRpus object. I have to normalize a cor

Moving from tm object to koRpus object and vice versa,about undocumeantit/korpus

Comments (22)

unDocUMeantIt commented on July 19, 2024

i have started working on a compatibility package: https://github.com/unDocUMeantIt/tm.plugin.koRpus/tree/develop

the actual migration between koRpus and tm objects is not well tested at the moment, i myself am using the package mostly to call koRpus methods on full corpora instead of single texts. but i think that package would be a good place to start. feel free to report issues and feature requests. i can't promise anything, especially in the near future, but i'll sure try. koRpus and tm both have a totally different philosophy with regards to text/object handling, and different technical solutions as well (S4 vs. S3), so it's not really a trivial task getting them to communicate with each other.

you will have to update koRpus to a more recent version (>= 0.07-1) to be able to use it. but i recommend that anyway, because there are tons of improvements (i haven't had the time to go through the CRAN release procedure yet, but you can find up-to-date releases in my own repository: https://reaktanz.de/R/ )

from korpus.

giorjet commented on July 19, 2024

Thank you so much.
Definitely I'll try it


giorjet commented on July 19, 2024

Hi, I don't understand how to import from a tm corpus and how to export into a tm corpus... Any syntax suggestion?
Thank you


unDocUMeantIt commented on July 19, 2024

there currently is a stub function called kRpSource() that was supposed to turn a koRpus text object into a tm Source object. however, kRpSource() seems to be defunct for the time being. but you can use the following to achieve something similar:

kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

# then use the function like this on a tagged text object:
tmCorpusObject <- kRp2VCorpus(koRpusTaggedTextObject)

for the other way around, you could try to use the text "content" of tm corpus objects, e.g. treetag(content(tmCorpusObject[["1"]]), format="obj"). in the mid term, i'm planning to write a wrapper that does this internally so you can use tm methods on koRpus objects intuitively.


giorjet commented on July 19, 2024

Great! kRp2VCorpus works.
Thanks so much
But now I have another problem:
I tried:

tmCorpusObject1<-treetag(content(tmCorpusObject0[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

but this is the answer:

Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file,  : 
  object 'TT.call.file' not found

and now my syntax doesn't work also using as input a csv file:

tagged.korpus <- treetag(".\\TotPOS16.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                              TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))

Are there any syntax changes with the 0.07-2 update of koRpus? (With version 0.06-5 it worked.)

Thank You


giorjet commented on July 19, 2024

...another thing..
the function:

kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

tmCorpusObject.TotPOS16 <- kRp2VCorpus(tagged.TotPOS16results)

works very well, but it removes some blanks before or after punctuation or other characters like "-", which causes problems for my analysis.
Have you any solutions?
Thank You


unDocUMeantIt commented on July 19, 2024

right, there's a bug that was introduced with 0.07-1 only to the windows version of koRpus. it slipped through with changes needed to support portuguese, was discovered in january and is fixed in the develop branch.

i'll release a fixed version 0.10-1 as soon as i get roxygen2 running again (i have problems with roxygen2 6.0.1). see here how you can install the develop version directly from github:
https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github

i'm sorry for all the trouble -- i don't use windows and most windows users only run the CRAN versions of the package, OS specific bugs are sometimes hard to see.


unDocUMeantIt commented on July 19, 2024

the main problem with regards to kRp.text.paste() is this: when you give a text to TreeTagger, what you get back is a table with three columns, where the first column is the vector of all tokens in the original text. during this step, you lose information about spaces, paragraphs etc. -- have a look at taggedText(tagged.TotPOS16results).

kRp.text.paste() tries to recreate the original text from that vector of tokens, which of course can't be perfect because it doesn't know how many spaces there were. i have not yet found a better solution for this.
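To illustrate why the spacing cannot be recovered, here is a minimal base R sketch. The token vector is made up for the example and only stands in for the first column of taggedText(); joining with a single space is just one simple convention, not what kRp.text.paste() does internally.

```r
# Hypothetical token vector, standing in for taggedText(obj)[["token"]].
# Two different originals, e.g.
#   "ottimo rapporto-qualità-prezzo, consegna veloce."
#   "ottimo  rapporto-qualità-prezzo ,consegna veloce ."
# would both end up as this same vector, so the spacing is gone for good.
tokens <- c("ottimo", "rapporto-qualit\u00e0-prezzo", ",", "consegna", "veloce", ".")

# One predictable way to rebuild a string: a single space between all tokens.
# Punctuation gets a leading space too, but a hyphenated expression stays
# intact as long as the tagger returned it as a single token.
spaced <- paste(tokens, collapse = " ")
print(spaced)
```

With this convention the result is less natural to read than the output of kRp.text.paste(), but it is consistent, which may be all a later bag-of-words analysis needs.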


giorjet commented on July 19, 2024

Thank you, I understand.
Now kRp2VCorpus works!
I just have a problem due to my inexperience with R code:
with

tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

Only the first document gets tagged.
How can I change (Corpus.TotPOS16[["1"]]) so that all documents in the corpus are tagged?

As for the spaces, it would be enough for me to put a space between each lemma (punctuation included) and leave expressions such as "value-for-money" intact, without any break between the words.
If that's impossible, I will continue to do it with a text editor.


giorjet commented on July 19, 2024

I noticed another problem with
treetag(content(Corpus.TotPOS16[["1"]])...
It seems not to be able to manage the Italian accented characters (à, è, ...) that become something like "rapporto-qualitï¿½-prezzo"
If I use
treetag(".\\TotPOS16.csv",...
the problem does not exist

Thanks in advance


unDocUMeantIt commented on July 19, 2024

have you tried looping through the tm object with lapply()? that should get you a list of results, e.g.

# tag every document in the corpus; returns a list with one tagged object per document
myList <- lapply(Corpus.TotPOS16, function(x){
  return(treetag(content(x), format="obj"))
})

as for the encoding issue, you will have to try to find the exact step where special characters are being messed up.


giorjet commented on July 19, 2024

Thank you for the tip, I'll give it a try.
Re the encoding issue, it happens when I use "treetag" command with a corpus object (not with a csv file).


unDocUMeantIt commented on July 19, 2024

> it happens when I use "treetag" command with a corpus object (not with a csv file).

yes, but the question remains when exactly the character errors occur. e.g., are the characters already corrupted in the tm corpus? if so, what about the material used to make that object? and so on. at some point, things go wrong. we must find that specific point first, or we have little chance of fixing it.
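One way to narrow down the step where things break is to look at the declared encoding and the raw bytes of the text at each stage. The following is a minimal base R sketch; inspect_encoding() is a hypothetical helper written for this example and is not part of koRpus or tm.

```r
# Hypothetical helper (not part of koRpus or tm): report how R sees each string.
inspect_encoding <- function(x) {
  data.frame(
    declared   = Encoding(x),     # "UTF-8", "latin1" or "unknown"
    valid_utf8 = validUTF8(x),    # TRUE if the bytes form valid UTF-8
    bytes      = vapply(
      x,
      function(s) paste(as.character(charToRaw(s)), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# A correctly encoded string: "à" is stored as the two bytes c3 a0
good <- "qualit\u00e0"
# The same word with "à" as the single Latin-1 byte e0, which is not valid UTF-8
bad <- rawToChar(as.raw(c(0x71, 0x75, 0x61, 0x6c, 0x69, 0x74, 0xe0)))

res <- inspect_encoding(c(good, bad))
print(res)

# Applied to the objects in this thread, the check could be run on e.g.
#   inspect_encoding(content(Corpus.TotPOS16[["1"]]))
# and again on the lemma column of the tagged result.
```

Comparing the byte column before and after each step shows whether the bytes themselves change or only the way they are displayed.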


giorjet commented on July 19, 2024

Ok, now I understand.
The characters in tm were ok. Indeed, the csv file was produced from the tm corpus object with:

#from VCORPUS to DATAFRAME 
dataframeD610P<-data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)

#from DATAFRAME to XLSX 
#library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")

#open with excel 
#save in csv (UTF-8)

#import in KORPUS and lemmatization with KORPUS/TREETAGGER 

tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))

but if I use directly treetag with tm object

tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

the problem occurs


unDocUMeantIt commented on July 19, 2024

ok, i then suspect the internal workflow of treetag() to be the reason for the character glitches. a problem the function has to deal with is that TreeTagger can't use R character vectors directly. it needs a file to do the analysis. therefore what treetag(..., format="obj") does is first write the text to a temporary file, let TreeTagger analyse the file, and remove the temp file again. the "write text to file" part could be the problem here, if input and output encoding don't match.

does it change anything if you use enc2utf8(content(Corpus.TotPOS16[["1"]])) instead of just content(Corpus.TotPOS16[["1"]])), to force the text input into UTF-8?
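The "write text to file" step described above can be tried in isolation. This is a minimal base R sketch of a temp-file round trip that forces UTF-8 on the way out; it is not the code treetag() uses internally, only an illustration of the idea under that assumption.

```r
# Sample text with an accented character
txt <- "qualit\u00e0"

# Write the text as UTF-8 bytes, independent of the session's native encoding:
# convert to UTF-8 first, then write the bytes untouched to a binary connection.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(txt), con, useBytes = TRUE)
close(con)

# Inspect the bytes on disk: "à" should appear as c3 a0, followed by a newline (0a)
raw_bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
print(raw_bytes)

# Read the file back and declare the result as UTF-8
back <- readLines(tmp, encoding = "UTF-8")
unlink(tmp)

print(identical(back, txt))
```

If the bytes on disk are right but the text read back looks wrong, the problem is on the reading side rather than the writing side.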


giorjet commented on July 19, 2024

no changes :( ...

tmCorpusObject0@TT.res$lemma
[1] "qualitï¿½"      "scarso"         "qualitï¿½"      "disinteressare" "pericoloso"     "."


unDocUMeantIt commented on July 19, 2024

i've changed the way temp files are written a bit in the develop branch. could you please try the following:

1. with your current installation, does it help to explicitly use treetag(..., encoding="UTF-8")? it shouldn't have that effect, but i want to make sure that is the case.
2. install the current develop version: devtools::install_github("unDocUMeantIt/koRpus", ref="develop") (restart R afterwards to ensure you're using the new version)
3. try with the new treetag(), both with encoding="UTF-8" and without.

does this at least change anything, if not fix it?

what i've tried here is to force writing the temporary files with UTF-8 encoding if no other encoding is set. so using encoding="UTF-8" shouldn't really have an effect (but should you see different results, i'll have to check the code again...).

you could then also set debug=TRUE, which prevents the tempfile from being deleted automatically, so you can inspect it -- is what you find in that file UTF-8?


giorjet commented on July 19, 2024

with the standard version of koRpus, adding encoding="UTF-8":

tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["2355"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"), encoding="UTF-8",
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

doesn't work, resulting in this error:
Error in nchar(txt) : invalid multibyte string, element 1

with the dev version, adding encoding="UTF-8" works and it seems to recognize accented letters:

tmCorpusObject0@TT.res$lemma
  [1] "spesso"       "alcuni"       "del"          "prodotto"     "migliore"     "non"          "venire"       "piÃ¹"         "riassortiti" 
 [10] "e"            "si"           "faticare"     "a"            "trovare"      "di"           "simile"       "per"          "colore"      
 [19] "e"            "o"            "qualitÃ "     ","            "alcun"        "colore"       "vistare"      "da"           "catalogo"    
 [28] "differire"    "dal"          "prodotto"     "reale"        ","            "a"            "volta"        "per"          "la"          
 [37] "non"          "curanza"      "del"          "imballaggio"  "e"            "o"            "del"          "corriere"     "arrivare"    
 [46] "prodotto"     "con"          "la"           "scatola"      "rovinare"     "e"            "se"           "essere"       "regale"      
 [55] "per"          "altro"        "persona"      "non"          "essere"       "molto"        "presentabile" ","            "parlare"     
 [64] "anche"        "del"          "prodotto"     "mancare"      "che"          "a"            "volta"        "non"          "arrivare"    
 [73] "perchÃ©"      "esaurito"     "o"            "arrivare"     "in"           "un"           "secondo"      "momento"      "perchÃ©"     
 [82] "al"           "momento"      "non"          "disponbili"   "in"           "magazzino"    "se"           "servire"      "con"         
 [91] "urgenza"      "bisgona"      "sempre"       "preparare"    "un"           "piano"        "b"            "."            "."           
[100] "INTERRUPTw"   "."           
>

the accented letters are reprinted with combinations of characters but they should be right

Ã¹ = ù
Ã  = à
Ã© = é

(However, the result is the same even if not added encoding="UTF-8" )

but now the function

kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

# then use the function like this on a tagged text object:
tmCorpusObject1 <- kRp2VCorpus(tmCorpusObject0)

does not return the lemma but the token

lapply(tmCorpusObject1[1], as.character)
$1
[1] "spesso alcuni dei prodotti migliori non vengono piÃ¹ riassortiti e si fatica a trovarne di simili per colore e o qualitÃ , alcuni colori visti da catalogo differiscono dal prodotto reale, a volte per la non curanza degli imballaggi e o del corriere arrivano prodotti con le scatole rovinate e se sono regali per altre persone non Ã¨ molto presentabile, parlando anche dei prodotti mancanti che a volte non arrivano perchÃ© esauriti o arrivano in un secondo momento perchÃ© al momento non disponbili in magazzino se servono con urgenza bisgona sempre prepararsi un piano b. . INTERRUPTw. "


unDocUMeantIt commented on July 19, 2024

> the accented letters are reprinted with combinations of characters but they should be right

does this mean they look funny here on gitHub, or even in your R session? if R doesn't show them correctly, i'm afraid i'm not finished fixing this ;-) could be i've now fixed the output file, but that on windows, getting the tagged input back into koRpus is still broken.

> (However, the result is the same even if not added encoding="UTF-8")

yes, that's the way it should be.

> but now the function [...] does not return the lemma but the token

hm, i suppose it always has. because kRp.text.paste() always returns tokens (and i haven't touched that function or any object classes). if you only want the lemmata back, you could replace kRp.text.paste() with something like taggedText(obj)[["lemma"]] or paste(taggedText(obj)[["lemma"]]).


giorjet commented on July 19, 2024

> the accented letters are reprinted with combinations of characters but they should be right

even in my R session..
but I think the accents have been kept because if I transform the kRp.tagged object into a txt file:
write.table(tmCorpusObject0@TT.res$lemma, ".\\tmCorpusObject.txt")
I get:

"x"
"1" "spesso"
"2" "alcuni"
"3" "del"
"4" "prodotto"
"5" "migliore"
"6" "non"
"7" "venire"
"8" "più"
"9" "riassortiti"
"10" "e"
"11" "si"
"12" "faticare"
"13" "a"
"14" "trovare"
"15" "di"
"16" "simile"
"17" "per"
"18" "colore"
"19" "e"
"20" "o"
"21" "qualità"

with the right accented letters
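Text that looks garbled in the console but correct in the written file is consistent with strings that hold UTF-8 bytes which R does not treat as UTF-8. Whether that is what happens here is an assumption; the base R sketch below only reproduces the symptom and shows one possible repair.

```r
# A correctly encoded string: "ù" is stored as the two UTF-8 bytes c3 b9
x <- "pi\u00f9"
utf8_bytes <- charToRaw(x)

# Reading the same bytes as Latin-1 turns each byte into its own character:
# c3 becomes "Ã" and b9 becomes "¹", which gives the "piÃ¹" seen in the output above.
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"
garbled <- enc2utf8(misread)
print(garbled)

# The bytes were never damaged, so declaring them as UTF-8 restores the text.
repaired <- rawToChar(utf8_bytes)
Encoding(repaired) <- "UTF-8"
print(repaired)
```

On a character vector such as the lemma column, the same repair would be `Encoding(v) <- "UTF-8"`, provided the bytes really are UTF-8.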

> but now the function [...] does not return the lemma but the token

> hm, i suppose it always has

You are right.
and now with:

kRp3VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      paste(taggedText(obj)[["lemma"]])
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

I have the lemmas,
but still with combinations of characters in place of accented letters, and every token is a document. (Is it possible to separate the phrases, knowing that at the end of each sentence I added the word "INTERRUPTw"?)
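The "every token is a document" effect comes from paste() being called without collapse: it then returns its input vector unchanged, and VectorSource() treats each element of a vector as one document. A minimal base R sketch with a made-up lemma vector:

```r
# Hypothetical lemma vector, standing in for taggedText(obj)[["lemma"]]
lemma <- c("spesso", "alcuni", "del", "prodotto", "migliore")

# Without collapse, paste() returns one string per lemma
per_token <- paste(lemma)
print(length(per_token))

# With collapse, all lemmata are joined into a single string
one_text <- paste(lemma, collapse = " ")
print(one_text)

# In kRp3VCorpus() this would mean passing
#   paste(taggedText(obj)[["lemma"]], collapse = " ")
# to VectorSource(), so that one tagged text yields one document.
```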


unDocUMeantIt commented on July 19, 2024

sorry i didn't reply earlier!

when you're using tokenize() or treetag(), you shouldn't have to mark sentences manually. you can use the POS tags indicating sentence ending punctuation for that (try kRp.POS.tags("it", tags="sentc") or kRp.POS.tags("it", tags="sentc", list.tags=TRUE) to get the tags you need for this). adding your own token for that will probably only invalidate all statistics for the text, because it is counted as a word belonging to the next sentence.
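As a rough base R sketch of splitting on sentence-ending tags instead of a manual marker: the data frame below is made up and only mimics the token and tag columns of taggedText(), and the tag name "SENT" is an assumption here. The actual tags for your language should be taken from kRp.POS.tags() as described above.

```r
# Hypothetical tagged text, mimicking two columns of taggedText(obj)
tagged <- data.frame(
  token = c("Buono", "prodotto", ".", "Consegna", "lenta", "!"),
  tag   = c("ADJ", "NOM", "SENT", "NOM", "ADJ", "SENT"),
  stringsAsFactors = FALSE
)

# Assumed sentence-ending tag; replace with the tags reported by kRp.POS.tags()
sentc_tags <- c("SENT")

# A new sentence starts right after each sentence-ending token
ends <- tagged$tag %in% sentc_tags
sentence_id <- cumsum(c(0, head(ends, -1))) + 1

# Join the tokens of each sentence into one string
sentences <- unname(vapply(
  split(tagged$token, sentence_id),
  paste, character(1), collapse = " "
))
print(sentences)
```

Because the split relies on tags the tagger already produces, no extra token enters the text and the word counts stay untouched.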

but this seems to be a different issue than the one this started off with. can we close this ticket?


giorjet commented on July 19, 2024

Yes of course. Thanks
