
koRpus

koRpus is an R package for text analysis. This includes, amongst others, a wrapper for the POS tagger TreeTagger, functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type token ratio, HD-D/vocd-D, MTLD) and readability (e.g., Flesch, SMOG, LIX, Dale-Chall, Tuldava).

koRpus also includes a plugin for RKWard, a powerful GUI and IDE for R, providing graphical dialogs for its basic features. To make full use of this feature, please install RKWard (plugins are detected automatically).

More information on koRpus is available on the project homepage.

Installation

There are three easy ways of getting koRpus:

Stable releases via CRAN

The latest release that is considered stable for productive work can be found on the CRAN mirrors, which means you can install it from a running R session like this:

install.packages("koRpus")

The CRAN packages usually lag a bit behind the current state of the package and are only updated after a significant number of changes or important bug fixes.

Development releases via the project repository

Between stable CRAN releases, several testing or development versions are usually published on the project's own repository. These releases should also work without problems, but they are intended to test new features or supposed bug fixes and to gather feedback before the next release goes to CRAN.

Installation is fairly easy, too:

install.packages("koRpus", repos=c(getOption("repos"), reaktanz="https://reaktanz.de/R"))

To automatically get updates, consider adding the repository to your R configuration. You might also want to subscribe to the package's RSS feed to get notified of new releases.
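Adding the repository to your R configuration might look like the following sketch, e.g. placed in your `~/.Rprofile` (the file location and the `"reaktanz"` entry name are conventions of this example, not requirements):

```r
# Sketch: append the project repository to the repositories R already knows,
# so install.packages() and update.packages() also see koRpus releases there.
local({
  r <- getOption("repos")
  r["reaktanz"] <- "https://reaktanz.de/R"
  options(repos = r)
})
# afterwards, getOption("repos") includes the "reaktanz" entry
```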

If you're running a Debian based operating system, you might be interested in the precompiled *.deb packages.

Installation via GitHub

To install it directly from GitHub, you can use install_github() from the devtools package:

devtools::install_github("unDocUMeantIt/koRpus") # stable release
devtools::install_github("unDocUMeantIt/koRpus", ref="develop") # development release

Installing language support

koRpus does not support any particular language out-of-the-box. Therefore, after installing the package you will also have to install at least one language support package to really make use of it. You can find these in the l10n repository; they are called koRpus.lang.*.

The most straightforward way to get these packages is the function install.koRpus.lang(). Here's an example of how to install support for English and German:

library(koRpus)
install.koRpus.lang(lang=c("en", "de"))

There are also precompiled Debian packages.

Contributing

To ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please either subscribe to the koRpus-dev mailing list, or use the issue tracker on GitHub.

Branches

Please note that all development happens in the develop branch. Pull requests against the master branch will be rejected, as it is reserved for the current stable release.

Licence

Copyright 2012-2021 Meik Michalke [email protected]

koRpus is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

koRpus is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with koRpus. If not, see https://www.gnu.org/licenses/.

koRpus's People

Contributors

adamspannbauer, undocumeantit

koRpus's Issues

download of package ‘koRpus.lang.de’ failed

I am trying to run the following script:

tagged = treetag("message.txt", treetagger=".../treetagger/cmd/tree-tagger-german", lang="de", TT.options=list(tokenizer = "tree-tagger-german" , tagger = "tree-tagger" , params = "german-utf8.par" ) , format = "file" , stopwords=stopwords::stopwords("de", source = "stopwords-iso"))

Since today, I get the following error message:

Error: Unknown tag definition requested: de

The command "available.koRpus.lang()" tells me that I have not installed "koRpus.lang.de"

When running "install.koRpus.lang("de")" or "install.packages("koRpus.lang.de", repos="https://undocumeantit.github.io/repos/l10n/")", I get the following error message:

There are binary versions available but the source versions are later:
binary source needs_compilation
sylly.de 0.1-2 0.1-3 FALSE
koRpus.lang.de 0.1-1 0.1-2 FALSE
installing the source packages ‘sylly.de’, ‘koRpus.lang.de’
trying URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/sylly.de_0.1-3.tar.gz'
Warning in install.packages :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/sylly.de_0.1-3.tar.gz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/sylly.de_0.1-3.tar.gz'
Warning in install.packages :
download of package ‘sylly.de’ failed
trying URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/koRpus.lang.de_0.1-2.tar.gz'
Warning in install.packages :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/koRpus.lang.de_0.1-2.tar.gz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/koRpus.lang.de_0.1-2.tar.gz'
Warning in install.packages :
download of package ‘koRpus.lang.de’ failed

I also tried installing other languages like Spanish, French or Italian, but I always get the same error message.

I work with R version 3.5.1 on a Mac (macOS Mojave Version 10.14)

Incorrect calculation of MTLD?

Thank you for a useful package.

I noticed a discrepancy in the calculation of MTLD between the algorithm explained in McCarthy and Jarvis's (2010) paper and the way koRpus::MTLD.calc() calculates it.

On McCarthy and Jarvis (2010, p.385):

The total number of words in the text is divided by the total factor count. For example, if the text = 340 words and the factor count = 4.404, then the MTLD value is 77.203. Two such MTLD values are calculated, one for forward processing and one for reverse processing. The mean of the two values is the final MTLD value.

So it is the two MTLD values (i.e., forward and reverse) whose mean should be calculated.

On the other hand, the relevant part in koRpus::MTLD.calc() calculates the mean of two factors (i.e., denominators), and the number of tokens is divided by the mean factor:

mtld.res.mean <- mean(c(mtld.res.forw[["factors"]], mtld.res.back[["factors"]]))
mtld.res.value <- num.tokens/mtld.res.mean

The two methods yield different results as mean(c(N/f1, N/f2)) and N/mean(c(f1, f2)) are different.
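A minimal numeric check makes the discrepancy concrete. The forward factor count 4.404 is taken from the paper's example; the reverse factor count is made up for illustration:

```r
num.tokens   <- 340
factors.forw <- 4.404  # forward factor count (example from the paper)
factors.back <- 4.200  # reverse factor count (hypothetical)

# As described by McCarthy & Jarvis (2010): mean of the two MTLD values
mtld.paper <- mean(c(num.tokens / factors.forw, num.tokens / factors.back))

# As computed in koRpus::MTLD.calc(): tokens divided by the mean factor count
mtld.korpus <- num.tokens / mean(c(factors.forw, factors.back))

mtld.paper   # approx. 79.077
mtld.korpus  # approx. 79.033
```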

If this is not by design, would you mind fixing the issue when you can?

Working in Python?

It looked like koRpus had really good lemmatization functionality, but I'm working in Python. Is there a good Python wrapper for this? Or is it derived from something that might have a Python wrapper?

Error: english-lexicon.txt not found

Hello, I am having trouble using TreeTagger via koRpus. After looking for a while, I cannot find the .txt file listed in the error below, nor have I found a workaround. Thanks for your help.

I am using the following versions -
R: 3.6.1
koRpus: 0.13-3
Treetagger: Mac OSX 3.2.3

tagged.text <- treetag("unabomber_manifesto.txt", treetagger="manual", lang="en",
TT.options=list( path="~/Downloads/mytreetagger/bin ",preset="en"))

Error: None of the following files were found, please check your TreeTagger installation!
/Users/tuc50262/Downloads/mytreetagger/bin/lib/english-lexicon.txt
/Users/tuc50262/Downloads/mytreetagger/bin/lib/english-lexicon

No S3 method for read.udhr

Function guess.lang attempts to call read.udhr to read the zip file. There is no read.udhr S3 method, so it fails.

Error in reading corpus database

I downloaded the following file from this link (https://wortschatz.uni-leipzig.de/en/download/English):

eng_news_2020_1M.tar.gz

Then I ran the following code; however, it returned an error message.

LCC.en <- read.corp.LCC(here('data','eng_news_2020_1M.tar.gz'))

Fetching needed files from LCC archive... done.
Warning messages:
1: In readLines(LCC.file.con, n = n) :
  invalid input found on input connection 'C:\Users\cengiz\AppData\Local\Temp\RtmpWiHf2g\koRpus.LCC575c134d78b1/eng_news_2020_1M/eng_news_2020_1M-words.txt'
2: In readLines(LCC.file.con, n = n) :
  incomplete final line found on 'C:\Users\cengiz\AppData\Local\Temp\RtmpWiHf2g\koRpus.LCC575c134d78b1/eng_news_2020_1M/eng_news_2020_1M-words.txt'
3: In matrix(unlist(strsplit(rL.words, "\t")), ncol = 3, byrow = TRUE,  :
  data length [133] is not a sub-multiple or multiple of the number of rows [45]
4: In create.corp.freq.object(matrix.freq = table.words, num.running.words = num.running.words,  :
  NAs introduced by coercion

Is there a way to fix this? I am using a Windows PC.

Thank you.

URLs and sequences of punctuation in documents cause some readability measures to fail

Our corpus contains some documents that contain URLs and more rarely sequences of punctuation that cause some of the readability measures to fail.

For a minimal example, I present a length-3 text vector, with the first element an example that should work, to make sure I am calling the functions correctly. The second element contains a couple of words and a URL, and the third contains a couple of words and a sequence of punctuation. Our actual documents are much longer, but these examples show the issues we have encountered. I then show output from these examples for ARI, flesch.kincaid and FOG: ARI works in all cases, while the last two fail in different ways. flesch.kincaid fails on the first two examples and works on the third; FOG works for the first and third examples.

The transcript:

require("koRpus.lang.en", quietly=T)
koRpus::set.kRp.env(lang="en")
responses <- c("hi mom", "this fails http://www.thedailybeast.com/articles/2012/06/28/did-chief-justice-roberts-take-a-cue-from-two-centuries-ago.html", "this fails ,,..$%#@")
tt <- lapply(responses, koRpus::tokenize, format="obj")
ARI(tt[[1]])

Automated Readability Index (ARI)
Parameters: default
Grade: -8.65

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

ARI(tt[[2]])

Automated Readability Index (ARI)
Parameters: default
Grade: 19.71

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

ARI(tt[[3]])

Automated Readability Index (ARI)
Parameters: default
Grade: -5.8

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

flesch.kincaid(tt[[1]])
Hyphenation (language: en)
Error in validObject(.Object) :
invalid class “kRp.hyphen” object: invalid object for slot "hyphen" in class "kRp.hyphen": got class "NULL", should be or extend class "data.frame"
In addition: Warning message:
In mean.default(hyph.df$syll, na.rm = TRUE) :
argument is not numeric or logical: returning NA
flesch.kincaid(tt[[2]])
Hyphenation (language: en)
|============================================================= | 88%Error in all.patterns[[word.length]] : subscript out of bounds
flesch.kincaid(tt[[3]])
Hyphenation (language: en)
|======================================================================| 100%

Flesch-Kincaid Grade Level
Parameters: default
Grade: -2.62
Age: 2.38

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

FOG(tt[[1]])
Hyphenation (language: en)

Gunning Frequency of Gobbledygook (FOG)
Parameters: default
Grade: 0.8

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

FOG(tt[[2]])
Hyphenation (language: en)
|========================================================== | 83%Error in all.patterns[[word.length]] : subscript out of bounds
FOG(tt[[3]])
Hyphenation (language: en)

Gunning Frequency of Gobbledygook (FOG)
Parameters: default
Grade: 1.2

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: r-project/R-3.6.2/lib/libRblas.so
LAPACK: r-project/R-3.6.2/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] koRpus.lang.en_0.1-3 koRpus_0.11-5 sylly_0.1-5

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2 data.table_1.12.8 sylly.en_0.1-3

I hope this information is helpful. Thank you for your time.

Jen

Error in path.expand(path) : argument 'path' incorrect

Dear koRpus team,

I have been trying to run the treetag function in R for over 2 days and cannot go past the following error:
Error in path.expand(path) : argument 'path' incorrect

I used the following code:

library("koRpus");library("koRpus.lang.fr");
TEST = treetag("GNGResp.txt", treetagger = "manual", lang = "fr", TT.options = list(path="C:/TreeTagger", preset="fr"))

I installed Treetagger strictly following their INSTALL.txt explanations and TreeTagger is working when I call it from windows CMD.

Also, this installation was done in the root directory of the C:/ drive, as recommended.

Best wishes,

Corentin

- Microsoft Windows 10 Family v. 10.0.18363 
-  R version 4.0.2 
- koRpus version 0.13.1 

Issue on Windows

Hi,

I'm using R and the textstem package in Windows (10, 64-bit), which relies on koRpus. I spotted what appears to be two bugs:

  1. In the shell(sys.tt.call) for when it's not unix.OS, translate = TRUE converts the /'s in the regex substitution in TT.filter.command's "| perl -pe ..." into backslashes, which breaks perl. The path to TreeTagger by that time contains double backslashes due to normalizePath being called earlier, so using translate=FALSE keeps the substitution's /'s as-is and the path is still fine.
  2. The line 'cat(paste(tknz.results, collapse = "\n"), "\n", file = tknz.tempfile)' requires a sep="" or else there's a hidden space appended to the last line/token, resulting in TreeTagger's tagging that last token as .
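The second point can be reproduced in plain R: without sep="", cat() joins its arguments with a space, so a stray space ends up after the last token (the file names here are just temporary files for the demonstration):

```r
tokens    <- c("The", "dog")
f.default <- tempfile()
f.fixed   <- tempfile()

# default sep=" " inserts a space between the pasted tokens and the final "\n"
cat(paste(tokens, collapse = "\n"), "\n", file = f.default)
# with sep="" the last token line ends cleanly
cat(paste(tokens, collapse = "\n"), "\n", file = f.fixed, sep = "")

readLines(f.default)  # "The" "dog "  <- trailing space on the last token
readLines(f.fixed)    # "The" "dog"
```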

I am using R from a Cygwin environment, but I believe these errors would occur from a normal Windows environment.

Thanks!

Best,
Jay

option lexicon

I'm trying to apply an auxiliary lexicon (lexicon = "namefile.txt"), but there appears to be a bug in the function, because the path is written as follows:

C:\TreeTagger\lib/file.txt

Therefore, on Windows, even when the file does exist in the folder, I receive the error:

Error in file.info(x, extra_cols = FALSE) : argomento 'filename' non valido (non valid filename).

Thank you

Improperly created matrix due to bad strsplit() when processing certain texts with unusual UTF-8 characters

R version 3.2.3
koRpus version 0.06-4 (latest dev branch)

When processing certain texts, koRpus::treetag returns

"Error: Invalid tag(s) found: ¶¥¥, , @card@, *ÿÀ3xÿÀ3z, Mac This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maintainer!"

The specific "invalid tags" vary based on the text itself. Prior to the error, it throws a warning:

Warning in matrix(unlist(strsplit(tagged.text, "\t")), :
data length [6736] is not a sub-multiple or multiple of the number of rows [2246]

which seems to suggest that strsplit is not properly splitting the results from the external tree-tagger program, leading to the matrix call building the columns incorrectly (see details below and attached file for examples of wrong columns in returned matrix).

I first checked to ensure that the external tree-tagger program was returning a valid tagged output with these texts. It was fine.

Upon further testing, I was able to reproduce it with 3 different texts, all with similar strsplit-related warnings screwing up which columns are token, tag, and lemma. Unsurprisingly, that then screws up the tag reading since it's reading a value in a column that was populated incorrectly.

The statement is on line 425 of treetag.R in the koRpus source code:
tagged.mtrx <- matrix(unlist(strsplit(tagged.text, "\t")), ncol=3, byrow=TRUE, dimnames=list(c(),c("token","tag","lemma")))

strsplit is almost certainly the culprit, probably mis-splitting certain special cases. The texts in question all have bug report data in them including special characters (UTF-8). I confirmed that the text is being passed to koRpus::treetag with UTF-8 encoding. I also experimented adding the useBytes=TRUE and/or perl=TRUE options to strsplit and recompiling. No change.
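The failure mode can be illustrated without TreeTagger: if one output line loses a field (or a token smuggles in an extra tab), unlist(strsplit(...)) no longer has a length divisible by 3, and matrix() recycles values so the columns shift. The lines below are hypothetical TreeTagger output, not from the actual bug reports:

```r
# Hypothetical TreeTagger output lines: token \t tag \t lemma
good <- c("The\tDT\tthe", "dog\tNN\tdog")
bad  <- c("The\tDT\tthe", "dog\tNN")  # lemma field lost for one line

matrix(unlist(strsplit(good, "\t")), ncol = 3, byrow = TRUE,
       dimnames = list(c(), c("token", "tag", "lemma")))
# well-formed: 2 rows, columns line up

m <- suppressWarnings(
  matrix(unlist(strsplit(bad, "\t")), ncol = 3, byrow = TRUE)
)
m  # 5 values recycled into 6 cells: the second row's lemma column is garbage
```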

In searching for an alternative to strsplit, I came across the package stringi, which appears to be specifically targeted at proper UTF-8 handling, possibly better (and almost certainly faster) than strsplit: Stringi homepage. In particular, its stri_split functions look like they may provide an inline replacement for the whole matrix line, as they can return a matrix constructed by row in a single command. I haven't yet experimented with it, but I wonder if it may be a solution.

In summary, these warnings & resulting errors appear only on a very small portion of the bug reports that I'm processing (maybe 0.1% or so, of around 1M--the three provided are a tiny subset; there are actually thousands that won't process in my whole dataset), which leads me to suspect it's related to strsplit not properly handling a rare/unused UTF-8 character that only appears in a small portion of bug reports, possibly when the bug reports include hex-encoded binary dumps for troubleshooting (though not even in all such cases).

Attached you will find a file with the 3 texts that trigger the error/warning, organized as bug_id, full text content enclosed in ========= for endpoint clarity, followed by the warning + error messages. Each bug is separated by ---------------- for start/endpoint clarity. If you wish, I can also provide similar texts that process fine and don't result in the warning + error.

Please let me know if I can provide any other useful information for debugging.

Thanks once again for the excellent package! ^_^

bug descriptions triggering koRpus_treetag warning & error.txt

Flesch Formula multiplier

I noticed that the multiplier of the average syllable length in the Documentation of koRpus (page 25 and 87) is 84.6. On Wikipedia and in the original 1948 paper by Flesch it was 85.6.
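For reference, the commonly cited Flesch Reading Ease formula is RE = 206.835 - 1.015 * (words/sentences) - m * (syllables/words), where the disputed multiplier m is 84.6 or 85.6. A quick sketch (with hypothetical text statistics) shows how much the two variants differ:

```r
# Flesch Reading Ease; asw.mult is the disputed syllables-per-word multiplier
flesch <- function(words, sentences, syllables, asw.mult = 84.6) {
  206.835 - 1.015 * (words / sentences) - asw.mult * (syllables / words)
}

# hypothetical text: 100 words, 5 sentences, 150 syllables
flesch(100, 5, 150)                   # multiplier 84.6 (as documented in koRpus)
flesch(100, 5, 150, asw.mult = 85.6)  # multiplier 85.6 (as the issue suggests)
# the two scores differ by exactly (85.6 - 84.6) * ASW = 1.5 here
```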

Incorrect lemma and tag/wclass for final word when using koRpus::treetag

When I run koRpus::treetag, the final word always results in an "<unknown>" lemma even if it should be known. Also, the word doesn't appear to be consistently classified correctly according to POS.

text = "This is a test"
test = treetag(text, treetagger = "manual", format = "obj", TT.tknz=FALSE, lang="en", TT.options = list(path="./TreeTagger", preset="en"))

#   doc_id token tag     lemma lttr     wclass desc stop stem idx sntc
# 1   <NA>  This  DT      this    4 determiner   NA   NA   NA   1   NA
# 2   <NA>    is VBZ        be    2       verb   NA   NA   NA   2   NA
# 3   <NA>     a  DT         a    1 determiner   NA   NA   NA   3   NA
# 4   <NA> test   NN <unknown>    5       noun   NA   NA   NA   4   NA

If the final word is replaced with something else, the same result happens to the new final word (whereas the previous final word now works fine). In this example, "again" is classified as a noun.

text = "This is a test again"

#   doc_id  token tag     lemma lttr     wclass desc stop stem idx sntc
# 1   <NA>   This  DT      this    4 determiner   NA   NA   NA   1   NA
# 2   <NA>     is VBZ        be    2       verb   NA   NA   NA   2   NA
# 3   <NA>      a  DT         a    1 determiner   NA   NA   NA   3   NA
# 4   <NA>   test  NN      test    4       noun   NA   NA   NA   4   NA
# 5   <NA> again   NN <unknown>    6       noun   NA   NA   NA   5   NA

I'm using:

sessionInfo()
# R version 3.5.2 (2018-12-20)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS Mojave 10.14.2

packageVersion("koRpus")
# [1] ‘0.11.5’

It looks like it may be adding an extra space to the end of the final word (based on the token output and lttr result), but I don't know if this is causing the issue.

Error: Specified directory cannot be found: ~/bin/treetagger/bin

It seems that no matter where I place the document files, I continue to receive the same error in R:

Error: Specified directory cannot be found: ~/bin/treetagger/bin

The code I am using is below. I have installed TreeTagger using the terminal with no errors.

tagged.text <- treetag( "sample_text.txt", treetagger="manual", lang="en", TT.options=list( path="~/bin/treetagger", preset="en" ), doc_id="sample" )

Can anyone help with this issue?

last character being truncated in koRpus::treetag

The last character is being truncated from the input object of koRpus::treetag() when TT.tknz is FALSE. (see "dog" being truncated to "do" in the last row of the output table). This issue is not present when TT.tknz is set to TRUE.

example

doc <- "The quick brown fox jumped over the lazy dog"

# pre bug fix in R/treetag.R
koRpus::treetag(doc, treetagger = "manual", format = "obj", 
                encoding = "UTF-8", lang = "en", TT.tknz = FALSE, 
                TT.options = list(path = "/u/application/TreeTagger", preset = "en"))
#    token tag lemma lttr      wclass                                             desc stop stem
# 1    The  DT   the    3  determiner                                       Determiner   NA   NA
# 2  quick  JJ quick    5   adjective                                        Adjective   NA   NA
# 3  brown  JJ brown    5   adjective                                        Adjective   NA   NA
# 4    fox  NN   fox    3        noun                           Noun, singular or mass   NA   NA
# 5 jumped VBD  jump    6        verb                      Verb, past tense of "to be"   NA   NA
# 6   over  IN  over    4 preposition         Preposition or subordinating conjunction   NA   NA
# 7    the  DT   the    3  determiner                                       Determiner   NA   NA
# 8   lazy  JJ  lazy    4   adjective                                        Adjective   NA   NA
# 9     do VBP    do    2        verb Verb, non-3rd person singular present of "to be"   NA   NA

sessionInfo()
# R version 3.3.3 (2017-03-06)
# Platform: x86_64-redhat-linux-gnu (64-bit)
# Running under: Red Hat Enterprise Linux Server 7.3 (Maipo)

packageVersion("koRpus")
# [1] ‘0.10.2’

The issue seems to stem from this line in R/treetag.R, because no newline is written after the contents of the file are cat out. Adjusting the function to cat a new line after the user's input object appears to fix the issue. I will submit a pull request with the fix I implemented for review.

Missing tags for Danish

Hi,

I'm trying to apply POS-tagging on a sample of Danish text with TreeTagger and the "official" Danish parameter file. However, it seems that tags produced are not recognized by koRpus.

I'm wondering if this is simply due to the fact that there is no Danish language pack?

library("koRpus")
library("koRpus.lang.en")
  
treetag("~/bin/treetagger/test.txt",
        treetagger = "manual",
        lang = "en",
        TT.options = list(path = "~/treetagger",
                          tokenizer = "tree-tagger-danish",
                          tagger = "tree-tagger",
                          params = "danish.par",
                          abbrev = "danish-abbreviations")
        )
#> Warning: Invalid tag(s) found: PM:s-un:--:----, VF:----:sa:----, PI:s-uc:--:----, AD:----:--:p---, AC:siu§:--:p---, NC:siuc:--:----, T-:----:--:----, NC:siun:--:----
#>   This is probably due to a missing tag in kRp.POS.tags() and
#>   needs to be fixed. It would be nice if you could forward the
#>   above warning dump as a bug report to the package maintaner!
#>   doc_id token             tag lemma lttr   wclass desc stop stem idx sntc
#> 1   <NA> Dette PM:s-un:--:---- denne    5  unknown   NA   NA   NA   1    1
#> 2   <NA>    er VF:----:sa:----  være    2  unknown   NA   NA   NA   2    1
#> 3   <NA>    en PI:s-uc:--:----    en    2  unknown   NA   NA   NA   3    1
#> 4   <NA> meget AD:----:--:p--- megen    5  unknown   NA   NA   NA   4    1
#> 5   <NA>  kort AC:siu§:--:p---  kort    4  unknown   NA   NA   NA   5    1
#> 6   <NA> tekst NC:siuc:--:---- tekst    5  unknown   NA   NA   NA   6    1
#> 7   <NA>    på T-:----:--:----    på    2  unknown   NA   NA   NA   7    1
#> 8   <NA> dansk NC:siun:--:---- dansk    5  unknown   NA   NA   NA   8    1
#> 9   <NA>     .            SENT     .    1 fullstop   NA   NA   NA   9    1

incomplete import of LCC corpus

Hey,

I want to do a frequency analysis with my text, using an LCC corpus [http://wortschatz.uni-leipzig.de/en/download/].
Unfortunately koRpus only reads in some of the first hundred lines, the frequency analysis then fails (indices are mostly "NA"). I've tried different copora, the import is always incomplete, the number of lines that are read in varies with each corpus, I don't see systematics.
When reading in the corpus in I get the following output.

LCC.data <- read.corp.LCC("deu_news_2015_1M.tar")
output:
Fetching needed files from LCC archive... done.
Warning messages:
1: In readLines(LCC.file.con, n = n) :
invalid input found on input connection 'C:\Users\rieke\AppData\Local\Temp\RtmpKwiXOi\koRpus.LCC3d9049ac4d87/deu_news_2015_1M/deu_news_2015_1M-words.txt'
2: In readLines(LCC.file.con, n = n) :
incomplete final line found on 'C:\Users\rieke\AppData\Local\Temp\RtmpKwiXOi\koRpus.LCC3d9049ac4d87/deu_news_2015_1M/deu_news_2015_1M-words.txt'
3: In matrix(unlist(strsplit(rL.words, "\t")), ncol = 4, byrow = TRUE, :
data length [77246] is not a sub-multiple or multiple of the number of rows [19312]
4: In read.corp.LCC("deu_news_2015_1M.tar") :
This looks like a newer LCC archive with four columns in the *-words.txt file.
The two word columns did not match, but we'll only use the first one!
5: In create.corp.freq.object(matrix.freq = table.words, num.running.words = num.running.words, :
NAs introduced by coercion

From this corpus, R always reads in the first 19312 lines. The corpus actually has about 700 thousand lines. Any word after the 19312th line can't be found.

Example:
query(LCC.data, "word", "Proton")

[1] num word lemma tag wclass lttr freq pct pmio log10 rank.avg rank.min
[13] rank.rel.avg rank.rel.min inDocs idf
<0 rows> (or 0-length row.names)

I can't imagine that the corpus is too big. Michalke used a corpus of 1 million sentences in his manual as well...

I am thankful for any hint!

Friedi

How can I extract proper nouns?

I know how to extract proper nouns from a corpus in quanteda with spacyr. But for another corpus I need to use treetagger. I was able to lemmatize the corpus with koRpus and treetagger, but I don't know how to further analyze word forms and parts of speech. For instance, I would like to get a list of all proper nouns within the corpus. How can I do that in koRpus?
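One way to approach this: koRpus's taggedText() accessor returns the tagged tokens as a data frame with token, tag and wclass columns (compare the treetag output shown in other issues here), so the extraction is a plain data-frame filter. The sketch below uses a mock data frame in place of real treetag() output; the NP/NPS tags are the proper-noun tags of the English TreeTagger tagset, and the column values are invented for illustration:

```r
# Mock of the data.frame a call like taggedText(treetag(...)) would return
df <- data.frame(
  token  = c("Alice", "visited", "Berlin", "yesterday"),
  tag    = c("NP", "VBD", "NP", "RB"),
  wclass = c("name", "verb", "name", "adverb"),
  stringsAsFactors = FALSE
)

# Proper nouns: filter on the POS tags (NP/NPS in the English tagset)
proper.nouns <- df$token[df$tag %in% c("NP", "NPS")]
proper.nouns  # "Alice" "Berlin"
```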

TT.tokenizer not found

When running the following code:

set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
res <- treetag(
  file=words,
  treetagger="kRp.env",
  format="obj",
  debug = TRUE
)

I get the following error:

Error in preset.definition[["preset"]](TT.cmd = TT.cmd, TT.bin = TT.bin, : 
object 'TT.tokenizer' not found

The words variable is a vector containing dutch words that are to be lemmatized. I get the same result on both the stable and the development versions. The above error is on Windows 10, perhaps it has something to do with Windows since the above code does work on Linux. When I run the same code with words set to an vector of English words an preset and lang set to "en", no error is given.

Can't really seem to find the problem, hopefully you can help. Thanks in advance!

character vector "measure" seems to be ignored by lex.div; Error in x[["end"]] : subscript out of bounds; Error in 1:lastValidIndex : result would be too long a vector

I want to calculate lexical diversity with koRpus' lex.div function for different texts. I am using the options "keep.tokens=TRUE, type.index=TRUE"; the texts are relatively short (10-150 words). From time to time I get error messages of this kind:

MTLDMA.char: Calculate MTLD-MA values
  |=====================================                                     |  50%Error in 1:lastValidIndex : result would be too long a vector
In addition: Warning messages:
1: Text is relatively short (<100 tokens), results are probably not reliable! 
2: MSTTR: Skipped calculation, segment size is 100, but the text has only 70 tokens! 
3: MATTR: Skipped calculation, window size is 100, but the text has only 70 tokens! 
4: In min(which(all.factorEnds > curr.token)) :
  no non-missing arguments to min; returning Inf

The affected file is here: uF04.txt. It was tagged with TreeTagger before feeding in tag results into lex.div.

While trying to avoid these errors, I ran the same analysis on failed calculations with different parameters, like these:

keep.tokens=TRUE, type.index=TRUE,measure =c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD"))

which results in a different error (even with additional window and segment sizes reduced to 20):

MTLD.char: Calculate MTLD values
  |==========================================================================| 100%
Error in x[["end"]] : subscript out of bounds
In addition: Warning messages:
1: Text is relatively short (<100 tokens), results are probably not reliable! 
2: MSTTR: Skipped calculation, segment size is 100, but the text has only 70 tokens! 
3: MATTR: Skipped calculation, window size is 100, but the text has only 70 tokens! 

Reducing the set of measures to a minimal set (even just "TTR") still gives the same error messages, as well as progress bars for all measures, including those that should not be included.

Unfortunately I can't trace the error, so I need your help. Thanks a lot in advance!

Getting "Awww, this should not happen" error even though the sys.tt.call runs successfully

I am using koRpus to read a list of text files in order to analyze those texts.
Running on Windows 10 with R 4.0.5 x64, using RStudio 1.4.1717 and koRpus v0.13-8.
When I run my R script I always get this error message:
sys.tt.call: perl C:\TreeTagger\cmd\utf8-tokenize.perl -a C:/TreeTagger/lib/german-abbreviations "S:\XX\Politik\tt\ID_15010101.txt" | C:\TreeTagger\bin\tree-tagger.exe C:\TreeTagger\lib\german.par -token -lemma -sgml -pt-with-lemma -quiet

Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a
command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary
files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does not fail but produces a table with proper results, please contact the author!

When I run the sys.tt.call manually, I get the following result, which looks good to me:

Die ART die
Parteien NN Partei
der ART die
Bundesrepublik NN Bundesrepublik
Deutschland NE Deutschland
[...]

This is my code:

install.packages("koRpus")
install.koRpus.lang("de")
library(koRpus.lang.de)

setwd("S:/XX/txt-Dateien_utf8/Politik/tt")

LCC.en <- read.corp.LCC("S:/XX/deu_news_2020_100K.tar.gz")

df.results <- data.frame("0", "0")
names(df.results) <- c("ID", "pmio.mean")
fileNames <- Sys.glob("*.txt")

for (fileName in fileNames) {
# text ID as a constant
k.1 <- fileName
tagged.text1 <- treetag(
  fileName,
  treetagger="manual",
  debug=TRUE,
  lang="de",
  TT.options=list(
    path="C:/TreeTagger/",
    preset="de"
  ),
  doc_id=fileName
)
# frequency analysis
freq.analysis.res <- freq.analysis(tagged.text1, corp.freq=LCC.en)
# statistical analysis of the data, extract the mean of pmio
as(freq.analysis.res, "kRp.text")
l <- corpusFreq(freq.analysis.res)
l.1 <- (l$frq.pmio$summary.all$summary)
l.1 <- unname(l.1)
print(k.1)
k <- print(l.1[4])
k.2 <- toString(k, width = NULL)
# add the results to the results data frame
de<-data.frame(k.1, k.2)
names(de)<-c("ID","pmio.mean")
df.results <- rbind(df.results, de)
}

print(df.results)

Do you have any idea what could be the problem?

Error with Russian - This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed

I'm trying to run an analysis of Russian literary texts. TreeTagger is installed and works from the command line:

cat ~/Projects/R_RussianNLP/Texts/Chechov.txt | cmd/tree-tagger-russian

When I run it through korPus however, I get the following error messages:

tagged.text <- treetag("./Texts/Chechov.txt", lang="ru", treetagger="manual", TT.options=list(path="~/Downloads/TreeTagger", preset="ru"))

Error: Invalid tag(s) found: P--nsaa, P--msga, P--fsla, Vmip3s-a-e, P--fsaa, P-3msdn, P--nsin, P--fsia, P--nsnn, P--msda, P-----r, Vmip3p-m-e, Afpmpnf, Afpmpgf, Vmip3s-m-e, P--fsna, P-3msnn, P-2-snn, Mc---d, P-2-sgn, P--nsna, P-2-sdn, P-----a, P--msia, P-2-san, P-3nsnn, Vmif3s-m-p, P--fsga, P-1-snn, P-2-pin, Ncmsnnp, Vmgp---a-e, P----dn, P-2-pdn, P---pna, Afpmpaf, Afcmsnf, P-1-sgn, Vmip1p-a-e, P-2-pan, Vmip2p-a-e, Vmip1s-a-e, P--msna, P-2-pnn, P-3-pdn, P---pda, P-1-pan, P---paa, Rc, P-1-san, P-1-pdn, P-2-pgn, P--msaa, P-3msin, P----an, P--nsgn, Vmps-smpfpg, Vmpp-smafeg, Vmpp-smmfeg, P---pga, Vmip3p-a-e, P-3fsnn, Vmpp-p-afen, P---pla, Afpmplf, Vmip1s-m-e, Vmip2s-a-e, P-3fsdn, Afpnsns, Vmgp---m-e, Vmps-sfpfpn, P--msla, Afpfsns, P-3fsin, Afpmsns, P--nsdn, P-3msan, P--nsln, P-3-pan, Vmgs---a-p, P----in, Afpmpif
This is probably due to a missing tag in kRp.POS.tags() and
needs to be fixed. It would be nice if you could forward the
above error dump as a bug report to the package maintaner!

Am I doing something wrong here?
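One way to narrow this down (a sketch, assuming a recent koRpus and the Russian language package) is to inspect the tag set the package actually defines for Russian with kRp.POS.tags(), and compare it against the tags TreeTagger produced:

```r
library(koRpus)
library(koRpus.lang.ru)  # assumption: Russian language support package is installed

# list all POS tags koRpus defines for Russian
ru.tags <- kRp.POS.tags("ru", list.tags = TRUE)

# check which of the reported tags are absent from the definition
reported <- c("Mc---d", "P--nsaa", "Vmip3s-a-e")  # a few tags from the error dump
setdiff(reported, ru.tags)
```

If the tags really are missing from the definition, that confirms the package's own diagnosis and the dump is worth forwarding as a bug report.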

TreeTagger does not work in either the koRpus or the textstem package

I installed TreeTagger, but I get an error message in both the textstem and koRpus packages for the code below.
Would you please help me?

#first example:
Error for koRpus package is as below:

Error in is_grouped_df(tbl) : "TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)
English translation:
Error in is_grouped_df(tbl) : no slot named "TT.res" (for an object of class "kRp.text")

#second example
library(textstem)
x <- c(
'the dirtier dog has eaten the pies',
'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag",
'There are skies of blue and red roses too!')
lemma_dictionary2 <- make_lemma_dictionary(x, engine = 'treetagger')

ERROR: Error in dplyr::filter(x@TT.res[c("token", "lemma")], !lemma %in% :
"TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)
(English: no slot named "TT.res" for an object of class "kRp.text")

Error in russian pos tags: Invalid tag(s) found: Mc---d

Hello, I was trying to analyse a russian text by using koRpus:
tagged.text <- treetag("sample_text.txt",treetagger="manual",lang="en",TT.options=list(path="~/bin/treetagger/",preset="ru"),doc_id="sample")
However, there was an error message when I ran the command:
"Invalid tag(s) found: Mc---d
This is probably due to a missing tag in kRp.POS.tags() and
needs to be fixed. It would be nice if you could forward the
above warning dump as a bug report to the package maintaner!"

I've read about a previous issue concerning the russian tags and downloaded the newest release as suggested. However, the problem still exists and I got the above message again.

Is there any solution? Thanks ahead.

TreeTagger working with a dataset in R

Hello! I have a dataset that contains tweets in Russian along with the IDs of the tweets. I want to lemmatize the text of the tweets and get a dataset with the tweet IDs and the lemmas of these tweets. The problem is that, if I understand it correctly, TreeTagger works only with files, not with R datasets. So I exported my dataset as a txt file containing only the text of the tweets, without their IDs, because nothing works if I export the dataset with both text and IDs. With the code below I got this result (in the picture). With this result, I cannot identify the ID of a tweet. What can I do?

# perform POS tagging
set.kRp.env(TT.cmd="C://TreeTagger/bin/tag-russian.bat", lang="ru", encoding = "UTF-8")
postagged <- treetag("C:/Users/Ольга/Documents/new16_20.txt", treetagger="manual",
  lang="ru",
  TT.options=list(
    path=file.path("C://TreeTagger"),
    preset="ru"))
data <- postagged@tokens
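In fact, treetag() can also work on in-memory text via format="obj", so the tweets never have to be written to a file and the IDs can be kept alongside the lemmas. A hedged sketch, assuming a data frame `tweets` with columns `id` and `text`:

```r
library(koRpus)
library(koRpus.lang.ru)  # assumption: Russian language package installed

# tag each tweet directly from the character vector (format="obj"),
# carrying the tweet ID through as the document ID
results <- lapply(seq_len(nrow(tweets)), function(i) {
  tagged <- treetag(
    tweets$text[i],
    format = "obj",
    treetagger = "manual",
    lang = "ru",
    TT.options = list(path = "C:/TreeTagger", preset = "ru"),
    doc_id = as.character(tweets$id[i])
  )
  data.frame(
    id = tweets$id[i],
    lemma = taggedText(tagged)$lemma,   # taggedText() accesses the tokens slot
    stringsAsFactors = FALSE
  )
})
lemmas <- do.call(rbind, results)
```

This keeps the tweet-to-lemma mapping intact without any intermediate txt files.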


POS tagging of a corpus object

First of all, thanks for developing this package!

I am currently working on POS tagging of text corpora in several languages. For German and English I use a combination of spacyr and quanteda.

For additional languages, I would like to use koRpus and TreeTagger. Is there a way to perform POS tagging directly on the text field of a corpus object? Or do you have a script that extracts the text field of a corpus for each document and applies POS tagging in a loop?

Thanks a lot for your help.
Stefan
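There is no built-in bridge between quanteda and koRpus, but the loop is short. A sketch under the assumption that `corp` is a quanteda corpus and the documents are German:

```r
library(quanteda)
library(koRpus)
library(koRpus.lang.de)  # assumption: German language package installed

# one text per document, names are the document IDs
texts <- as.character(corp)

# tag each document in memory via format="obj", keeping its ID
tagged.list <- lapply(names(texts), function(doc) {
  treetag(
    texts[[doc]],
    format = "obj",
    treetagger = "manual",
    lang = "de",
    TT.options = list(path = "C:/TreeTagger", preset = "de"),
    doc_id = doc
  )
})
```

The result is a list of kRp.text objects, one per corpus document, from which tokens and tags can be pulled with taggedText().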

Moving from tm object to koRpus object and vice versa

I have a problem moving from a tm object to a koRpus object. I have to normalize a corpus with tm tools, lemmatize the results with koRpus, and return to tm to categorize the results. To do this, I have to transform the tm object into an R data frame, which I then transform into an Excel file, then into a txt file, and finally into a koRpus object. This is the code:

#from VCORPUS to DATAFRAME 
dataframeD610P<-data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)

#from DATAFRAME to XLSX 
#library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")

#open with excel 
#save in csv (UTF-8)

#import in KORPUS and lemmatization with KORPUS/TREETAGGER 

tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T)) 

Then I need to do it all backwards to get back to tm. This is the code:

#from KORPUS to TXT 
write.table(tagged.results@TT.res$lemma, ".\\mycorpusLEMMATIZED.txt")

#open with a text editor and formatting of the text

#from TXT to R
Lemma1.POS<- readLines(".\\mycorpusLEMMATIZEDfrasi.txt", encoding = "UTF-8")

#from R object to DATAFRAME
Lemma2.POS<-as.data.frame(Lemma1.POS, encoding = "UTF-8")

#from DATAFRAME to CORPUS
CorpusPOSlemmaFINAL = Corpus(VectorSource(Lemma2.POS$Lemma1.POS))

Is there a more elegant solution to do this without leaving R? I’d really appreciate any help or feedback.
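The Excel/txt round trip can probably be skipped entirely: treetag() accepts a character vector with format="obj", and the lemmas can go straight back into a tm corpus. A sketch, assuming `Corpus.TotPOS` is the normalized tm VCorpus from above:

```r
library(tm)
library(koRpus)
library(koRpus.lang.it)  # assumption: Italian language package installed

# extract the plain text of each document from the tm corpus
texts <- sapply(Corpus.TotPOS, function(d) paste(content(d), collapse = " "))

# lemmatize each document in memory, no intermediate files
lemmatized <- sapply(texts, function(txt) {
  tagged <- treetag(txt, format = "obj", treetagger = "manual", lang = "it",
                    sentc.end = c(".", "!", "?", ";", ":"),
                    TT.options = list(path = "C:/TreeTagger", preset = "it-utf8",
                                      no.unknown = TRUE))
  paste(taggedText(tagged)$lemma, collapse = " ")
})

# back to a tm corpus without leaving R
CorpusPOSlemmaFINAL <- Corpus(VectorSource(lemmatized))
```

This keeps the whole pipeline inside one R session and avoids the manual Excel and text-editor steps.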

readability() returns error message

Hi there,

first, thanks for this great package! I have been using it for more than a year for my research and it is very useful.

However, since I updated to the latest version, I can't get readability() to run anymore.

I understand that some data structure was modified. For example, treetag()'s output used to be named tagged.txt@TT.res, and it has been changed to tagged.txt@tokens. So maybe I am simply not calling the kRp.text object properly when using readability()...

Would you have some insight??

Below is a simple reproducible version of my code

Thank you very much for your help!

  • Aurelie

# installs and loads Fr package
install.koRpus.lang("fr")
library("koRpus.lang.fr")

# sets environment for treetagger - FRENCH VERSION
set.kRp.env(TT.cmd = "cmd/tree-tagger-french", lang="fr",
  format = "obj", encoding = "Latin1")

# runs TreeTagger on a short text
tagged.results <- treetag("La salive est un liquide produit par des glandes spéciales situées à plusieurs endroits dans la bouche. Elle permet d'enrober les aliments d'eau afin de permettre un passage facile dans l'œsophage et de faciliter le travail de digestion dans l'estomac. La salive contient aussi une enzyme qui permet de commencer la digestion de l'amidon des plantes. Mais il faut que l'aliment qui contient l'amidon soit cuit car il n'est pas possible de digérer l'amidon cru. L'amidon est une forme de sucre trop gros pour être absorbé directement dans les intestins. Il faut donc séparer un à un ses composants ; c'est le travail de cette enzyme appelée amylase. En mastiquant du pain, qui contient de l'amidon, pendant quelques instants, on s'aperçoit que le goût devient sucré : l'amylase a commencé son travail et des molécules de sucre sont libérées.",
  format = "obj",
  apply.sentc.end = TRUE, sentc.end = ".", add.desc = TRUE)

# runs readability indexes estimation
readability(tagged.results)

Here is the ERROR MESSAGE that I get:
Error in if (raw >= 90) { : argument is of length zero
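"argument is of length zero" in the Flesch branch usually means the value being compared (`raw`) is empty, i.e. the descriptive or hyphenation data readability() needs did not get computed. One thing worth trying (a sketch of the documented workflow, assuming the French hyphenation patterns are installed) is supplying the hyphenation data explicitly:

```r
library(koRpus)
library(sylly.fr)  # assumption: French hyphenation patterns installed

# syllable counts are required by Flesch and several other formulas
hyph <- hyphen(tagged.results)
readability(tagged.results, hyphen = hyph)
```

If hyphen() itself fails or returns an empty result, that would point to the hyphenation patterns rather than readability() as the culprit.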

Invalid tag(s)

When I try to run the following code on a (Dutch) character object I get the error below:

library("koRpus")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="en", preset="en")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="en"))


Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed.
It would be nice if you could forward the above error dump as a bug report to the package maintaner!


I tried different solutions but none of them got me closer to a solution. I don't know if this is a problem related to my own configuration or the package code, but since the error message asked to forward the error dump as a bug report, I'm putting it here.
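The numeric "tags" look like output from the Dutch parameter file being validated against the English tag set, since the call combines a Dutch tagger script with lang="en". A sketch of one possible fix, under the assumption that the Dutch language package koRpus.lang.nl is available (from the package author's repository rather than CRAN) and provides an "nl" preset:

```r
library(koRpus)
library(koRpus.lang.nl)  # assumption: installed from the project repository

# match lang and preset to the Dutch parameter files
output <- treetag(
  text,                    # the Dutch character object from above
  format = "obj",
  treetagger = "manual",
  lang = "nl",
  TT.options = list(path = "c:/TreeTagger", preset = "nl")
)
```

With lang and preset agreeing on the language, the returned tags should be checked against the Dutch tag set instead of the English one.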

Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, : 'data' must be of a vector type, was 'NULL'

Does this debug=TRUE help you to understand what is the cause of the error execution?

> tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
+                           TT.tknz=FALSE , lang="en",
+                           debug = TRUE,
+                           TT.options=list(path="TreeTagger", preset="en"))
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
 
        TT.tokenizer:  koRpus::tokenize() 
				tempfile: C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef94305e2e24.txt 
        file:  C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tempTextFromObjectf942ee2415d.txt 
        TT.lookup.command:   
        TT.pre.tagger:   
        TT.tagger:  TreeTagger/bin/tree-tagger.exe 
        TT.opts:  -token -lemma -sgml -pt-with-lemma -quiet 
        TT.params:  TreeTagger/lib/english-utf8.par 
        TT.filter.command:  | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' 

        sys.tt.call:  type  C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef94305e2e24.txt |   TreeTagger/bin/tree-tagger.exe TreeTagger/lib/english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' 

Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE,  : 
  'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\Windows\system32\cmd.exe /c type  C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef94305e2e24.txt |   TreeTagger\bin\tree-tagger.exe TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 9 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250    LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=Polish_Poland.1250    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] koRpus_0.10-2       data.table_1.9.6    gradientr_0.0.1     RWeka_0.4-33        tm_0.7-1            NLP_0.1-10         
 [7] stringi_1.1.5       NbClust_3.0         cluster_2.0.5       factoextra_1.0.4    foreach_1.4.3       openxlsx_3.0.0     
[13] networkD3_0.3       VennDiagram_1.6.17  futile.logger_1.4.3 Boruta_5.2.0        ranger_0.6.0        scales_0.4.1       
[19] ggmosaic_0.1.2      productplots_0.1.1  corrplot_0.77       stringr_1.2.0       magrittr_1.5        dplyr_0.5.0        
[25] purrr_0.2.2         readr_1.0.0         tidyr_0.6.1         tibble_1.2          tidyverse_1.0.0     readxl_0.1.1       
[31] haven_1.0.0         plyr_1.8.4          tables_0.8          Hmisc_4.0-2         ggplot2_2.2.1       Formula_1.2-1      
[37] survival_2.40-1     lattice_0.20-34    

loaded via a namespace (and not attached):
 [1] devtools_1.12.0      RColorBrewer_1.1-2   httr_1.2.1           tools_3.3.1          backports_1.0.4      R6_2.2.0            
 [7] rpart_4.1-10         DBI_0.5-1            lazyeval_0.2.0       colorspace_1.3-1     nnet_7.3-12          withr_1.0.2         
[13] sp_1.2-4             gridExtra_2.2.1      compiler_3.3.1       chron_2.3-47         htmlTable_1.9        flashClust_1.01-2   
[19] plotly_4.5.6         labeling_0.3         slam_0.1-40          checkmate_1.8.2      digest_0.6.10        foreign_0.8-67      
[25] ca_0.70              base64enc_0.1-3      jpeg_0.1-8           htmltools_0.3.5      maps_3.1.1           RWekajars_3.9.1-3   
[31] FactoMineR_1.35      htmlwidgets_0.8      jsonlite_1.1         acepack_1.4.1        wordcloud_2.5        leaps_3.0           
[37] geosphere_1.5-5      Matrix_1.2-7.1       Rcpp_0.12.8          munsell_0.4.3        proto_1.0.0          scatterplot3d_0.3-39
[43] MASS_7.3-45          parallel_3.3.1       ggrepel_0.6.5        splines_3.3.1        mapproj_1.2-4        knitr_1.15          
[49] igraph_1.0.1         rjson_0.2.15         reshape2_1.4.2       codetools_0.2-15     futile.options_1.0.0 kohonen_3.0.2       
[55] latticeExtra_0.6-28  lambda.r_1.1.9       spam_1.4-0           png_0.1-7            RgoogleMaps_1.4.1    gtable_0.2.0        
[61] assertthat_0.1       viridisLite_0.1.3    rJava_0.9-8          iterators_1.0.8      memoise_1.0.0        fields_8.10         
[67] ggmap_2.6.1 
