
koRpus

koRpus is an R package for text analysis. This includes, amongst others, a wrapper for the POS tagger TreeTagger, functions for automatic language detection, hyphenation, several indices of lexical diversity (e.g., type token ratio, HD-D/vocd-D, MTLD) and readability (e.g., Flesch, SMOG, LIX, Dale-Chall, Tuldava).

koRpus also includes a plugin for RKWard, a powerful GUI and IDE for R, providing graphical dialogs for its basic features. To make full use of this feature, please install RKWard (plugins are detected automatically).

More information on koRpus is available on the project homepage.

Installation

There are three easy ways of getting koRpus:

Stable releases via CRAN

The latest release that is considered stable for productive work can be found on the CRAN mirrors, which means you can install it from a running R session like this:

install.packages("koRpus")

The CRAN packages usually lag a bit behind the current state of the package and are only updated after a significant number of changes or important bug fixes.

Development releases via the project repository

Between stable CRAN releases, several testing or development versions are usually published on the project's own repository. These releases should also work without problems, but they are intended to test new features or supposed bug fixes and to gather feedback before the next release goes to CRAN.

Installation is fairly easy, too:

install.packages("koRpus", repos=c(getOption("repos"), reaktanz="https://reaktanz.de/R"))

To automatically get updates, consider adding the repository to your R configuration. You might also want to subscribe to the package's RSS feed to get notified of new releases.
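Adding the repository to your R configuration might look like the following sketch, e.g. placed in your `~/.Rprofile` (the file location and the `"reaktanz"` entry name are conventions of this example, not requirements):

```r
# Sketch: append the project repository to the repositories R already knows,
# so install.packages() and update.packages() also see koRpus releases there.
local({
  r <- getOption("repos")
  r["reaktanz"] <- "https://reaktanz.de/R"
  options(repos = r)
})
# afterwards, getOption("repos") includes the "reaktanz" entry
```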

If you're running a Debian based operating system, you might be interested in the precompiled *.deb packages.

Installation via GitHub

To install it directly from GitHub, you can use install_github() from the devtools package:

devtools::install_github("unDocUMeantIt/koRpus") # stable release
devtools::install_github("unDocUMeantIt/koRpus", ref="develop") # development release

Installing language support

koRpus does not support any particular language out-of-the-box. Therefore, after installing the package you will also have to install at least one language support package to really make use of it. You can find these in the l10n repository; they are called koRpus.lang.*.

The most straightforward way to get these packages is the function install.koRpus.lang(). Here's an example of how to install support for English and German:

library(koRpus)
install.koRpus.lang(lang=c("en", "de"))

There are also precompiled Debian packages.

Contributing

To ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please either subscribe to the koRpus-dev mailing list, or use the issue tracker on GitHub.

Branches

Please note that all development happens in the develop branch. Pull requests against the master branch will be rejected, as it is reserved for the current stable release.

Licence

Copyright 2012-2021 Meik Michalke [email protected]

koRpus is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

koRpus is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with koRpus. If not, see https://www.gnu.org/licenses/.

koRpus's People

Contributors

adamspannbauer, undocumeantit

koRpus's Issues

download of package ‘koRpus.lang.de’ failed

I am trying to run the following script:

tagged = treetag("message.txt", treetagger=".../treetagger/cmd/tree-tagger-german", lang="de", TT.options=list(tokenizer = "tree-tagger-german" , tagger = "tree-tagger" , params = "german-utf8.par" ) , format = "file" , stopwords=stopwords::stopwords("de", source = "stopwords-iso"))

Since today, I get the following error message:

Error: Unknown tag definition requested: de

The command "available.koRpus.lang()" tells me that I have not installed "koRpus.lang.de"

When running "install.koRpus.lang("de")" or "install.packages("koRpus.lang.de", repos="https://undocumeantit.github.io/repos/l10n/")", I get the following error message:

There are binary versions available but the source versions are later:
binary source needs_compilation
sylly.de 0.1-2 0.1-3 FALSE
koRpus.lang.de 0.1-1 0.1-2 FALSE
installing the source packages ‘sylly.de’, ‘koRpus.lang.de’
trying URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/sylly.de_0.1-3.tar.gz'
Warning in install.packages :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/sylly.de_0.1-3.tar.gz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/sylly.de_0.1-3.tar.gz'
Warning in install.packages :
download of package ‘sylly.de’ failed
trying URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/koRpus.lang.de_0.1-2.tar.gz'
Warning in install.packages :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/koRpus.lang.de_0.1-2.tar.gz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://undocumeantit.github.io/repos/l10n/src/contrib/koRpus.lang.de_0.1-2.tar.gz'
Warning in install.packages :
download of package ‘koRpus.lang.de’ failed

I also tried installing other languages like Spanish, French or Italian, but I always get the same error message.

I work with R version 3.5.1 on a Mac (macOS Mojave Version 10.14)

Incorrect calculation of MTLD?

Thank you for a useful package.

I noticed a discrepancy in the calculation of MTLD between the algorithm explained in McCarthy and Jarvis's (2010) paper and the way koRpus::MTLD.calc() calculates it.

On McCarthy and Jarvis (2010, p.385):

The total number of words in the text is divided by the total factor count. For example, if the text = 340 words and the factor count = 4.404, then the MTLD value is 77.203. Two such MTLD values are calculated, one for forward processing and one for reverse processing. The mean of the two values is the final MTLD value.

So it is the two MTLD values (i.e., forward and reverse) whose mean should be calculated.

On the other hand, the relevant part in koRpus::MTLD.calc() calculates the mean of two factors (i.e., denominators), and the number of tokens is divided by the mean factor:

mtld.res.mean <- mean(c(mtld.res.forw[["factors"]], mtld.res.back[["factors"]]))
mtld.res.value <- num.tokens/mtld.res.mean

The two methods yield different results as mean(c(N/f1, N/f2)) and N/mean(c(f1, f2)) are different.
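A minimal numeric check makes the discrepancy concrete. The forward factor count 4.404 is taken from the paper's example; the reverse factor count is made up for illustration:

```r
num.tokens   <- 340
factors.forw <- 4.404  # forward factor count (example from the paper)
factors.back <- 4.200  # reverse factor count (hypothetical)

# As described by McCarthy & Jarvis (2010): mean of the two MTLD values
mtld.paper <- mean(c(num.tokens / factors.forw, num.tokens / factors.back))

# As computed in koRpus::MTLD.calc(): tokens divided by the mean factor count
mtld.korpus <- num.tokens / mean(c(factors.forw, factors.back))

mtld.paper   # approx. 79.077
mtld.korpus  # approx. 79.033
```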

If this is not by design, would you mind fixing the issue when you can?

Working in Python?

It looked like koRpus had really good lemmatization functionality, but I'm working in Python. Is there a good Python wrapper for this? Or is it derived from something that might have a Python wrapper?

Error: english-lexicon.txt not found

Hello, I am having trouble using TreeTagger via koRpus. After looking for a while, I cannot find the .txt file listed in the error below, nor have I found a workaround. Thanks for your help.

I am using the following versions -
R: 3.6.1
koRpus: 0.13-3
Treetagger: Mac OSX 3.2.3

tagged.text <- treetag("unabomber_manifesto.txt", treetagger="manual", lang="en",
TT.options=list( path="~/Downloads/mytreetagger/bin ",preset="en"))

Error: None of the following files were found, please check your TreeTagger installation!
/Users/tuc50262/Downloads/mytreetagger/bin/lib/english-lexicon.txt
/Users/tuc50262/Downloads/mytreetagger/bin/lib/english-lexicon

No S3 method for read.udhr

Function guess.lang attempts to call read.udhr to read the zip file. There is no read.udhr S3 method, so it fails.

Error in reading corpus database

I downloaded the following file from this link (https://wortschatz.uni-leipzig.de/en/download/English):

eng_news_2020_1M.tar.gz

Then I ran the following code; however, it returned an error message.

LCC.en <- read.corp.LCC(here('data','eng_news_2020_1M.tar.gz'))

Fetching needed files from LCC archive... done.
Warning messages:
1: In readLines(LCC.file.con, n = n) :
  invalid input found on input connection 'C:\Users\cengiz\AppData\Local\Temp\RtmpWiHf2g\koRpus.LCC575c134d78b1/eng_news_2020_1M/eng_news_2020_1M-words.txt'
2: In readLines(LCC.file.con, n = n) :
  incomplete final line found on 'C:\Users\cengiz\AppData\Local\Temp\RtmpWiHf2g\koRpus.LCC575c134d78b1/eng_news_2020_1M/eng_news_2020_1M-words.txt'
3: In matrix(unlist(strsplit(rL.words, "\t")), ncol = 3, byrow = TRUE,  :
  data length [133] is not a sub-multiple or multiple of the number of rows [45]
4: In create.corp.freq.object(matrix.freq = table.words, num.running.words = num.running.words,  :
  NAs introduced by coercion

Is there a way to fix this? I am using a Windows PC.

Thank you.

URLs and sequences of punctuation in documents cause some readability measures to fail

Our corpus contains some documents that contain URLs and more rarely sequences of punctuation that cause some of the readability measures to fail.

For a minimal example, I present a length-3 text vector, with the first element an example that should work, to make sure I am calling the functions correctly. The second element contains a couple of words and a URL, and the third contains a couple of words and a sequence of punctuation. Our actual documents are much longer, but these examples show the issues we have encountered. I then show output from these examples for ARI, flesch.kincaid and FOG: ARI works in all cases, while the last two fail in different ways. flesch.kincaid fails on the first two examples and works on the third; FOG works for the first and third examples.

The transcript:

require("koRpus.lang.en", quietly=T)
koRpus::set.kRp.env(lang="en")
responses <- c("hi mom", "this fails http://www.thedailybeast.com/articles/2012/06/28/did-chief-justice-roberts-take-a-cue-from-two-centuries-ago.html", "this fails ,,..$%#@")
tt <- lapply(responses, koRpus::tokenize, format="obj")
ARI(tt[[1]])

Automated Readability Index (ARI)
Parameters: default
Grade: -8.65

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

ARI(tt[[2]])

Automated Readability Index (ARI)
Parameters: default
Grade: 19.71

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

ARI(tt[[3]])

Automated Readability Index (ARI)
Parameters: default
Grade: -5.8

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

flesch.kincaid(tt[[1]])
Hyphenation (language: en)
Error in validObject(.Object) :
invalid class “kRp.hyphen” object: invalid object for slot "hyphen" in class "kRp.hyphen": got class "NULL", should be or extend class "data.frame"
In addition: Warning message:
In mean.default(hyph.df$syll, na.rm = TRUE) :
argument is not numeric or logical: returning NA
flesch.kincaid(tt[[2]])
Hyphenation (language: en)
|============================================================= | 88%Error in all.patterns[[word.length]] : subscript out of bounds
flesch.kincaid(tt[[3]])
Hyphenation (language: en)
|======================================================================| 100%

Flesch-Kincaid Grade Level
Parameters: default
Grade: -2.62
Age: 2.38

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

FOG(tt[[1]])
Hyphenation (language: en)

Gunning Frequency of Gobbledygook (FOG)
Parameters: default
Grade: 0.8

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

FOG(tt[[2]])
Hyphenation (language: en)
|========================================================== | 83%Error in all.patterns[[word.length]] : subscript out of bounds
FOG(tt[[3]])
Hyphenation (language: en)

Gunning Frequency of Gobbledygook (FOG)
Parameters: default
Grade: 1.2

Text language: en
Warning message:
Text is relatively short (<100 tokens), results are probably not reliable!

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: r-project/R-3.6.2/lib/libRblas.so
LAPACK: r-project/R-3.6.2/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] koRpus.lang.en_0.1-3 koRpus_0.11-5 sylly_0.1-5

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2 data.table_1.12.8 sylly.en_0.1-3

I hope this information is helpful. Thank you for your time.

Jen

Error in path.expand(path) : argument 'path' incorrect

Dear koRpus team,

I have been trying to run the treetag function in R for over 2 days and cannot go past the following error:
Error in path.expand(path) : argument 'path' incorrect

I used the following code:

library("koRpus");library("koRpus.lang.fr");
TEST = treetag("GNGResp.txt", treetagger = "manual", lang = "fr", TT.options = list(path="C:/TreeTagger", preset="fr"))

I installed Treetagger strictly following their INSTALL.txt explanations and TreeTagger is working when I call it from windows CMD.

Also, this installation was done in the root directory of the C:/ drive, as recommended.

Best wishes,

Corentin

- Microsoft Windows 10 Family v. 10.0.18363 
-  R version 4.0.2 
- koRpus version 0.13.1 

Issue on Windows

Hi,

I'm using R and the textstem package in Windows (10, 64-bit), which relies on koRpus. I spotted what appears to be two bugs:

  1. In the shell(sys.tt.call) for when it's not unix.OS, translate = TRUE converts the /'s in the regex substitution in TT.filter.command's "| perl -pe ..." into backslashes, which breaks perl. The path to TreeTagger by that time contains double backslashes due to normalizePath being called earlier, so using translate=FALSE keeps the substitution's /'s as-is and the path is still fine.
  2. The line 'cat(paste(tknz.results, collapse = "\n"), "\n", file = tknz.tempfile)' requires a sep="" or else there's a hidden space appended to the last line/token, resulting in TreeTagger's tagging that last token as .
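The second point can be reproduced in plain R: without sep="", cat() joins its arguments with a space, so a stray space ends up after the last token (the file names here are just temporary files for the demonstration):

```r
tokens    <- c("The", "dog")
f.default <- tempfile()
f.fixed   <- tempfile()

# default sep=" " inserts a space between the pasted tokens and the final "\n"
cat(paste(tokens, collapse = "\n"), "\n", file = f.default)
# with sep="" the last token line ends cleanly
cat(paste(tokens, collapse = "\n"), "\n", file = f.fixed, sep = "")

readLines(f.default)  # "The" "dog "  <- trailing space on the last token
readLines(f.fixed)    # "The" "dog"
```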

I am using R from a Cygwin environment, but I believe these errors would occur from a normal Windows environment.

Thanks!

Best,
Jay

option lexicon

I'm trying to apply an auxiliary lexicon (lexicon = "namefile.txt"), but there appears to be a bug in the function, because the path is written as follows:

C:\TreeTagger\lib/file.txt

Therefore, on Windows, even when the file does exist in the folder, I receive the error:

Error in file.info(x, extra_cols = FALSE) : argomento 'filename' non valido (non valid filename).

Thank you

Improperly created matrix due to bad strsplit() when processing certain texts with unusual UTF-8 characters

R version 3.2.3
koRpus version 0.06-4 (latest dev branch)

When processing certain texts, koRpus::treetag returns

"Error: Invalid tag(s) found: ¶¥¥, , @card@, *ÿÀ3xÿÀ3z, Mac This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maintainer!"

The specific "invalid tags" vary based on the text itself. Prior to the error, it throws a warning:

Warning in matrix(unlist(strsplit(tagged.text, "\t")), :
data length [6736] is not a sub-multiple or multiple of the number of rows [2246]

which seems to suggest that strsplit is not properly splitting the results from the external tree-tagger program, leading to the matrix call building the columns incorrectly (see details below and attached file for examples of wrong columns in returned matrix).

I first checked to ensure that the external tree-tagger program was returning a valid tagged output with these texts. It was fine.

Upon further testing, I was able to reproduce it with 3 different texts, all with similar strsplit-related warnings screwing up which columns are token, tag, and lemma. Unsurprisingly, that then screws up the tag reading since it's reading a value in a column that was populated incorrectly.

The statement is on line 425 of treetag.R in the koRpus source code:
tagged.mtrx <- matrix(unlist(strsplit(tagged.text, "\t")), ncol=3, byrow=TRUE, dimnames=list(c(),c("token","tag","lemma")))

strsplit is almost certainly the culprit, probably mis-splitting certain special cases. The texts in question all have bug report data in them including special characters (UTF-8). I confirmed that the text is being passed to koRpus::treetag with UTF-8 encoding. I also experimented adding the useBytes=TRUE and/or perl=TRUE options to strsplit and recompiling. No change.
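The failure mode can be illustrated without TreeTagger: if one output line loses a field (or a token smuggles in an extra tab), unlist(strsplit(...)) no longer has a length divisible by 3, and matrix() recycles values so the columns shift. The lines below are hypothetical TreeTagger output, not from the actual bug reports:

```r
# Hypothetical TreeTagger output lines: token \t tag \t lemma
good <- c("The\tDT\tthe", "dog\tNN\tdog")
bad  <- c("The\tDT\tthe", "dog\tNN")  # lemma field lost for one line

matrix(unlist(strsplit(good, "\t")), ncol = 3, byrow = TRUE,
       dimnames = list(c(), c("token", "tag", "lemma")))
# well-formed: 2 rows, columns line up

m <- suppressWarnings(
  matrix(unlist(strsplit(bad, "\t")), ncol = 3, byrow = TRUE)
)
m  # 5 values recycled into 6 cells: the second row's lemma column is garbage
```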

In searching for an alternative to strsplit, I came across the package stringi, which appears to be specifically targeted at proper UTF-8 handling, possibly better (and almost certainly faster) than strsplit: Stringi homepage. In particular, its stri_split functions look like they may provide an inline replacement for the whole matrix line, as they can return a matrix constructed by row in a single command. I haven't yet experimented with it, but I wonder if it may be a solution.

In summary, these warnings & resulting errors appear only on a very small portion of the bug reports that I'm processing (maybe 0.1% or so, of around 1M--the three provided are a tiny subset; there are actually thousands that won't process in my whole dataset), which leads me to suspect it's related to strsplit not properly handling a rare/unused UTF-8 character that only appears in a small portion of bug reports, possibly when the bug reports include hex-encoded binary dumps for troubleshooting (though not even in all such cases).

Attached you will find a file with the 3 texts that trigger the error/warning, organized as bug_id, full text content enclosed in ========= for endpoint clarity, followed by the warning + error messages. Each bug is separated by ---------------- for start/endpoint clarity. If you wish, I can also provide similar texts that process fine and don't result in the warning + error.

Please let me know if I can provide any other useful information for debugging.

Thanks once again for the excellent package! ^_^

bug descriptions triggering koRpus_treetag warning & error.txt

Flesch Formula multiplier

I noticed that the multiplier of the average syllable length in the Documentation of koRpus (page 25 and 87) is 84.6. On Wikipedia and in the original 1948 paper by Flesch it was 85.6.
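For reference, the commonly cited Flesch Reading Ease formula is RE = 206.835 - 1.015 * (words/sentences) - m * (syllables/words), where the disputed multiplier m is 84.6 or 85.6. A quick sketch (with hypothetical text statistics) shows how much the two variants differ:

```r
# Flesch Reading Ease; asw.mult is the disputed syllables-per-word multiplier
flesch <- function(words, sentences, syllables, asw.mult = 84.6) {
  206.835 - 1.015 * (words / sentences) - asw.mult * (syllables / words)
}

# hypothetical text: 100 words, 5 sentences, 150 syllables
flesch(100, 5, 150)                   # multiplier 84.6 (as documented in koRpus)
flesch(100, 5, 150, asw.mult = 85.6)  # multiplier 85.6 (as the issue suggests)
# the two scores differ by exactly (85.6 - 84.6) * ASW = 1.5 here
```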

Incorrect lemma and tag/wclass for final word when using koRpus::treetag

When I run koRpus::treetag, the final word always results in an "<unknown>" lemma even if it should be known. Also, the word doesn't appear to be consistently classified correctly according to POS.

text = "This is a test"
test = treetag(text, treetagger = "manual", format = "obj", TT.tknz=FALSE, lang="en", TT.options = list(path="./TreeTagger", preset="en"))

#   doc_id token tag     lemma lttr     wclass desc stop stem idx sntc
# 1   <NA>  This  DT      this    4 determiner   NA   NA   NA   1   NA
# 2   <NA>    is VBZ        be    2       verb   NA   NA   NA   2   NA
# 3   <NA>     a  DT         a    1 determiner   NA   NA   NA   3   NA
# 4   <NA> test   NN <unknown>    5       noun   NA   NA   NA   4   NA

If the final word is replaced with something else, the same result happens to the new final word (whereas the previous final word now works fine). In this example, "again" is classified as a noun.

text = "This is a test again"

#   doc_id  token tag     lemma lttr     wclass desc stop stem idx sntc
# 1   <NA>   This  DT      this    4 determiner   NA   NA   NA   1   NA
# 2   <NA>     is VBZ        be    2       verb   NA   NA   NA   2   NA
# 3   <NA>      a  DT         a    1 determiner   NA   NA   NA   3   NA
# 4   <NA>   test  NN      test    4       noun   NA   NA   NA   4   NA
# 5   <NA> again   NN <unknown>    6       noun   NA   NA   NA   5   NA

I'm using:

sessionInfo()
# R version 3.5.2 (2018-12-20)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS Mojave 10.14.2

packageVersion("koRpus")
# [1] ‘0.11.5’

It looks like it may be adding an extra space to the end of the final word (based on the token output and lttr result), but I don't know if this is causing the issue.

Error: Specified directory cannot be found: ~/bin/treetagger/bin

It seems that no matter where I place the document files, I continue to receive the same error in R:

Error: Specified directory cannot be found: ~/bin/treetagger/bin

The code I am using is below. I have installed TreeTagger using the terminal with no errors.

tagged.text <- treetag( "sample_text.txt", treetagger="manual", lang="en", TT.options=list( path="~/bin/treetagger", preset="en" ), doc_id="sample" )

Can anyone help with this issue?

last character being truncated in koRpus::treetag

The last character is being truncated from the input object of koRpus::treetag() when TT.tknz is FALSE. (see "dog" being truncated to "do" in the last row of the output table). This issue is not present when TT.tknz is set to TRUE.

example

doc <- "The quick brown fox jumped over the lazy dog"

# pre bug fix in R/treetag.R
koRpus::treetag(doc, treetagger = "manual", format = "obj", 
                encoding = "UTF-8", lang = "en", TT.tknz = FALSE, 
                TT.options = list(path = "/u/application/TreeTagger", preset = "en"))
#    token tag lemma lttr      wclass                                             desc stop stem
# 1    The  DT   the    3  determiner                                       Determiner   NA   NA
# 2  quick  JJ quick    5   adjective                                        Adjective   NA   NA
# 3  brown  JJ brown    5   adjective                                        Adjective   NA   NA
# 4    fox  NN   fox    3        noun                           Noun, singular or mass   NA   NA
# 5 jumped VBD  jump    6        verb                      Verb, past tense of "to be"   NA   NA
# 6   over  IN  over    4 preposition         Preposition or subordinating conjunction   NA   NA
# 7    the  DT   the    3  determiner                                       Determiner   NA   NA
# 8   lazy  JJ  lazy    4   adjective                                        Adjective   NA   NA
# 9     do VBP    do    2        verb Verb, non-3rd person singular present of "to be"   NA   NA

sessionInfo()
# R version 3.3.3 (2017-03-06)
# Platform: x86_64-redhat-linux-gnu (64-bit)
# Running under: Red Hat Enterprise Linux Server 7.3 (Maipo)

packageVersion("koRpus")
# [1] ‘0.10.2’

The issue seems to stem from this line in R/treetag.R, because no newline is written after the contents of the file are cat out. Adjusting the function to cat a new line after the user's input object appears to fix the issue. I will submit a pull request with the fix I implemented for review.

Missing tags for Danish

Hi,

I'm trying to apply POS-tagging on a sample of Danish text with TreeTagger and the "official" Danish parameter file. However, it seems that tags produced are not recognized by koRpus.

I'm wondering if this is simply due to the fact that there is no Danish language pack?

library("koRpus")
library("koRpus.lang.en")
  
treetag("~/bin/treetagger/test.txt",
        treetagger = "manual",
        lang = "en",
        TT.options = list(path = "~/treetagger",
                          tokenizer = "tree-tagger-danish",
                          tagger = "tree-tagger",
                          params = "danish.par",
                          abbrev = "danish-abbreviations")
        )
#> Warning: Invalid tag(s) found: PM:s-un:--:----, VF:----:sa:----, PI:s-uc:--:----, AD:----:--:p---, AC:siu§:--:p---, NC:siuc:--:----, T-:----:--:----, NC:siun:--:----
#>   This is probably due to a missing tag in kRp.POS.tags() and
#>   needs to be fixed. It would be nice if you could forward the
#>   above warning dump as a bug report to the package maintaner!
#>   doc_id token             tag lemma lttr   wclass desc stop stem idx sntc
#> 1   <NA> Dette PM:s-un:--:---- denne    5  unknown   NA   NA   NA   1    1
#> 2   <NA>    er VF:----:sa:----  være    2  unknown   NA   NA   NA   2    1
#> 3   <NA>    en PI:s-uc:--:----    en    2  unknown   NA   NA   NA   3    1
#> 4   <NA> meget AD:----:--:p--- megen    5  unknown   NA   NA   NA   4    1
#> 5   <NA>  kort AC:siu§:--:p---  kort    4  unknown   NA   NA   NA   5    1
#> 6   <NA> tekst NC:siuc:--:---- tekst    5  unknown   NA   NA   NA   6    1
#> 7   <NA>    på T-:----:--:----    på    2  unknown   NA   NA   NA   7    1
#> 8   <NA> dansk NC:siun:--:---- dansk    5  unknown   NA   NA   NA   8    1
#> 9   <NA>     .            SENT     .    1 fullstop   NA   NA   NA   9    1

incomplete import of LCC corpus

Hey,

I want to do a frequency analysis with my text, using an LCC corpus [http://wortschatz.uni-leipzig.de/en/download/].
Unfortunately koRpus only reads in some of the first hundred lines, the frequency analysis then fails (indices are mostly "NA"). I've tried different copora, the import is always incomplete, the number of lines that are read in varies with each corpus, I don't see systematics.
When reading in the corpus in I get the following output.

LCC.data <- read.corp.LCC("deu_news_2015_1M.tar")
output:
Fetching needed files from LCC archive... done.
Warning messages:
1: In readLines(LCC.file.con, n = n) :
invalid input found on input connection 'C:\Users\rieke\AppData\Local\Temp\RtmpKwiXOi\koRpus.LCC3d9049ac4d87/deu_news_2015_1M/deu_news_2015_1M-words.txt'
2: In readLines(LCC.file.con, n = n) :
incomplete final line found on 'C:\Users\rieke\AppData\Local\Temp\RtmpKwiXOi\koRpus.LCC3d9049ac4d87/deu_news_2015_1M/deu_news_2015_1M-words.txt'
3: In matrix(unlist(strsplit(rL.words, "\t")), ncol = 4, byrow = TRUE, :
data length [77246] is not a sub-multiple or multiple of the number of rows [19312]
4: In read.corp.LCC("deu_news_2015_1M.tar") :
This looks like a newer LCC archive with four columns in the *-words.txt file.
The two word columns did not match, but we'll only use the first one!
5: In create.corp.freq.object(matrix.freq = table.words, num.running.words = num.running.words, :
NAs introduced by coercion

From this corpus, R always reads in the first 19312 lines. The corpus actually has about 700 thousand lines. Any word after the 19312th line can't be found.

Example:
query(LCC.data, "word", "Proton")

[1] num word lemma tag wclass lttr freq pct pmio log10 rank.avg rank.min
[13] rank.rel.avg rank.rel.min inDocs idf
<0 rows> (or 0-length row.names)

I can't imagine that the corpus is too big. Michalke used a corpus of 1 million sentences in his manual as well...

I am thankful for any hint!

Friedi

How can I extract proper nouns?

I know how to extract proper nouns from a corpus in quanteda with spacyr. But for another corpus I need to use treetagger. I was able to lemmatize the corpus with koRpus and treetagger, but I don't know how to further analyze word forms and parts of speech. For instance, I would like to get a list of all proper nouns within the corpus. How can I do that in koRpus?
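One way to approach this: koRpus's taggedText() accessor returns the tagged tokens as a data frame with token, tag and wclass columns (compare the treetag output shown in other issues here), so the extraction is a plain data-frame filter. The sketch below uses a mock data frame in place of real treetag() output; the NP/NPS tags are the proper-noun tags of the English TreeTagger tagset, and the column values are invented for illustration:

```r
# Mock of the data.frame a call like taggedText(treetag(...)) would return
df <- data.frame(
  token  = c("Alice", "visited", "Berlin", "yesterday"),
  tag    = c("NP", "VBD", "NP", "RB"),
  wclass = c("name", "verb", "name", "adverb"),
  stringsAsFactors = FALSE
)

# Proper nouns: filter on the POS tags (NP/NPS in the English tagset)
proper.nouns <- df$token[df$tag %in% c("NP", "NPS")]
proper.nouns  # "Alice" "Berlin"
```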

TT.tokenizer not found

When running the following code:

set.kRp.env(TT.cmd="manual", TT.options=list(path="c://treetagger", preset="nl"), lang="nl")
res <- treetag(
  file=words,
  treetagger="kRp.env",
  format="obj",
  debug = TRUE
)

I get the following error:

Error in preset.definition[["preset"]](TT.cmd = TT.cmd, TT.bin = TT.bin, : 
object 'TT.tokenizer' not found

The words variable is a vector containing dutch words that are to be lemmatized. I get the same result on both the stable and the development versions. The above error is on Windows 10, perhaps it has something to do with Windows since the above code does work on Linux. When I run the same code with words set to an vector of English words an preset and lang set to "en", no error is given.

Can't really seem to find the problem, hopefully you can help. Thanks in advance!

character vector "measure" seems to be ignored by lex.div; Error in x[["end"]] : subscript out of bounds; Error in 1:lastValidIndex : result would be too long a vector

I want to calculate lexical diversity with koRpus' lex.div function for different texts. I am using the options "keep.tokens=TRUE, type.index=TRUE"; the texts are relatively short (10-150 words). From time to time I get error messages of this kind:

MTLDMA.char: Calculate MTLD-MA values
  |=====================================                                     |  50%Error in 1:lastValidIndex : result would be too long a vector
In addition: Warning messages:
1: Text is relatively short (<100 tokens), results are probably not reliable! 
2: MSTTR: Skipped calculation, segment size is 100, but the text has only 70 tokens! 
3: MATTR: Skipped calculation, window size is 100, but the text has only 70 tokens! 
4: In min(which(all.factorEnds > curr.token)) :
  no non-missing arguments to min; returning Inf

The affected file is here: uF04.txt. It was tagged with TreeTagger before feeding in tag results into lex.div.

While trying to avoid these errors, I ran the same analysis on failed calculations with different parameters, like these:

keep.tokens=TRUE, type.index=TRUE,measure =c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD"))

which results in a different error (even with additional window and segment sizes reduced to 20):

MTLD.char: Calculate MTLD values
  |==========================================================================| 100%
Error in x[["end"]] : subscript out of bounds
In addition: Warning messages:
1: Text is relatively short (<100 tokens), results are probably not reliable! 
2: MSTTR: Skipped calculation, segment size is 100, but the text has only 70 tokens! 
3: MATTR: Skipped calculation, window size is 100, but the text has only 70 tokens! 

Reducing the set of measures to a minimal set (even just "TTR") still gives the same error messages, as well as progress bars for all measures, including those that should not be included.

Unfortunately I can't trace the error, so I need your help. Thanks a lot in advance!

Getting "Awww, this should not happen" error even though the sys.tt.call runs successfully

I am using koRpus to read a list of text files in order to analyze those texts.
Running on Windows 10 with R 4.0.5 x64, using RStudio 1.4.1717 and koRpus v0.13-8.
When I run my R script I always get this error message:
sys.tt.call: perl C:\TreeTagger\cmd\utf8-tokenize.perl -a C:/TreeTagger/lib/german-abbreviations "S:\XX\Politik\tt\ID_15010101.txt" | C:\TreeTagger\bin\tree-tagger.exe C:\TreeTagger\lib\german.par -token -lemma -sgml -pt-with-lemma -quiet

Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a
command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary
files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does not fail but produces a table with proper results, please contact the author!

When I run the sys.tt.call manually, I get the following result, which looks good to me:

Die ART die
Parteien NN Partei
der ART die
Bundesrepublik NN Bundesrepublik
Deutschland NE Deutschland
[...]

This is my code:

install.packages("koRpus")
install.koRpus.lang("de")
library(koRpus.lang.de)

setwd("S:/XX/txt-Dateien_utf8/Politik/tt")

LCC.en <- read.corp.LCC("S:/XX/deu_news_2020_100K.tar.gz")

df.results <- data.frame("0", "0")
names(df.results) <- c("ID", "pmio.mean")
fileNames <- Sys.glob("*.txt")

for (fileName in fileNames) {
# text ID as a constant
k.1 <- fileName
tagged.text1 <- treetag(
  fileName,
  treetagger="manual",
  debug=TRUE,
  lang="de",
  TT.options=list(
    path="C:/TreeTagger/",
    preset="de"
  ),
  doc_id=fileName
)
# frequency analysis
freq.analysis.res <- freq.analysis(tagged.text1, corp.freq=LCC.en)
# statistical analysis of the data, extract the mean of pmio
as(freq.analysis.res, "kRp.text")
l <- corpusFreq(freq.analysis.res)
l.1 <- (l$frq.pmio$summary.all$summary)
l.1 <- unname(l.1)
print(k.1)
k <- print(l.1[4])
k.2 <- toString(k, width = NULL)
# add the results to the results data frame
de<-data.frame(k.1, k.2)
names(de)<-c("ID","pmio.mean")
df.results <- rbind(df.results, de)
}

print(df.results)

Do you have any idea what could be the problem?

Error with Russian - This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed

I'm trying to run an analysis of Russian literary texts. TreeTagger is installed and works from the command line:

cat ~/Projects/R_RussianNLP/Texts/Chechov.txt | cmd/tree-tagger-russian

When I run it through korPus however, I get the following error messages:

tagged.text <- treetag("./Texts/Chechov.txt", lang="ru", treetagger="manual", TT.options=list(path="~/Downloads/TreeTagger", preset="ru"))

Error: Invalid tag(s) found: P--nsaa, P--msga, P--fsla, Vmip3s-a-e, P--fsaa, P-3msdn, P--nsin, P--fsia, P--nsnn, P--msda, P-----r, Vmip3p-m-e, Afpmpnf, Afpmpgf, Vmip3s-m-e, P--fsna, P-3msnn, P-2-snn, Mc---d, P-2-sgn, P--nsna, P-2-sdn, P-----a, P--msia, P-2-san, P-3nsnn, Vmif3s-m-p, P--fsga, P-1-snn, P-2-pin, Ncmsnnp, Vmgp---a-e, P----dn, P-2-pdn, P---pna, Afpmpaf, Afcmsnf, P-1-sgn, Vmip1p-a-e, P-2-pan, Vmip2p-a-e, Vmip1s-a-e, P--msna, P-2-pnn, P-3-pdn, P---pda, P-1-pan, P---paa, Rc, P-1-san, P-1-pdn, P-2-pgn, P--msaa, P-3msin, P----an, P--nsgn, Vmps-smpfpg, Vmpp-smafeg, Vmpp-smmfeg, P---pga, Vmip3p-a-e, P-3fsnn, Vmpp-p-afen, P---pla, Afpmplf, Vmip1s-m-e, Vmip2s-a-e, P-3fsdn, Afpnsns, Vmgp---m-e, Vmps-sfpfpn, P--msla, Afpfsns, P-3fsin, Afpmsns, P--nsdn, P-3msan, P--nsln, P-3-pan, Vmgs---a-p, P----in, Afpmpif
This is probably due to a missing tag in kRp.POS.tags() and
needs to be fixed. It would be nice if you could forward the
above error dump as a bug report to the package maintaner!

Am I doing something wrong here?
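One way to narrow this down (a sketch, assuming a recent koRpus and the Russian language package) is to inspect the tag set the package actually defines for Russian with kRp.POS.tags(), and compare it against the tags TreeTagger produced:

```r
library(koRpus)
library(koRpus.lang.ru)  # assumption: Russian language support package is installed

# list all POS tags koRpus defines for Russian
ru.tags <- kRp.POS.tags("ru", list.tags = TRUE)

# check which of the reported tags are absent from the definition
reported <- c("Mc---d", "P--nsaa", "Vmip3s-a-e")  # a few tags from the error dump
setdiff(reported, ru.tags)
```

If the tags really are missing from the definition, that confirms the package's own diagnosis and the dump is worth forwarding as a bug report.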

TreeTagger does not work in either the koRpus or the textstem package

I installed TreeTagger, but I get an error message in both the textstem and koRpus packages for the code below.
Would you please help me?

#first example:
Error for koRpus package is as below:

Error in is_grouped_df(tbl) : "TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)
English translation:
Error in is_grouped_df(tbl) : no slot named "TT.res" (for an object of class "kRp.text")

#second example
library(textstem)
x <- c(
'the dirtier dog has eaten the pies',
'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag",
'There are skies of blue and red roses too!')
lemma_dictionary2 <- make_lemma_dictionary(x, engine = 'treetagger')

ERROR: Error in dplyr::filter(x@TT.res[c("token", "lemma")], !lemma %in% :
"TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)
(English: no slot named "TT.res" for an object of class "kRp.text")

Error in russian pos tags: Invalid tag(s) found: Mc---d

Hello, I was trying to analyse a russian text by using koRpus:
tagged.text <- treetag("sample_text.txt",treetagger="manual",lang="en",TT.options=list(path="~/bin/treetagger/",preset="ru"),doc_id="sample")
However, there was an error message when I ran the command:
"Invalid tag(s) found: Mc---d
This is probably due to a missing tag in kRp.POS.tags() and
needs to be fixed. It would be nice if you could forward the
above warning dump as a bug report to the package maintaner!"

I've read about a previous issue concerning the russian tags and downloaded the newest release as suggested. However, the problem still exists and I got the above message again.

Is there any solution? Thanks ahead.

TreeTagger working with a dataset in R

Hello! I have a dataset that contains tweets in Russian along with the IDs of the tweets. I want to lemmatize the text of the tweets and get a dataset with the tweet IDs and the lemmas of these tweets. The problem is that, if I understand it correctly, TreeTagger works only with files, not with R datasets. So I exported my dataset as a txt file containing only the text of the tweets, without their IDs, because nothing works if I export the dataset with both text and IDs. With the code below I got this result (in the picture). With this result, I cannot identify the ID of a tweet. What can I do?

# perform POS tagging
set.kRp.env(TT.cmd="C://TreeTagger/bin/tag-russian.bat", lang="ru", encoding = "UTF-8")
postagged <- treetag("C:/Users/Ольга/Documents/new16_20.txt", treetagger="manual",
  lang="ru",
  TT.options=list(
    path=file.path("C://TreeTagger"),
    preset="ru"))
data <- postagged@tokens
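In fact, treetag() can also work on in-memory text via format="obj", so the tweets never have to be written to a file and the IDs can be kept alongside the lemmas. A hedged sketch, assuming a data frame `tweets` with columns `id` and `text`:

```r
library(koRpus)
library(koRpus.lang.ru)  # assumption: Russian language package installed

# tag each tweet directly from the character vector (format="obj"),
# carrying the tweet ID through as the document ID
results <- lapply(seq_len(nrow(tweets)), function(i) {
  tagged <- treetag(
    tweets$text[i],
    format = "obj",
    treetagger = "manual",
    lang = "ru",
    TT.options = list(path = "C:/TreeTagger", preset = "ru"),
    doc_id = as.character(tweets$id[i])
  )
  data.frame(
    id = tweets$id[i],
    lemma = taggedText(tagged)$lemma,   # taggedText() accesses the tokens slot
    stringsAsFactors = FALSE
  )
})
lemmas <- do.call(rbind, results)
```

This keeps the tweet-to-lemma mapping intact without any intermediate txt files.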


POS tagging of a corpus object

First of all, thanks for developing this package!

I am currently working on POS tagging of text corpora in several languages. For German and English I use a combination of spacyr and quanteda.

For additional languages, I would like to use koRpus and TreeTagger. Is there a way to perform POS tagging directly on the text field of a corpus object? Or do you have a script that extracts the text field of a corpus for each document and applies POS tagging in a loop?

Thanks a lot for your help.
Stefan
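There is no built-in bridge between quanteda and koRpus, but the loop is short. A sketch under the assumption that `corp` is a quanteda corpus and the documents are German:

```r
library(quanteda)
library(koRpus)
library(koRpus.lang.de)  # assumption: German language package installed

# one text per document, names are the document IDs
texts <- as.character(corp)

# tag each document in memory via format="obj", keeping its ID
tagged.list <- lapply(names(texts), function(doc) {
  treetag(
    texts[[doc]],
    format = "obj",
    treetagger = "manual",
    lang = "de",
    TT.options = list(path = "C:/TreeTagger", preset = "de"),
    doc_id = doc
  )
})
```

The result is a list of kRp.text objects, one per corpus document, from which tokens and tags can be pulled with taggedText().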

Moving from tm object to koRpus object and vice versa

I have a problem moving from a tm object to a koRpus object. I have to normalize a corpus with tm tools, lemmatize the results with koRpus, and return to tm to categorize the results. To do this, I have to transform the tm object into an R data frame, which I then transform into an Excel file, then into a txt file, and finally into a koRpus object. This is the code:

#from VCORPUS to DATAFRAME 
dataframeD610P<-data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)

#from DATAFRAME to XLSX 
#library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")

#open with excel 
#save in csv (UTF-8)

#import in KORPUS and lemmatization with KORPUS/TREETAGGER 

tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T)) 

Then I need to do it all backwards to get back to tm. This is the code:

#from KORPUS to TXT 
write.table(tagged.results@TT.res$lemma, ".\\mycorpusLEMMATIZED.txt")

#open with a text editor and formatting of the text

#from TXT to R
Lemma1.POS<- readLines(".\\mycorpusLEMMATIZEDfrasi.txt", encoding = "UTF-8")

#from R object to DATAFRAME
Lemma2.POS<-as.data.frame(Lemma1.POS, encoding = "UTF-8")

#from DATAFRAME to CORPUS
CorpusPOSlemmaFINAL = Corpus(VectorSource(Lemma2.POS$Lemma1.POS))

Is there a more elegant solution to do this without leaving R? I’d really appreciate any help or feedback.
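The Excel/txt round trip can probably be skipped entirely: treetag() accepts a character vector with format="obj", and the lemmas can go straight back into a tm corpus. A sketch, assuming `Corpus.TotPOS` is the normalized tm VCorpus from above:

```r
library(tm)
library(koRpus)
library(koRpus.lang.it)  # assumption: Italian language package installed

# extract the plain text of each document from the tm corpus
texts <- sapply(Corpus.TotPOS, function(d) paste(content(d), collapse = " "))

# lemmatize each document in memory, no intermediate files
lemmatized <- sapply(texts, function(txt) {
  tagged <- treetag(txt, format = "obj", treetagger = "manual", lang = "it",
                    sentc.end = c(".", "!", "?", ";", ":"),
                    TT.options = list(path = "C:/TreeTagger", preset = "it-utf8",
                                      no.unknown = TRUE))
  paste(taggedText(tagged)$lemma, collapse = " ")
})

# back to a tm corpus without leaving R
CorpusPOSlemmaFINAL <- Corpus(VectorSource(lemmatized))
```

This keeps the whole pipeline inside one R session and avoids the manual Excel and text-editor steps.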

readability() returns error message

Hi there,

first, thanks for this great package! I have been using it for more than a year for my research and it is very useful.

However, since I updated to the latest version, I can't get readability() to run anymore.

I understand that some data structure was modified. For example, treetag()'s output used to be named tagged.txt@TT.res, and it has been changed to tagged.txt@tokens. So maybe I am simply not calling the kRp.text object properly when using readability()...

Would you have some insight??

Below is a simple reproducible version of my code

Thank you very much for your help!

  • Aurelie

# installs and loads Fr package
install.koRpus.lang("fr")
library("koRpus.lang.fr")

# sets environment for treetagger - FRENCH VERSION
set.kRp.env(TT.cmd = "cmd/tree-tagger-french", lang="fr",
  format = "obj", encoding = "Latin1")

# runs TreeTagger on a short text
tagged.results <- treetag("La salive est un liquide produit par des glandes spéciales situées à plusieurs endroits dans la bouche. Elle permet d'enrober les aliments d'eau afin de permettre un passage facile dans l'œsophage et de faciliter le travail de digestion dans l'estomac. La salive contient aussi une enzyme qui permet de commencer la digestion de l'amidon des plantes. Mais il faut que l'aliment qui contient l'amidon soit cuit car il n'est pas possible de digérer l'amidon cru. L'amidon est une forme de sucre trop gros pour être absorbé directement dans les intestins. Il faut donc séparer un à un ses composants ; c'est le travail de cette enzyme appelée amylase. En mastiquant du pain, qui contient de l'amidon, pendant quelques instants, on s'aperçoit que le goût devient sucré : l'amylase a commencé son travail et des molécules de sucre sont libérées.",
  format = "obj",
  apply.sentc.end = TRUE, sentc.end = ".", add.desc = TRUE)

# runs readability indexes estimation
readability(tagged.results)

Here is the ERROR MESSAGE that I get:
Error in if (raw >= 90) { : argument is of length zero
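"argument is of length zero" in the Flesch branch usually means the value being compared (`raw`) is empty, i.e. the descriptive or hyphenation data readability() needs did not get computed. One thing worth trying (a sketch of the documented workflow, assuming the French hyphenation patterns are installed) is supplying the hyphenation data explicitly:

```r
library(koRpus)
library(sylly.fr)  # assumption: French hyphenation patterns installed

# syllable counts are required by Flesch and several other formulas
hyph <- hyphen(tagged.results)
readability(tagged.results, hyphen = hyph)
```

If hyphen() itself fails or returns an empty result, that would point to the hyphenation patterns rather than readability() as the culprit.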

Invalid tag(s)

When I try to run the following code on a (Dutch) character object I get the error below:

library("koRpus")
set.kRp.env(TT.cmd="c://treetagger/bin/tag-dutch.bat", lang="en", preset="en")
output <- treetag(text, format="obj", TT.options=list(path="c://TreeTagger", preset="en"))


Error: Invalid tag(s) found: 300, 254, 400, 154, 219, 210, PUNCT, 000, 600, 333, 700, 256, 303, 500, 450, 6105, 441, 454, 720, 370, 010, 247, 330, 001, 251, 410, 252
This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed.
It would be nice if you could forward the above error dump as a bug report to the package maintaner!


I tried different solutions but none of them got me closer to a solution. I don't know if this is a problem related to my own configuration or the package code, but since the error message asked to forward the error dump as a bug report, I'm putting it here.
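The numeric "tags" look like output from the Dutch parameter file being validated against the English tag set, since the call combines a Dutch tagger script with lang="en". A sketch of one possible fix, under the assumption that the Dutch language package koRpus.lang.nl is available (from the package author's repository rather than CRAN) and provides an "nl" preset:

```r
library(koRpus)
library(koRpus.lang.nl)  # assumption: installed from the project repository

# match lang and preset to the Dutch parameter files
output <- treetag(
  text,                    # the Dutch character object from above
  format = "obj",
  treetagger = "manual",
  lang = "nl",
  TT.options = list(path = "c:/TreeTagger", preset = "nl")
)
```

With lang and preset agreeing on the language, the returned tags should be checked against the Dutch tag set instead of the English one.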

Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, : 'data' must be of a vector type, was 'NULL'

Does this debug=TRUE help you to understand what is the cause of the error execution?

> tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
+                           TT.tknz=FALSE , lang="en",
+                           debug = TRUE,
+                           TT.options=list(path="TreeTagger", preset="en"))
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
 
        TT.tokenizer:  koRpus::tokenize() 
				tempfile: C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef94305e2e24.txt 
        file:  C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tempTextFromObjectf942ee2415d.txt 
        TT.lookup.command:   
        TT.pre.tagger:   
        TT.tagger:  TreeTagger/bin/tree-tagger.exe 
        TT.opts:  -token -lemma -sgml -pt-with-lemma -quiet 
        TT.params:  TreeTagger/lib/english-utf8.par 
        TT.filter.command:  | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' 

        sys.tt.call:  type  C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef94305e2e24.txt |   TreeTagger/bin/tree-tagger.exe TreeTagger/lib/english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' 

Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE,  : 
  'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\Windows\system32\cmd.exe /c type  C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef94305e2e24.txt |   TreeTagger\bin\tree-tagger.exe TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 9 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250    LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=Polish_Poland.1250    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] koRpus_0.10-2       data.table_1.9.6    gradientr_0.0.1     RWeka_0.4-33        tm_0.7-1            NLP_0.1-10         
 [7] stringi_1.1.5       NbClust_3.0         cluster_2.0.5       factoextra_1.0.4    foreach_1.4.3       openxlsx_3.0.0     
[13] networkD3_0.3       VennDiagram_1.6.17  futile.logger_1.4.3 Boruta_5.2.0        ranger_0.6.0        scales_0.4.1       
[19] ggmosaic_0.1.2      productplots_0.1.1  corrplot_0.77       stringr_1.2.0       magrittr_1.5        dplyr_0.5.0        
[25] purrr_0.2.2         readr_1.0.0         tidyr_0.6.1         tibble_1.2          tidyverse_1.0.0     readxl_0.1.1       
[31] haven_1.0.0         plyr_1.8.4          tables_0.8          Hmisc_4.0-2         ggplot2_2.2.1       Formula_1.2-1      
[37] survival_2.40-1     lattice_0.20-34    

loaded via a namespace (and not attached):
 [1] devtools_1.12.0      RColorBrewer_1.1-2   httr_1.2.1           tools_3.3.1          backports_1.0.4      R6_2.2.0            
 [7] rpart_4.1-10         DBI_0.5-1            lazyeval_0.2.0       colorspace_1.3-1     nnet_7.3-12          withr_1.0.2         
[13] sp_1.2-4             gridExtra_2.2.1      compiler_3.3.1       chron_2.3-47         htmlTable_1.9        flashClust_1.01-2   
[19] plotly_4.5.6         labeling_0.3         slam_0.1-40          checkmate_1.8.2      digest_0.6.10        foreign_0.8-67      
[25] ca_0.70              base64enc_0.1-3      jpeg_0.1-8           htmltools_0.3.5      maps_3.1.1           RWekajars_3.9.1-3   
[31] FactoMineR_1.35      htmlwidgets_0.8      jsonlite_1.1         acepack_1.4.1        wordcloud_2.5        leaps_3.0           
[37] geosphere_1.5-5      Matrix_1.2-7.1       Rcpp_0.12.8          munsell_0.4.3        proto_1.0.0          scatterplot3d_0.3-39
[43] MASS_7.3-45          parallel_3.3.1       ggrepel_0.6.5        splines_3.3.1        mapproj_1.2-4        knitr_1.15          
[49] igraph_1.0.1         rjson_0.2.15         reshape2_1.4.2       codetools_0.2-15     futile.options_1.0.0 kohonen_3.0.2       
[55] latticeExtra_0.6-28  lambda.r_1.1.9       spam_1.4-0           png_0.1-7            RgoogleMaps_1.4.1    gtable_0.2.0        
[61] assertthat_0.1       viridisLite_0.1.3    rJava_0.9-8          iterators_1.0.8      memoise_1.0.0        fields_8.10         
[67] ggmap_2.6.1 
