quanteda / quanteda Goto Github PK

View Code? Open in Web Editor NEW

826.0 826.0 186.0 759.16 MB

An R package for the Quantitative Analysis of Textual Data

Home Page: https://quanteda.io

License: GNU General Public License v3.0

R 89.49% C++ 9.68% CSS 0.12% JavaScript 0.04% TeX 0.40% C 0.01% Shell 0.01% M4 0.25%

corpus natural-language-processing quanteda r text-analytics

quanteda's Introduction

About

quanteda is an R package for managing and analyzing text, created and maintained by Kenneth Benoit and Kohei Watanabe. Its creation was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS and its continued development is supported by the Quanteda Initiative CIC.

For more details, see https://quanteda.io.

quanteda version 4

The quanteda 4.0 is a major release that improves functionality and performance and further improves function consistency by removing previously deprecated functions. It also includes significant new tokeniser rules that make the default tokeniser smarter than ever, with new Unicode and ICU-compliant rules enabling it to work more consistently with even more languages.

We describe more fully these significant changes in:

an article about the new external pointer tokens objects;
an article showing performance benchmarks for the new external pointer tokens objects, as well as some of the tokeniser improvements in v4; and
the changelog for v4 a full listing of the changes, improvements, and deprecations in v4.

The quanteda family of packages

We completed the trend of splitting quanteda into modular packages with the release of v3. The quanteda family of packages includes the following:

quanteda: contains all of the core natural language processing and textual data management functions
quanteda.textmodels: contains all of the text models and supporting functions, namely the textmodel_*() functions. This was split from the main package with the v2 release
quanteda.textstats: statistics for textual data, namely the textstat_*() functions, split with the v3 release
quanteda.textplots: plots for textual data, namely the textplot_*() functions, split with the v3 release

We are working on additional package releases, available in the meantime from our GitHub pages:

quanteda.sentiment: Functions and lexicons for sentiment analysis using dictionaries
quanteda.tidy: Extensions for manipulating document variables in core quanteda objects using your favourite tidyverse functions

and more to come.

How To…

Install (binaries) from CRAN

The normal way from CRAN, using your R GUI or

install.packages("quanteda")

(New for quanteda v4.0) For Linux users: Because all installations on Linux are compiled, Linux users will first need to install the Intel oneAPI Threading Building Blocks for parallel computing for installation to work.

To install TBB on Linux:

# Fedora, CentOS, RHEL
sudo yum install tbb-devel

# Debian and Ubuntu
sudo apt install libtbb-dev

Windows or macOS users do not have to install TBB or any other packages to enable parallel computing when installing quanteda from CRAN.

Compile from source (macOS and Windows)

Because this compiles some C++ and Fortran source code, you will need to have installed the appropriate compilers to build the development version.

You will also need to install TBB:

macOS:

First, you will need to install XCode command line tools.

xcode-select --install

Then install the TBB libraries and the pkg-config utility: (after installing Homebrew):

brew install tbb pkg-config

Finally, you will need to install gfortran.

Windows:

Install RTools, which includes the TBB libraries.

Use quanteda

See the quick start guide to learn how to use quanteda.

Get Help

Read out documentation at https://quanteda.io.
Check out the quanteda cheatsheet.
Submit a question on the quanteda channel on StackOverflow.
See our tutorial site.

Cite the package

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software 3(30), 774. https://doi.org/10.21105/joss.00774.

For a BibTeX entry, use the output from citation(package = "quanteda").

Leave Feedback

If you like quanteda, please consider leaving feedback or a testimonial here.

Contribute

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

Fork the source code, modify, and issue a pull request through the project GitHub page. See our Contributor Code of Conduct and the all-important quanteda Style Guide.
Issues, bug reports, and wish lists: File a GitHub issue.
Contact the maintainer by email.

quanteda's People

Contributors

Stargazers

Watchers

Forkers

libardo1 behrica pablobarbera saldaihani longlingmichael parthasen valeriastay pjsio aileenmedina oskilab ky822 pmyteh matthewjdenny chmue mrclaude tpaskhalis schinria wfuu olgabesp neviera lgreski aknn xinchoubiology guanlongtianzi yewchoong adamobeng kadriumay bekterra imanojkumar qsq-dm hofaichan wizardshowing conjugateprior viboy abi7888 barbaramaseda bcgozun sdorius iabrady mhossain16 haiyanlw jay134679 fc1315 tbs08 sinemetu1 stas-malavin iammartinsimon xiaoyuezhang christophergandrud henri-lo waterbear1 anavaldi patperry nickhkor petrichorcode wjcdenis plablo09 pariyat haowang666 strategist922 leeper deyugo liammonsell stefan-mueller pokhrelj jonifranc peidyen mpadge nanaakwasiabayieboateng t0tem snsoroka luigic72 leighd2008 datascience2017 dafululiu thrinu rui-lian ben-aaron188 axxonback amatsuo pszewach cschwem2er talkstats trinker lizl90 lindbrook skoluguri30 kylehaynes nsm120 rajesh16702 nicmer sanjahajdinjak jiongweilua adash92 arianbarakat redwa praneet460 vviers ncs1 vriezer

quanteda's Issues

more options for clean()

Case, punctuation, and digits should be options with defaults

wordstem() should work with corpus data type as well

wordstem() works only with array of words. It should work with corpus too. if you need to stem words before doing dfm (that supports only stemming of the second word) then you have to transform data from corpus/text to array and then back to corpus.

topicmodels packages in quanteda

Just fyi: when installing the master distribution, you seem to have a dependency/suggestion for the 'topicsmodels' package. This won't quite install fully on my Yosemite 10.10.1

Warning message:
packages ‘quantedaData’, ‘topicmodels’ are not available (for R version 3.1.2)

Separate binary install fails

> install.packages('topicmodels')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)

   package ‘topicmodels’ is available as a source package but not as a binary

Warning in install.packages :
  package ‘topicmodels’ is not available (for R version 3.1.2)

and also from source

> install.packages('topicmodels', type='source')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘modeltools’

trying URL 'http://cran.rstudio.com/src/contrib/modeltools_0.2-21.tar.gz'
Content type 'application/x-gzip' length 14794 bytes (14 Kb)
opened URL
==================================================
downloaded 14 Kb

trying URL 'http://cran.rstudio.com/src/contrib/topicmodels_0.2-1.tar.gz'
Content type 'application/x-gzip' length 847889 bytes (828 Kb)
opened URL
==================================================
downloaded 828 Kb

* installing *source* package ‘modeltools’ ...
** package ‘modeltools’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
Creating a generic function for ‘na.fail’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.pass’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.omit’ from package ‘stats’ in package ‘modeltools’
Creating a generic function from function ‘MEapply’ in package ‘modeltools’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (modeltools)
* installing *source* package ‘topicmodels’ ...
** package ‘topicmodels’ successfully unpacked and MD5 sums checked
** libs
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include    -fPIC  -Wall -mtune=core2 -g -O2  -c cokus.c -o cokus.o
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include    -fPIC  -Wall -mtune=core2 -g -O2  -c common.c -o common.o
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include    -fPIC  -Wall -mtune=core2 -g -O2  -c ctm.c -o ctm.o
ctm.c:29:10: fatal error: 'gsl/gsl_rng.h' file not found
#include <gsl/gsl_rng.h>
         ^
1 error generated.
make: *** [ctm.o] Error 1
ERROR: compilation failed for package ‘topicmodels’
* removing ‘/Users/will/Library/R/3.1/library/topicmodels’
Warning in install.packages :
  installation of package ‘topicmodels’ had non-zero exit status

The downloaded source packages are in
    ‘/private/var/folders/gl/ds8cxcyj07x0f553zt__04mm0000gn/T/RtmplAdulj/downloaded_packages’

That's their problem not yours, of course, but figured you might want to know.

Targets for Refactoring

Accessor functions:

texts()
words()
data() - (return only the attribs or texts + attribs?)
tokenizedTexts() - I suggest that when we run tokenize() we should store the result in the corpus object and simply retrieve the tokenized texts afterwards

Generic Functions:
clean() corpus, text, (dfm?)
tokenize() corpus, text
stopwords() corpus, text, dfm
sample() corpus, dfm

Error getting a directory of texts in to quanteda corpus

I've tried to get a directory of texts in to a quanteda corpus with some issues. First, I make a VCorpus using the DirSource function in tm. Second, I try to make the object a quanteda corpus. However, I get the error "no applicable method for 'corpus' applied to an object of class "list."" But it's not a list; I've checked the files and everything seems sound.

library(quanteda)
library(tm)
Loading required package: NLP

Attaching package: ‘tm’

The following objects are masked from ‘package:quanteda’:

as.DocumentTermMatrix, stopwords

ds <- VCorpus(DirSource("~/Desktop/Speeches/House/2000/"))

make it a quanteda object

txts <- corpus(ds)
Error in UseMethod("corpus") :
no applicable method for 'corpus' applied to an object of class "list"
class(ds)
[1] "VCorpus" "Corpus"

tokenize(x, what = "sentence") not working.

Problem with tokenize()

Tokenize(x, what = "sentence") is not working when the next sentence starts with lowercase.

Test with Problem

# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.", 
           what = "sentence", simplify = TRUE)
[1] "Hello!"                                          
[2] "This is sentence one. really?" # "one." does not force a break; "really?" looks OK
[3] "and this is two. and this is three.are you sure?" # "two. " does not force a break
[4] "yes."

Test OK

# They ALL look OK
# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.", 
           what = "sentence", simplify = TRUE)
[1] "Hello!"                "This is sentence one." "Really?"               "And this is two."
[5] "And this is three."    "Are you sure?"         "Yes."

Test with segmentSentence()

I did run the internal function segmentSentence(), and it looks OK in all cases.

> segmentSentence("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.")
[1] "Hello!"                "This is sentence one." "really?"               "and this is two."     
[5] "and this is three?"    "are you sure."         "yes!"

> segmentSentence("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.")
[1] "Hello!"                "This is sentence one." "Really?"               "And this is two."     
[5] "And this is three?"    "Are you sure."         "Yes!"

sessionInfo()

Note: I did not upgrade to quanteda_0.8.0-4 as it does not have a binary version, and I did not want to compile locally.

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US/en_US/en_US/C/en_US/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_0.8.0-3

loaded via a namespace (and not attached):
 [1] magrittr_1.5     plyr_1.8.3       Matrix_1.2-2     parallel_3.2.1   tools_3.2.1     
 [6] reshape2_1.4.1   Rcpp_0.12.0      stringi_0.5-5    grid_3.2.1       data.table_1.9.4
[11] stringr_1.0.0    chron_2.3-47     lattice_0.20-33  ca_0.58

Tutorial: How to get texts into a quanteda corpus

A guide to all the ways of importing texts into quanteda - getTextDir, getTextGui, corpusFromHeaders, corpusFromFilenames, twitter, etc..

attributes of documents can be specified in document headers

JSON document headers containing attribute-value pairs will be detected and incorporated into corpus object when texts are added

Outline/draft the main vignette/tutorial

How do I turn a dfm into a sparseMatrix?

dfm already looks a lot like a sparse Matrix:

> str(dfm(inaugTexts, verbose=FALSE))
Formal class 'dfmSparse' [package "quanteda"] with 9 slots
  ..@ settings :List of 1
  .. ..$ : NULL
  ..@ weighting: chr "frequency"
  ..@ smooth   : num 0
  ..@ Dim      : int [1:2] 57 9214
  ..@ Dimnames :List of 2
  .. ..$ docs    : chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
  .. ..$ features: chr [1:9214] "14th" "15th" "18th" "19th" ...
  ..@ i        : int [1:43719] 0 27 30 52 52 45 46 8 52 51 ...
  ..@ p        : int [1:9215] 0 1 3 4 5 7 9 11 12 13 ...
  ..@ x        : num [1:43719] 1 1 1 1 2 1 2 1 3 1 ...
  ..@ factors  : list()

I want to be able to look at the matrix elements and pass it to modeling functions that take Matrix inputs.

Tutorial: scaling documents with quanteda

How to scale documents in quanteda using correspondence analysis (ca) and wordfish.

how to stem both words of bigrams

dfm with bigrams=TRUE and stem=TRUE creates bigrams, where only the last word of bigram is stemmed. How to stem both words of the bigram?

dfm(c("banking industry"), clean=FALSE, stem=TRUE, bigrams=TRUE, verbose=FALSE)
Document-feature matrix of: 1 document, 3 features.
1 x 3 sparse Matrix of class "dfmSparse"
features
docs bank banking_industri industri
text1 1

I need "bank_industri" instead of "banking_industri"

predict.dfm

It would be really useful to be able project new texts into the same feature space as an existing dfm. This would be particularly useful if you're using texts as inputs to a predictive model.

dfm2ldaformat example broken

use readr instead of read.csv for text import functions

bigrams() does not work correctly with text that includes only one word

bigrams("banking", include.unigrams = FALSE)
[[1]]
[1] "banking_" "_banking"

Paul to redo the stop words implementation

I created a named list object data/stopwords.Rdata, taken from tm (which took it from other sources including SnoBall). Please write a function that replicates tm's stop words(kind="catalan") etc. to retrieve the stopwords based on the named list element (with error checking of course). We can then make English the default and rewrite the dfm() argument accordingly. Note that this still allows us to send a character vector to dfm for home-grown stop word lists (as Kohei likes).

POS feature selection

Add the ability to extract parts of speech (using OpenNLP) as features, as an option to dfm. This means we should think of modularising the objects that define dfm "features". Currently we have:
word tokens
stemmed word tokens
dictionary entries

Adding parts of speech would mean either selecting ONLY specific POS types, or extracting POS counts as features.

Implement indexing for a corpus

Including dfm and other methods that use dfms

Documentation/tutorials in Rmarkdown

I've begun a document to experiment with rmarkdown for extensive documentation/tutorials

Tutorial: using the twitter API with R

A tutorial on using twitter APIs with R, from getting access tokens to working with tweet text in quanteda. Draft deadline: 15th May

Update wordfish C++ methods

To print warning and/or ignore extreme outliers;
To return maximum likelihood standard errors;
To document starting values better.

ignoredFeatures=stopwords("english") not working

Hi,

I'm using quanteda to generate ngrams for word prediction. Try the following:

test <- "this is just a test text i'm using"
my_dtm <- quanteda::dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))

"i'm" should have been removed, but the dtm still contains it.

quanteda version is: 0.8.0-3

Error in twitterTerms()

Not coercing twitter output to corpus

docvars <- and docnames <- do not take indexes

Partial index assignment causes recycling, instead of assigning just to the indexed elements. Not what the user would be expecting.

Think about corpus structure

Currently the corpus object is a list of class "corpus" that contains:
attribs: a data frame where the first column is "texts", and the remaining columns are variables ("attributes") whose values are specific to each text.
attributes labels: an optional user-supplied list of descriptions of each attribute.
meta-data: character vector consisting of
source (user-supplied or default is full directory path and system)
creation date (automatic)
notes (default is NULL, can be user-supplied)

We could think about modifying this to include:

setting the Encoding(attribs$texts) to indicate the text encoding
indexing the texts
including additional objects for replication, such as dictionaries or dfm's

clean example not working on Ubuntu 14

KWIC with keywords of length>1 (multiple keywords)

It would nice to be nice to be able to use your KWIC function with keywords of length longer than 1. Maybe this is already possible, but I tried using a regex workaround (i.e. kwic(df$text,"(united)\s(states),window = 3) to no avail.

remove swiss docs and clean up check() warnings

Naming conventions for RData files

These are all really inconsistently named. Changing them now.

tokenize() running slow?

fullTest.R taking a long time to run on amicus texts and the culprit is tokenize(). I will look into it, commenting it out of fullTest for now.

Translator function for dfm object to tm's documentTermMatrix

tm's TermDocumentMatrix and DocumentTermMatrix class objects are for sparse matrixes. Because these are used by many other libraries, e.g. lda and topicmodels, so we need to be able to translate our dfm into those formats.

Maybe show top corner of the dfm when printing

> print(dfm(inaugTexts, verbose=FALSE))
Document-feature matrix of: 57 documents, 9214 features.

It might be nice to also see the top "corner" of the matrix (the first 5 rows and 5 columns)

dfm should support include.unigrams = FALSE too

bigrams(.., include.unigrams = FALSE,..) is possible. the same feature should be supported by dfm as well.

one-pass collocation detection

for size=2, 3 as the same function
to remove bigrams that are parts of trigrams, trigrams that contain just a bigram

ignoredFreatu

Implement word2vec and related methods for corpus objects

Will need to index the tokens/sentences first

creating corpus from tm VCorpus fails

class(corp)
[1] "VCorpus" "Corpus"
corp2 <- corpus(corp)
Error in [.data.frame(metad, , c(1, 3:15)) : undefined columns selected

Reduce use of austin functions

This would consider defining a dfm object as an austin wfm, to make it possible to use the austin word/doc/trim etc functions.

dfm crashes on blank documents

if a corpus contains a document with only whitespace characters, dfm crashes.

clean(x) returns as "character" when x has class "corpus"

The documentation states that

"If x is a corpus, ‘clean’ returns the corpus containing the cleaned texts".

But

class(clean(inaugCorpus)) gives character, instead of corpus.

Document data objects

and clean them up

dfm error reported by Alex

dfm() crashes in the (admittedly very unlikely) case that all
documents have the same number of words:

This doesn't work:

dfm(createCorpus(c("a b","c d")))
Creating dfm: ...Error in rep.default(textnames,
sapply(tokenizedTexts, length)) :
invalid 'times' argument

This works:

dfm(createCorpus(c("a b","c d e")))
Creating dfm: ... done.
a b c d e
text1 1 1 0 0 0
text2 0 0 1 1 1

error in extracting local zip file

unzip("irishbudgets2010.zip")
Warning message:
In unzip("irishbudgets2010.zip") : error 1 in extracting from zip file

Need better options for user to import .docx, .pdf, csv, etc.

We should have either functions for importing other filetypes, or a userguide explaining good methods for getting these other filetypes into plaintext.

Return to the issue of opening a QDAMiner library

Move corpus object class to S4

Work on collocates function

a) make a user-friendly wrapper function collocates() that mimics the existing collocation()
b) fully document
c) implement methods for text vectors and for a corpus object
d) make sure all examples work

Issues in ngrams.R

Ken's notes
should use tokeniser instead of just splitting on space delimiters
help function needs arguments explained, and returns
we should generalize this to ngrams(text, n=2, window=1, unordered=FALSE)
-- meaning we need a new, more general version to supercede this
-- Note the suggested defaults above - default window should be 1, not 2
-- we probably never really want to make unordered=TRUE since these are not
naturally occurring pairs, just like 'savings bank' is not the same as
'bank savings'
-- possibly define a version for a corpus
-- right now it treats a text vector as a single text, which is fine,
but we will want to note this in the help file. To apply to a vector
of texts we will need to apply() it or define a corpus method.