quanteda: quantitative analysis of textual data

About

quanteda is an R package for managing and analyzing text, created and maintained by Kenneth Benoit and Kohei Watanabe. Its creation was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS and its continued development is supported by the Quanteda Initiative CIC.

For more details, see https://quanteda.io.

quanteda version 4

quanteda 4.0 is a major release that improves functionality and performance, and further improves function consistency by removing previously deprecated functions. It also includes significant new tokeniser rules that make the default tokeniser smarter than ever, with new Unicode- and ICU-compliant rules enabling it to work more consistently with even more languages.

We describe these significant changes more fully in:

The quanteda family of packages

We completed the trend of splitting quanteda into modular packages with the release of v3. The quanteda family of packages includes the following:

  • quanteda: contains all of the core natural language processing and textual data management functions
  • quanteda.textmodels: contains all of the text models and supporting functions, namely the textmodel_*() functions. This was split from the main package with the v2 release
  • quanteda.textstats: statistics for textual data, namely the textstat_*() functions, split with the v3 release
  • quanteda.textplots: plots for textual data, namely the textplot_*() functions, split with the v3 release

We are working on additional package releases, available in the meantime from our GitHub pages:

  • quanteda.sentiment: Functions and lexicons for sentiment analysis using dictionaries
  • quanteda.tidy: Extensions for manipulating document variables in core quanteda objects using your favourite tidyverse functions

and more to come.

How To…

Install (binaries) from CRAN

Install the released version from CRAN in the usual way, using your R GUI or:

install.packages("quanteda") 

(New for quanteda v4.0) Because all package installations on Linux are compiled from source, Linux users will first need to install the Intel oneAPI Threading Building Blocks (TBB) library for parallel computing before installation will work.

To install TBB on Linux:

# Fedora, CentOS, RHEL
sudo yum install tbb-devel

# Debian and Ubuntu
sudo apt install libtbb-dev

Windows or macOS users do not have to install TBB or any other packages to enable parallel computing when installing quanteda from CRAN.
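Once installed, the degree of parallelism can be inspected or adjusted from within R. This is a minimal sketch assuming the `threads` option available in recent quanteda versions:

```r
library(quanteda)

# inspect how many threads quanteda will use for parallel operations
quanteda_options("threads")

# cap the number of threads, e.g. on a shared machine
quanteda_options(threads = 2)
```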

Compile from source (macOS and Windows)

Because this compiles some C++ and Fortran source code, you will need to have installed the appropriate compilers to build the development version.

You will also need to install TBB:

macOS:

First, you will need to install XCode command line tools.

xcode-select --install

Then, after installing Homebrew, install the TBB libraries and the pkg-config utility:

brew install tbb pkg-config

Finally, you will need to install gfortran.

Windows:

Install RTools, which includes the TBB libraries.

Use quanteda

See the quick start guide to learn how to use quanteda.
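As a minimal sketch of the core workflow, using the built-in data_corpus_inaugural object and the tokens/dfm pipeline of quanteda v3 and later:

```r
library(quanteda)

# build a corpus from the built-in US inaugural address texts
corp <- corpus(data_corpus_inaugural)

# tokenise, then construct a document-feature matrix
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)

# one row per document, one column per feature
dim(dfmat)
```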

Get Help

Cite the package

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software 3(30), 774. https://doi.org/10.21105/joss.00774.

For a BibTeX entry, use the output from citation(package = "quanteda").

Leave Feedback

If you like quanteda, please consider leaving feedback or a testimonial here.

Contribute

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

quanteda's People

Contributors

adamobeng, amatsuo, chainsawriot, christophergandrud, conjugateprior, etienne-s, haiyanlw, hofaichan, jiongweilua, jtatria, kbenoit, khughitt, koheiw, leeper, lindbrook, michaelchirico, mkearney, mmzmm, mpadge, nicmer, odelmarcelle, pablobarbera, pnulty, reuning, robitalec, rsbivand, stas-malavin, stefan-mueller, tpaskhalis, trinker


quanteda's Issues

wordstem() should work with corpus data type as well

wordstem() works only with a vector of words. It should work with a corpus too. If you need to stem words before creating a dfm (which stems only the second word of a bigram), you have to convert the data from a corpus/text to a vector and then back to a corpus.

topicmodels packages in quanteda

Just FYI: when installing the master distribution, you seem to have a dependency/suggestion on the 'topicmodels' package. This won't quite install fully on my OS X Yosemite 10.10.1.

Warning message:
packages ‘quantedaData’, ‘topicmodels’ are not available (for R version 3.1.2) 

Separate binary install fails

> install.packages('topicmodels')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)

   package ‘topicmodels’ is available as a source package but not as a binary

Warning in install.packages :
  package ‘topicmodels’ is not available (for R version 3.1.2)

and also from source

> install.packages('topicmodels', type='source')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘modeltools’

trying URL 'http://cran.rstudio.com/src/contrib/modeltools_0.2-21.tar.gz'
Content type 'application/x-gzip' length 14794 bytes (14 Kb)
opened URL
==================================================
downloaded 14 Kb

trying URL 'http://cran.rstudio.com/src/contrib/topicmodels_0.2-1.tar.gz'
Content type 'application/x-gzip' length 847889 bytes (828 Kb)
opened URL
==================================================
downloaded 828 Kb

* installing *source* package ‘modeltools’ ...
** package ‘modeltools’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
Creating a generic function for ‘na.fail’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.pass’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.omit’ from package ‘stats’ in package ‘modeltools’
Creating a generic function from function ‘MEapply’ in package ‘modeltools’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (modeltools)
* installing *source* package ‘topicmodels’ ...
** package ‘topicmodels’ successfully unpacked and MD5 sums checked
** libs
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include    -fPIC  -Wall -mtune=core2 -g -O2  -c cokus.c -o cokus.o
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include    -fPIC  -Wall -mtune=core2 -g -O2  -c common.c -o common.o
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include    -fPIC  -Wall -mtune=core2 -g -O2  -c ctm.c -o ctm.o
ctm.c:29:10: fatal error: 'gsl/gsl_rng.h' file not found
#include <gsl/gsl_rng.h>
         ^
1 error generated.
make: *** [ctm.o] Error 1
ERROR: compilation failed for package ‘topicmodels’
* removing ‘/Users/will/Library/R/3.1/library/topicmodels’
Warning in install.packages :
  installation of package ‘topicmodels’ had non-zero exit status

The downloaded source packages are in
    ‘/private/var/folders/gl/ds8cxcyj07x0f553zt__04mm0000gn/T/RtmplAdulj/downloaded_packages’

That's their problem not yours, of course, but figured you might want to know.

Targets for Refactoring

Accessor functions:

  • texts()
  • words()
  • data() - (return only the attribs, or texts + attribs?)
  • tokenizedTexts() - I suggest that when we run tokenize() we should store the result in the corpus object and simply retrieve the tokenized texts afterwards

Generic Functions:

  • clean() - corpus, text, (dfm?)
  • tokenize() - corpus, text
  • stopwords() - corpus, text, dfm
  • sample() - corpus, dfm

Error getting a directory of texts into a quanteda corpus

I've tried to get a directory of texts into a quanteda corpus with some issues. First, I make a VCorpus using the DirSource function in tm. Second, I try to make the object a quanteda corpus. However, I get the error "no applicable method for 'corpus' applied to an object of class "list"". But it's not a list; I've checked the files and everything seems sound.

library(quanteda)
library(tm)
Loading required package: NLP

Attaching package: ‘tm’

The following objects are masked from ‘package:quanteda’:

as.DocumentTermMatrix, stopwords

ds <- VCorpus(DirSource("~/Desktop/Speeches/House/2000/"))

# make it a quanteda object

txts <- corpus(ds)
Error in UseMethod("corpus") :
no applicable method for 'corpus' applied to an object of class "list"
class(ds)
[1] "VCorpus" "Corpus"

tokenize(x, what = "sentence") not working.

Problem with tokenize()

tokenize(x, what = "sentence") does not work when the next sentence starts with a lowercase letter.

Test with Problem

# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.", 
           what = "sentence", simplify = TRUE)
[1] "Hello!"                                          
[2] "This is sentence one. really?" # "one." does not force a break; "really?" looks OK
[3] "and this is two. and this is three.are you sure?" # "two. " does not force a break
[4] "yes."

Test OK

# They ALL look OK
# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.", 
           what = "sentence", simplify = TRUE)
[1] "Hello!"                "This is sentence one." "Really?"               "And this is two."
[5] "And this is three."    "Are you sure?"         "Yes."

Test with segmentSentence()

I did run the internal function segmentSentence(), and it looks OK in all cases.

> segmentSentence("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.")
[1] "Hello!"                "This is sentence one." "really?"               "and this is two."     
[5] "and this is three."    "are you sure?"         "yes."

> segmentSentence("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.")
[1] "Hello!"                "This is sentence one." "Really?"               "And this is two."     
[5] "And this is three."    "Are you sure?"         "Yes."

sessionInfo()

Note: I did not upgrade to quanteda_0.8.0-4 as it does not have a binary version, and I did not want to compile locally.

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US/en_US/en_US/C/en_US/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_0.8.0-3

loaded via a namespace (and not attached):
 [1] magrittr_1.5     plyr_1.8.3       Matrix_1.2-2     parallel_3.2.1   tools_3.2.1     
 [6] reshape2_1.4.1   Rcpp_0.12.0      stringi_0.5-5    grid_3.2.1       data.table_1.9.4
[11] stringr_1.0.0    chron_2.3-47     lattice_0.20-33  ca_0.58         
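For reference, in current quanteda versions tokenize(what = "sentence") has been superseded; sentence segmentation is now done by reshaping a corpus. A hedged sketch assuming the modern corpus_reshape() API:

```r
library(quanteda)

# reshape a corpus so that each document becomes one sentence
corp <- corpus("Hello! This is sentence one. really? and this is two.")
sents <- corpus_reshape(corp, to = "sentences")
as.character(sents)
```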

How do I turn a dfm into a sparseMatrix?

dfm already looks a lot like a sparse Matrix:

> str(dfm(inaugTexts, verbose=FALSE))
Formal class 'dfmSparse' [package "quanteda"] with 9 slots
  ..@ settings :List of 1
  .. ..$ : NULL
  ..@ weighting: chr "frequency"
  ..@ smooth   : num 0
  ..@ Dim      : int [1:2] 57 9214
  ..@ Dimnames :List of 2
  .. ..$ docs    : chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
  .. ..$ features: chr [1:9214] "14th" "15th" "18th" "19th" ...
  ..@ i        : int [1:43719] 0 27 30 52 52 45 46 8 52 51 ...
  ..@ p        : int [1:9215] 0 1 3 4 5 7 9 11 12 13 ...
  ..@ x        : num [1:43719] 1 1 1 1 2 1 2 1 3 1 ...
  ..@ factors  : list()

I want to be able to look at the matrix elements and pass it to modeling functions that take Matrix inputs.
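In current quanteda versions, a dfm extends the Matrix package's sparse matrix classes, so coercion is direct; convert() handles other target formats. A minimal sketch (function names per the modern API, which differs from the version in this issue):

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a b b c", d2 = "b c c")))

# a dfm inherits from Matrix's dgCMatrix, so coercion is direct
m <- as(dfmat, "dgCMatrix")

# convert() produces other formats, e.g. a dense base matrix
m_dense <- convert(dfmat, to = "matrix")
```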

how to stem both words of bigrams

dfm() with bigrams=TRUE and stem=TRUE creates bigrams where only the last word of the bigram is stemmed. How can both words of the bigram be stemmed?

dfm(c("banking industry"), clean=FALSE, stem=TRUE, bigrams=TRUE, verbose=FALSE)
Document-feature matrix of: 1 document, 3 features.
1 x 3 sparse Matrix of class "dfmSparse"
      features
docs   bank banking_industri industri
text1     1                1        1

I need "bank_industri" instead of "banking_industri"
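With the modern tokens-based API, the fix is to stem the tokens before forming the bigrams, so that both words get stemmed. A sketch assuming quanteda v3+ function names:

```r
library(quanteda)

# stem first, then form bigrams, so both words are stemmed
toks <- tokens("banking industry")
toks_stemmed <- tokens_wordstem(toks)
bigr <- tokens_ngrams(toks_stemmed, n = 2, concatenator = "_")
dfm(bigr)
```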

predict.dfm

It would be really useful to be able to project new texts into the same feature space as an existing dfm. This would be particularly useful if you're using texts as inputs to a predictive model.

Paul to redo the stop words implementation

I created a named list object data/stopwords.Rdata, taken from tm (which took it from other sources, including Snowball). Please write a function that replicates tm's stopwords(kind = "catalan") etc. to retrieve the stopwords based on the named list element (with error checking, of course). We can then make English the default and rewrite the dfm() argument accordingly. Note that this still allows us to send a character vector to dfm() for home-grown stop word lists (as Kohei likes).

POS feature selection

Add the ability to extract parts of speech (using OpenNLP) as features, as an option to dfm. This means we should think about modularising the objects that define dfm "features". Currently we have:

  • word tokens
  • stemmed word tokens
  • dictionary entries

Adding parts of speech would mean either selecting ONLY specific POS types, or extracting POS counts as features.

Update wordfish C++ methods

  • print a warning and/or ignore extreme outliers
  • return maximum likelihood standard errors
  • document starting values better

ignoredFeatures=stopwords("english") not working

Hi,

I'm using quanteda to generate ngrams for word prediction. Try the following:

test <- "this is just a test text i'm using"
my_dtm <- quanteda::dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))

"i'm" should have been removed, but the dtm still contains it.

quanteda version is: 0.8.0-3
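One workaround with the modern tokens API is to remove stopwords at the token stage, before the ngrams are formed. A sketch assuming quanteda v3+ function names; padding = TRUE prevents words that were never adjacent from being joined into spurious ngrams:

```r
library(quanteda)

toks <- tokens("this is just a test text i'm using", remove_punct = TRUE)

# drop stopwords before ngram formation; pads with empty strings so
# non-adjacent words do not get joined into false ngrams
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)

dfmat <- dfm(tokens_ngrams(toks, n = 1:3))
featnames(dfmat)
```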

Think about corpus structure

Currently the corpus object is a list of class "corpus" that contains:

  • attribs: a data frame where the first column is "texts" and the remaining columns are variables ("attributes") whose values are specific to each text
  • attribute labels: an optional user-supplied list of descriptions of each attribute
  • meta-data: a character vector consisting of:
      • source (user-supplied; the default is the full directory path and system)
      • creation date (automatic)
      • notes (default is NULL; can be user-supplied)

We could think about modifying this to include:

  • setting the Encoding(attribs$texts) to indicate the text encoding
  • indexing the texts
  • including additional objects for replication, such as dictionaries or dfm's

KWIC with keywords of length>1 (multiple keywords)

It would be nice to be able to use your kwic() function with keywords of length longer than 1. Maybe this is already possible, but I tried using a regex workaround (i.e. kwic(df$text, "(united)\\s(states)", window = 3)) to no avail.
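In later quanteda versions this is supported directly: wrap the multi-word pattern in phrase(). A sketch assuming the modern tokens()/kwic() API:

```r
library(quanteda)

toks <- tokens("The United States is one country; the united states of mind is another.")

# phrase() tells kwic() to treat the pattern as a multi-token sequence;
# matching is case-insensitive by default
kw <- kwic(toks, pattern = phrase("united states"), window = 3)
kw
```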

tokenize() running slow?

fullTest.R is taking a long time to run on the amicus texts, and the culprit is tokenize(). I will look into it; commenting it out of fullTest for now.

Maybe show top corner of the dfm when printing

> print(dfm(inaugTexts, verbose=FALSE))
Document-feature matrix of: 57 documents, 9214 features.

It might be nice to also see the top "corner" of the matrix (the first 5 rows and 5 columns)

Reduce use of austin functions

Consider defining a dfm object as an austin wfm, to make it possible to use the austin word/doc/trim etc. functions.

dfm error reported by Alex

dfm() crashes in the (admittedly very unlikely) case that all
documents have the same number of words:

This doesn't work:

dfm(createCorpus(c("a b","c d")))
Creating dfm: ...Error in rep.default(textnames,
sapply(tokenizedTexts, length)) :
invalid 'times' argument

This works:

dfm(createCorpus(c("a b","c d e")))
Creating dfm: ... done.
a b c d e
text1 1 1 0 0 0
text2 0 0 1 1 1

Work on collocates function

a) make a user-friendly wrapper function collocates() that mimics the existing collocation()
b) fully document
c) implement methods for text vectors and for a corpus object
d) make sure all examples work

Issues in ngrams.R

Ken's notes:

  • should use the tokeniser instead of just splitting on space delimiters
  • the help file needs its arguments and return value explained
  • we should generalize this to ngrams(text, n=2, window=1, unordered=FALSE), meaning we need a new, more general version to supersede this
      • note the suggested defaults above: the default window should be 1, not 2
      • we probably never really want unordered=TRUE, since these are not naturally occurring pairs; 'savings bank' is not the same as 'bank savings'
      • possibly define a version for a corpus
      • right now it treats a text vector as a single text, which is fine, but we will want to note this in the help file. To apply it to a vector of texts we will need to apply() it or define a corpus method.
