quanteda / quanteda
An R package for the Quantitative Analysis of Textual Data
Home Page: https://quanteda.io
License: GNU General Public License v3.0
dfm() crashes in the (admittedly very unlikely) case that all
documents have the same number of words:
This doesn't work:
dfm(createCorpus(c("a b","c d")))
Creating dfm: ...Error in rep.default(textnames, sapply(tokenizedTexts, length)) :
  invalid 'times' argument
This works:
dfm(createCorpus(c("a b","c d e")))
Creating dfm: ... done.
a b c d e
text1 1 1 0 0 0
text2 0 0 1 1 1
Not coercing Twitter output to a corpus
Partial index assignment causes recycling instead of assigning only to the indexed elements, which is not what the user would expect.
A guide to all the ways of importing texts into quanteda - getTextDir, getTextGui, corpusFromHeaders, corpusFromFilenames, twitter, etc.
It would be really useful to be able to project new texts into the same feature space as an existing dfm. This would be particularly useful if you're using texts as inputs to a predictive model.
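A minimal sketch of what such a projection could look like, using only base operations on an existing dfm (projectIntoFeatureSpace is a hypothetical helper name, not part of the package):
projectIntoFeatureSpace <- function(newTexts, trainDfm) {
    newDfm <- dfm(newTexts, verbose = FALSE)
    trainFeatures <- colnames(trainDfm)
    # start from an all-zero matrix over the training features
    out <- matrix(0, nrow = nrow(newDfm), ncol = length(trainFeatures),
                  dimnames = list(rownames(newDfm), trainFeatures))
    shared <- intersect(colnames(newDfm), trainFeatures)
    out[, shared] <- as.matrix(newDfm)[, shared]
    out
}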
Ken's notes
should use tokeniser instead of just splitting on space delimiters
the help file needs the arguments explained, and the return value documented
we should generalize this to ngrams(text, n=2, window=1, unordered=FALSE)
-- meaning we need a new, more general version to supersede this (a usage sketch follows these notes)
-- note the suggested defaults above - the default window should be 1, not 2
-- we probably never really want unordered=TRUE, since unordered pairs are not
naturally occurring: 'savings bank' is not the same as 'bank savings'
-- possibly define a version for a corpus
-- right now it treats a text vector as a single text, which is fine, but we will
want to note this in the help file. To apply it to a vector of texts we will need
to apply() it or define a corpus method.
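A rough sketch of how the proposed interface might be called, assuming the suggested signature and that adjacent tokens are joined with an underscore (this is the proposal above, not the current API):
ngrams("the quick brown fox jumps", n = 2, window = 1, unordered = FALSE)
# expected result: ordered, adjacent bigrams
# [1] "the_quick"   "quick_brown" "brown_fox"   "fox_jumps"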
Add the ability to extract parts of speech (using OpenNLP) as features, as an option to dfm. This means we should think of modularising the objects that define dfm "features". Currently we have:
word tokens
stemmed word tokens
dictionary entries
Adding parts of speech would mean either selecting ONLY specific POS types, or extracting POS counts as features.
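For the parts-of-speech idea above, a rough sketch of what POS extraction could look like with the openNLP annotators (this is the standard openNLP workflow, not an existing quanteda option; the resulting tag counts could then be bound into a dfm as extra features):
library(NLP)
library(openNLP)
s <- as.String("We must act boldly and wisely.")
# annotate sentences and words first, then POS-tag the word annotations
a2 <- annotate(s, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, `[[`, "POS")
table(tags)   # POS counts that could serve as dfm features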
To print a warning and/or ignore extreme outliers;
To return maximum likelihood standard errors;
To document starting values better.
tm's TermDocumentMatrix and DocumentTermMatrix classes are sparse matrix objects. Because these are used by many other packages, e.g. lda and topicmodels, we need to be able to translate our dfm into those formats.
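A minimal sketch of such a conversion via the slam package (dfmToDTM is a hypothetical helper name, not an existing quanteda function):
library(slam)
library(tm)
dfmToDTM <- function(x) {
    # coerce the dfm to a dense matrix, then to slam's triplet format,
    # then to tm's DocumentTermMatrix class with term-frequency weighting
    as.DocumentTermMatrix(as.simple_triplet_matrix(as.matrix(x)), weighting = weightTf)
}
dtm <- dfmToDTM(dfm(inaugTexts, verbose = FALSE))   # usable by code expecting a DocumentTermMatrix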
I've begun a document to experiment with rmarkdown for extensive documentation/tutorials
I created a named list object data/stopwords.Rdata, taken from tm (which took it from other sources, including Snowball). Please write a function that replicates tm's stopwords(kind="catalan") etc., retrieving the stopwords from the named list element (with error checking, of course). We can then make English the default and rewrite the dfm() argument accordingly. Note that this still allows us to send a character vector to dfm for home-grown stop word lists (as Kohei likes).
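A minimal sketch of the requested function, assuming the .Rdata file provides a named list called .stopwords (the object and function names here are placeholders):
stopwordsGet <- function(kind = "english") {
    # .stopwords is assumed to be the named list loaded from data/stopwords.Rdata
    if (!(kind %in% names(.stopwords)))
        stop("no stopword list available for kind = ", kind)
    .stopwords[[kind]]
}
stopwordsGet("catalan")   # returns the Catalan list; errors on an unknown kind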
for size=2, 3 as the same function
to remove bigrams that are parts of trigrams, trigrams that contain just a bigram
class(corp)
[1] "VCorpus" "Corpus"
corp2 <- corpus(corp)
Error in `[.data.frame`(metad, , c(1, 3:15)) : undefined columns selected
It would be nice to be able to use your KWIC function with keywords of length longer than 1. Maybe this is already possible, but I tried using a regex workaround (i.e. kwic(df$text, "(united)\s(states)", window = 3)) to no avail.
dfm with bigrams=TRUE and stem=TRUE creates bigrams in which only the last word of the bigram is stemmed. How can both words of the bigram be stemmed?
dfm(c("banking industry"), clean=FALSE, stem=TRUE, bigrams=TRUE, verbose=FALSE)
Document-feature matrix of: 1 document, 3 features.
1 x 3 sparse Matrix of class "dfmSparse"
       features
docs    bank banking_industri industri
  text1    1                1        1
I need "bank_industri" instead of "banking_industri"
wordstem() works only with an array of words. It should work with a corpus too. If you need to stem words before creating a dfm (which supports stemming of only the second word of a bigram), you have to transform the data from corpus/text to an array and then back to a corpus.
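A rough sketch of that workaround for the bigram-stemming case above (splitting on spaces here purely for illustration):
words <- unlist(strsplit("banking industry", " "))
stemmed <- wordstem(words)                    # "bank" "industri"
dfm(paste(stemmed, collapse = " "),
    stem = FALSE, bigrams = TRUE, verbose = FALSE)
# the bigram feature should now come out as "bank_industri"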
fullTest.R is taking a long time to run on the amicus texts, and the culprit is tokenize(). I will look into it; commenting it out of fullTest for now.
a) make a user-friendly wrapper function collocates() that mimics the existing collocation()
b) fully document
c) implement methods for text vectors and for a corpus object
d) make sure all examples work
I've tried to get a directory of texts into a quanteda corpus, with some issues. First, I make a VCorpus using the DirSource function in tm. Second, I try to make the object a quanteda corpus. However, I get the error "no applicable method for 'corpus' applied to an object of class "list"". But it's not a list; I've checked the files and everything seems sound.
library(quanteda)
library(tm)
Loading required package: NLP
Attaching package: ‘tm’
The following objects are masked from ‘package:quanteda’:
as.DocumentTermMatrix, stopwords
ds <- VCorpus(DirSource("~/Desktop/Speeches/House/2000/"))
# make it a quanteda object
txts <- corpus(ds)
Error in UseMethod("corpus") :
no applicable method for 'corpus' applied to an object of class "list"
class(ds)
[1] "VCorpus" "Corpus"
Just FYI: when installing the master distribution, you seem to have a dependency/suggestion on the 'topicmodels' package. This won't quite install fully on my Yosemite 10.10.1 system.
Warning message:
packages ‘quantedaData’, ‘topicmodels’ are not available (for R version 3.1.2)
Separate binary install fails
> install.packages('topicmodels')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)
package ‘topicmodels’ is available as a source package but not as a binary
Warning in install.packages :
package ‘topicmodels’ is not available (for R version 3.1.2)
and also from source
> install.packages('topicmodels', type='source')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘modeltools’
trying URL 'http://cran.rstudio.com/src/contrib/modeltools_0.2-21.tar.gz'
Content type 'application/x-gzip' length 14794 bytes (14 Kb)
opened URL
==================================================
downloaded 14 Kb
trying URL 'http://cran.rstudio.com/src/contrib/topicmodels_0.2-1.tar.gz'
Content type 'application/x-gzip' length 847889 bytes (828 Kb)
opened URL
==================================================
downloaded 828 Kb
* installing *source* package ‘modeltools’ ...
** package ‘modeltools’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
Creating a generic function for ‘na.fail’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.pass’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.omit’ from package ‘stats’ in package ‘modeltools’
Creating a generic function from function ‘MEapply’ in package ‘modeltools’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (modeltools)
* installing *source* package ‘topicmodels’ ...
** package ‘topicmodels’ successfully unpacked and MD5 sums checked
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -fPIC -Wall -mtune=core2 -g -O2 -c cokus.c -o cokus.o
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -fPIC -Wall -mtune=core2 -g -O2 -c common.c -o common.o
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -fPIC -Wall -mtune=core2 -g -O2 -c ctm.c -o ctm.o
ctm.c:29:10: fatal error: 'gsl/gsl_rng.h' file not found
#include <gsl/gsl_rng.h>
^
1 error generated.
make: *** [ctm.o] Error 1
ERROR: compilation failed for package ‘topicmodels’
* removing ‘/Users/will/Library/R/3.1/library/topicmodels’
Warning in install.packages :
installation of package ‘topicmodels’ had non-zero exit status
The downloaded source packages are in
‘/private/var/folders/gl/ds8cxcyj07x0f553zt__04mm0000gn/T/RtmplAdulj/downloaded_packages’
That's their problem not yours, of course, but figured you might want to know.
tokenize(x, what = "sentence")
is not working when the next sentence starts with a lowercase letter.
# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.",
what = "sentence", simplify = TRUE)
[1] "Hello!"
[2] "This is sentence one. really?" # "one." does not force a break; "really?" looks OK
[3] "and this is two. and this is three.are you sure?" # "two. " does not force a break
[4] "yes."
# They ALL look OK
# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.",
what = "sentence", simplify = TRUE)
[1] "Hello!" "This is sentence one." "Really?" "And this is two."
[5] "And this is three." "Are you sure?" "Yes."
I did run the internal function segmentSentence(), and it looks OK in all cases.
> segmentSentence("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.")
[1] "Hello!" "This is sentence one." "really?" "and this is two."
[5] "and this is three?" "are you sure." "yes!"
> segmentSentence("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.")
[1] "Hello!" "This is sentence one." "Really?" "And this is two."
[5] "And this is three?" "Are you sure." "Yes!"
Note: I did not upgrade to quanteda_0.8.0-4
as it does not have a binary version, and I did not want to compile locally.
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_US/en_US/en_US/C/en_US/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_0.8.0-3
loaded via a namespace (and not attached):
[1] magrittr_1.5 plyr_1.8.3 Matrix_1.2-2 parallel_3.2.1 tools_3.2.1
[6] reshape2_1.4.1 Rcpp_0.12.0 stringi_0.5-5 grid_3.2.1 data.table_1.9.4
[11] stringr_1.0.0 chron_2.3-47 lattice_0.20-33 ca_0.58
We should have either functions for importing other filetypes, or a user guide explaining good methods for getting these other filetypes into plain text.
> print(dfm(inaugTexts, verbose=FALSE))
Document-feature matrix of: 57 documents, 9214 features.
It might be nice to also see the top "corner" of the matrix (the first 5 rows and 5 columns)
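A sketch of what printing that corner could look like, using plain indexing on the dfm rather than a change to the print method itself (assuming Matrix-style indexing carries over from the sparse class):
d <- dfm(inaugTexts, verbose = FALSE)
as.matrix(d[1:5, 1:5])   # top-left 5 x 5 corner of the document-feature matrix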
Accessor functions:
texts()
words()
data() - (return only the attribs or texts + attribs?)
tokenizedTexts() - I suggest that when we run tokenize() we should store the result in the corpus object and simply retrieve the tokenized texts afterwards
Generic Functions:
clean() corpus, text, (dfm?)
tokenize() corpus, text
stopwords() corpus, text, dfm
sample() corpus, dfm
unzip("irishbudgets2010.zip")
Warning message:
In unzip("irishbudgets2010.zip") : error 1 in extracting from zip file
bigrams("banking", include.unigrams = FALSE)
[[1]]
[1] "banking_" "_banking"
bigrams(..., include.unigrams = FALSE, ...) is possible; the same option should be supported by dfm as well.
If a corpus contains a document with only whitespace characters, dfm crashes.
Including dfm and other methods that use dfms
The documentation states that "If x is a corpus, ‘clean’ returns the corpus containing the cleaned texts". But
class(clean(inaugCorpus))
gives "character" instead of "corpus".
These are all really inconsistently named. Changing them now.
JSON document headers containing attribute-value pairs will be detected and incorporated into the corpus object when texts are added.
How to scale documents in quanteda using correspondence analysis (ca) and wordfish.
Will need to index the tokens/sentences first
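For the scaling question above, a rough sketch using the ca package on a dfm (correspondence analysis only; a wordfish fit would come from austin or a quanteda textmodel):
library(ca)
d <- dfm(inaugTexts, verbose = FALSE)
fit <- ca(as.matrix(d))
head(fit$rowcoord[, 1])   # first-dimension positions of the documents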
Hi,
I'm using quanteda to generate ngrams for word prediction. Try the following:
test <- "this is just a test text i'm using"
my_dtm <- quanteda::dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))
"i'm" should have been removed, but the dtm still contains it.
quanteda version is: 0.8.0-3
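A possible workaround, sketched under the assumption that the mismatch comes from how "i'm" is tokenized: drop the stopwords from the whitespace-split tokens before forming the ngrams.
toks <- unlist(strsplit(tolower(test), "\\s+"))
toks <- toks[!(toks %in% stopwords("english"))]
my_dtm2 <- quanteda::dfm(paste(toks, collapse = " "),
                         ngrams = 1:3, concatenator = " ")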
This would involve defining a conversion from a dfm object to an austin wfm, to make it possible to use austin's word/doc/trim etc. functions.
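A minimal sketch of such a conversion, assuming austin's wfm() constructor with a word.margin argument (the helper name is hypothetical):
library(austin)
dfmToWfm <- function(x) {
    # austin expects a word-by-document count matrix;
    # word.margin = 1 says the words are on the rows
    wfm(t(as.matrix(x)), word.margin = 1)
}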
Currently the corpus object is a list of class "corpus" that contains (a schematic sketch follows this list):
attribs: a data frame where the first column is "texts", and the remaining columns are variables ("attributes") whose values are specific to each text.
attributes labels: an optional user-supplied list of descriptions of each attribute.
meta-data: character vector consisting of
source (user-supplied or default is full directory path and system)
creation date (automatic)
notes (default is NULL, can be user-supplied)
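A schematic sketch of that structure (illustrative only; the element names below mirror the description rather than the actual source):
corp <- list(
    attribs = data.frame(texts = c("first document text", "second document text"),
                         year  = c(2013, 2014),
                         stringsAsFactors = FALSE),
    attribs.labels = list(year = "year the speech was delivered"),
    metadata = c(source = "/path/to/texts (user system)",
                 created = date(),
                 notes = "")
)
class(corp) <- "corpus"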
We could think about modifying this to include:
A tutorial on using twitter APIs with R, from getting access tokens to working with tweet text in quanteda. Draft deadline: 15th May
dfm already looks a lot like a sparse Matrix:
> str(dfm(inaugTexts, verbose=FALSE))
Formal class 'dfmSparse' [package "quanteda"] with 9 slots
..@ settings :List of 1
.. ..$ : NULL
..@ weighting: chr "frequency"
..@ smooth : num 0
..@ Dim : int [1:2] 57 9214
..@ Dimnames :List of 2
.. ..$ docs : chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
.. ..$ features: chr [1:9214] "14th" "15th" "18th" "19th" ...
..@ i : int [1:43719] 0 27 30 52 52 45 46 8 52 51 ...
..@ p : int [1:9215] 0 1 3 4 5 7 9 11 12 13 ...
..@ x : num [1:43719] 1 1 1 1 2 1 2 1 3 1 ...
..@ factors : list()
I want to be able to look at the matrix elements and pass it to modeling functions that take Matrix inputs.
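A sketch of doing exactly that, on the assumption that dfmSparse extends the Matrix package's sparse classes (as the str output above suggests):
library(Matrix)
d <- dfm(inaugTexts, verbose = FALSE)
m <- as(d, "dgCMatrix")   # coerce to a plain sparse Matrix class
m[1:3, 1:3]               # inspect individual elements
# m can now be passed to modelling functions that accept Matrix inputs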
Case, punctuation, and digits should be options with defaults