quanteda / quanteda
An R package for the Quantitative Analysis of Textual Data
Home Page: https://quanteda.io
License: GNU General Public License v3.0
dfm() crashes in the (admittedly very unlikely) case that all
documents have the same number of words:
This doesn't work:
dfm(createCorpus(c("a b","c d")))
Creating dfm: ...Error in rep.default(textnames, sapply(tokenizedTexts, length)) :
  invalid 'times' argument
This works:
dfm(createCorpus(c("a b","c d e")))
Creating dfm: ... done.
a b c d e
text1 1 1 0 0 0
text2 0 0 1 1 1
Not coercing Twitter output to a corpus
Partial index assignment causes recycling instead of assigning only to the indexed elements, which is not what the user would expect.
A guide to all the ways of importing texts into quanteda - getTextDir, getTextGui, corpusFromHeaders, corpusFromFilenames, twitter, etc.
It would be really useful to be able to project new texts into the same feature space as an existing dfm. This would be particularly useful if you're using texts as inputs to a predictive model.
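A minimal sketch of what such a projection could look like, using only base operations on an existing dfm (projectIntoFeatureSpace is a hypothetical helper name, not part of the package):
projectIntoFeatureSpace <- function(newTexts, trainDfm) {
    newDfm <- dfm(newTexts, verbose = FALSE)
    trainFeatures <- colnames(trainDfm)
    # start from an all-zero matrix over the training features
    out <- matrix(0, nrow = nrow(newDfm), ncol = length(trainFeatures),
                  dimnames = list(rownames(newDfm), trainFeatures))
    shared <- intersect(colnames(newDfm), trainFeatures)
    out[, shared] <- as.matrix(newDfm)[, shared]
    out
}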
Ken's notes
should use tokeniser instead of just splitting on space delimiters
the help file needs the arguments explained, and the return value documented
we should generalize this to ngrams(text, n=2, window=1, unordered=FALSE)
-- meaning we need a new, more general version to supersede this (a usage sketch follows these notes)
-- note the suggested defaults above - the default window should be 1, not 2
-- we probably never really want unordered=TRUE, since unordered pairs are not
naturally occurring: 'savings bank' is not the same as 'bank savings'
-- possibly define a version for a corpus
-- right now it treats a text vector as a single text, which is fine, but we will
want to note this in the help file. To apply it to a vector of texts we will need
to apply() it or define a corpus method.
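A rough sketch of how the proposed interface might be called, assuming the suggested signature and that adjacent tokens are joined with an underscore (this is the proposal above, not the current API):
ngrams("the quick brown fox jumps", n = 2, window = 1, unordered = FALSE)
# expected result: ordered, adjacent bigrams
# [1] "the_quick"   "quick_brown" "brown_fox"   "fox_jumps"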
Add the ability to extract parts of speech (using OpenNLP) as features, as an option to dfm. This means we should think of modularising the objects that define dfm "features". Currently we have:
word tokens
stemmed word tokens
dictionary entries
Adding parts of speech would mean either selecting ONLY specific POS types, or extracting POS counts as features.
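For the parts-of-speech idea above, a rough sketch of what POS extraction could look like with the openNLP annotators (this is the standard openNLP workflow, not an existing quanteda option; the resulting tag counts could then be bound into a dfm as extra features):
library(NLP)
library(openNLP)
s <- as.String("We must act boldly and wisely.")
# annotate sentences and words first, then POS-tag the word annotations
a2 <- annotate(s, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, `[[`, "POS")
table(tags)   # POS counts that could serve as dfm features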
To print a warning and/or ignore extreme outliers;
To return maximum likelihood standard errors;
To document starting values better.
tm's TermDocumentMatrix and DocumentTermMatrix classes are sparse matrix objects. Because these are used by many other packages, e.g. lda and topicmodels, we need to be able to translate our dfm into those formats.
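A minimal sketch of such a conversion via the slam package (dfmToDTM is a hypothetical helper name, not an existing quanteda function):
library(slam)
library(tm)
dfmToDTM <- function(x) {
    # coerce the dfm to a dense matrix, then to slam's triplet format,
    # then to tm's DocumentTermMatrix class with term-frequency weighting
    as.DocumentTermMatrix(as.simple_triplet_matrix(as.matrix(x)), weighting = weightTf)
}
dtm <- dfmToDTM(dfm(inaugTexts, verbose = FALSE))   # usable by code expecting a DocumentTermMatrix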
I've begun a document to experiment with rmarkdown for extensive documentation/tutorials
I created a named list object data/stopwords.Rdata, taken from tm (which took it from other sources, including Snowball). Please write a function that replicates tm's stopwords(kind="catalan") etc., retrieving the stopwords from the named list element (with error checking, of course). We can then make English the default and rewrite the dfm() argument accordingly. Note that this still allows us to send a character vector to dfm for home-grown stop word lists (as Kohei likes).
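A minimal sketch of the requested function, assuming the .Rdata file provides a named list called .stopwords (the object and function names here are placeholders):
stopwordsGet <- function(kind = "english") {
    # .stopwords is assumed to be the named list loaded from data/stopwords.Rdata
    if (!(kind %in% names(.stopwords)))
        stop("no stopword list available for kind = ", kind)
    .stopwords[[kind]]
}
stopwordsGet("catalan")   # returns the Catalan list; errors on an unknown kind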
for size=2, 3 as the same function
to remove bigrams that are parts of trigrams, trigrams that contain just a bigram
class(corp)
[1] "VCorpus" "Corpus"
corp2 <- corpus(corp)
Error in `[.data.frame`(metad, , c(1, 3:15)) : undefined columns selected
It would be nice to be able to use your KWIC function with keywords of length longer than 1. Maybe this is already possible, but I tried using a regex workaround (i.e. kwic(df$text, "(united)\s(states)", window = 3)) to no avail.
dfm with bigrams=TRUE and stem=TRUE creates bigrams in which only the last word of the bigram is stemmed. How can both words of the bigram be stemmed?
dfm(c("banking industry"), clean=FALSE, stem=TRUE, bigrams=TRUE, verbose=FALSE)
Document-feature matrix of: 1 document, 3 features.
1 x 3 sparse Matrix of class "dfmSparse"
       features
docs    bank banking_industri industri
  text1    1                1        1
I need "bank_industri" instead of "banking_industri"
wordstem() works only with an array of words. It should work with a corpus too. If you need to stem words before creating a dfm (which supports stemming of only the second word of a bigram), you have to transform the data from corpus/text to an array and then back to a corpus.
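A rough sketch of that workaround for the bigram-stemming case above (splitting on spaces here purely for illustration):
words <- unlist(strsplit("banking industry", " "))
stemmed <- wordstem(words)                    # "bank" "industri"
dfm(paste(stemmed, collapse = " "),
    stem = FALSE, bigrams = TRUE, verbose = FALSE)
# the bigram feature should now come out as "bank_industri"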
fullTest.R is taking a long time to run on the amicus texts, and the culprit is tokenize(). I will look into it; commenting it out of fullTest for now.
a) make a user-friendly wrapper function collocates() that mimics the existing collocation()
b) fully document
c) implement methods for text vectors and for a corpus object
d) make sure all examples work
I've tried to get a directory of texts into a quanteda corpus, with some issues. First, I make a VCorpus using the DirSource function in tm. Second, I try to make the object a quanteda corpus. However, I get the error "no applicable method for 'corpus' applied to an object of class "list"". But it's not a list; I've checked the files and everything seems sound.
library(quanteda)
library(tm)
Loading required package: NLP
Attaching package: ‘tm’
The following objects are masked from ‘package:quanteda’:
as.DocumentTermMatrix, stopwords
ds <- VCorpus(DirSource("~/Desktop/Speeches/House/2000/"))
# make it a quanteda object
txts <- corpus(ds)
Error in UseMethod("corpus") :
no applicable method for 'corpus' applied to an object of class "list"
class(ds)
[1] "VCorpus" "Corpus"
Just FYI: when installing the master distribution, you seem to have a dependency/suggestion on the 'topicmodels' package. This won't quite install fully on my Yosemite 10.10.1 system.
Warning message:
packages ‘quantedaData’, ‘topicmodels’ are not available (for R version 3.1.2)
Separate binary install fails
> install.packages('topicmodels')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)
package ‘topicmodels’ is available as a source package but not as a binary
Warning in install.packages :
package ‘topicmodels’ is not available (for R version 3.1.2)
and also from source
> install.packages('topicmodels', type='source')
Installing package into ‘/Users/will/Library/R/3.1/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘modeltools’
trying URL 'http://cran.rstudio.com/src/contrib/modeltools_0.2-21.tar.gz'
Content type 'application/x-gzip' length 14794 bytes (14 Kb)
opened URL
==================================================
downloaded 14 Kb
trying URL 'http://cran.rstudio.com/src/contrib/topicmodels_0.2-1.tar.gz'
Content type 'application/x-gzip' length 847889 bytes (828 Kb)
opened URL
==================================================
downloaded 828 Kb
* installing *source* package ‘modeltools’ ...
** package ‘modeltools’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
Creating a generic function for ‘na.fail’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.pass’ from package ‘stats’ in package ‘modeltools’
Creating a generic function for ‘na.omit’ from package ‘stats’ in package ‘modeltools’
Creating a generic function from function ‘MEapply’ in package ‘modeltools’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (modeltools)
* installing *source* package ‘topicmodels’ ...
** package ‘topicmodels’ successfully unpacked and MD5 sums checked
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -fPIC -Wall -mtune=core2 -g -O2 -c cokus.c -o cokus.o
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -fPIC -Wall -mtune=core2 -g -O2 -c common.c -o common.o
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -fPIC -Wall -mtune=core2 -g -O2 -c ctm.c -o ctm.o
ctm.c:29:10: fatal error: 'gsl/gsl_rng.h' file not found
#include <gsl/gsl_rng.h>
^
1 error generated.
make: *** [ctm.o] Error 1
ERROR: compilation failed for package ‘topicmodels’
* removing ‘/Users/will/Library/R/3.1/library/topicmodels’
Warning in install.packages :
installation of package ‘topicmodels’ had non-zero exit status
The downloaded source packages are in
‘/private/var/folders/gl/ds8cxcyj07x0f553zt__04mm0000gn/T/RtmplAdulj/downloaded_packages’
That's their problem not yours, of course, but figured you might want to know.
tokenize(x, what = "sentence")
is not working when the next sentence starts with a lowercase letter.
# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.",
what = "sentence", simplify = TRUE)
[1] "Hello!"
[2] "This is sentence one. really?" # "one." does not force a break; "really?" looks OK
[3] "and this is two. and this is three.are you sure?" # "two. " does not force a break
[4] "yes."
# They ALL look OK
# spaces (or not) after the punctuation do not affect result
> tokenize("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.",
what = "sentence", simplify = TRUE)
[1] "Hello!" "This is sentence one." "Really?" "And this is two."
[5] "And this is three." "Are you sure?" "Yes."
I did run the internal function segmentSentence(), and it looks OK in all cases.
> segmentSentence("Hello! This is sentence one. really? and this is two. and this is three.are you sure?yes.")
[1] "Hello!" "This is sentence one." "really?" "and this is two."
[5] "and this is three?" "are you sure." "yes!"
> segmentSentence("Hello! This is sentence one. Really? And this is two. And this is three.Are you sure?Yes.")
[1] "Hello!" "This is sentence one." "Really?" "And this is two."
[5] "And this is three?" "Are you sure." "Yes!"
Note: I did not upgrade to quanteda_0.8.0-4
as it does not have a binary version, and I did not want to compile locally.
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_US/en_US/en_US/C/en_US/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_0.8.0-3
loaded via a namespace (and not attached):
[1] magrittr_1.5 plyr_1.8.3 Matrix_1.2-2 parallel_3.2.1 tools_3.2.1
[6] reshape2_1.4.1 Rcpp_0.12.0 stringi_0.5-5 grid_3.2.1 data.table_1.9.4
[11] stringr_1.0.0 chron_2.3-47 lattice_0.20-33 ca_0.58
We should have either functions for importing other filetypes, or a user guide explaining good methods for getting these other filetypes into plain text.
> print(dfm(inaugTexts, verbose=FALSE))
Document-feature matrix of: 57 documents, 9214 features.
It might be nice to also see the top "corner" of the matrix (the first 5 rows and 5 columns)
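A sketch of what printing that corner could look like, using plain indexing on the dfm rather than a change to the print method itself (assuming Matrix-style indexing carries over from the sparse class):
d <- dfm(inaugTexts, verbose = FALSE)
as.matrix(d[1:5, 1:5])   # top-left 5 x 5 corner of the document-feature matrix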
Accessor functions:
texts()
words()
data() - (return only the attribs or texts + attribs?)
tokenizedTexts() - I suggest that when we run tokenize() we should store the result in the corpus object and simply retrieve the tokenized texts afterwards
Generic Functions:
clean() corpus, text, (dfm?)
tokenize() corpus, text
stopwords() corpus, text, dfm
sample() corpus, dfm
unzip("irishbudgets2010.zip")
Warning message:
In unzip("irishbudgets2010.zip") : error 1 in extracting from zip file
bigrams("banking", include.unigrams = FALSE)
[[1]]
[1] "banking_" "_banking"
bigrams(..., include.unigrams = FALSE, ...) is possible; the same option should be supported by dfm as well.
If a corpus contains a document with only whitespace characters, dfm crashes.
Including dfm and other methods that use dfms
The documentation states that "If x is a corpus, ‘clean’ returns the corpus containing the cleaned texts". But
class(clean(inaugCorpus))
gives "character" instead of "corpus".
These are all really inconsistently named. Changing them now.
JSON document headers containing attribute-value pairs will be detected and incorporated into the corpus object when texts are added.
How to scale documents in quanteda using correspondence analysis (ca) and wordfish.
Will need to index the tokens/sentences first
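For the scaling question above, a rough sketch using the ca package on a dfm (correspondence analysis only; a wordfish fit would come from austin or a quanteda textmodel):
library(ca)
d <- dfm(inaugTexts, verbose = FALSE)
fit <- ca(as.matrix(d))
head(fit$rowcoord[, 1])   # first-dimension positions of the documents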
Hi,
I'm using quanteda to generate ngrams for word prediction. Try the following:
test <- "this is just a test text i'm using"
my_dtm <- quanteda::dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))
"i'm" should have been removed, but the dtm still contains it.
quanteda version is: 0.8.0-3
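A possible workaround, sketched under the assumption that the mismatch comes from how "i'm" is tokenized: drop the stopwords from the whitespace-split tokens before forming the ngrams.
toks <- unlist(strsplit(tolower(test), "\\s+"))
toks <- toks[!(toks %in% stopwords("english"))]
my_dtm2 <- quanteda::dfm(paste(toks, collapse = " "),
                         ngrams = 1:3, concatenator = " ")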
This would involve defining a conversion from a dfm object to an austin wfm, to make it possible to use austin's word/doc/trim etc. functions.
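A minimal sketch of such a conversion, assuming austin's wfm() constructor with a word.margin argument (the helper name is hypothetical):
library(austin)
dfmToWfm <- function(x) {
    # austin expects a word-by-document count matrix;
    # word.margin = 1 says the words are on the rows
    wfm(t(as.matrix(x)), word.margin = 1)
}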
Currently the corpus object is a list of class "corpus" that contains (a schematic sketch follows this list):
attribs: a data frame where the first column is "texts", and the remaining columns are variables ("attributes") whose values are specific to each text.
attributes labels: an optional user-supplied list of descriptions of each attribute.
meta-data: character vector consisting of
source (user-supplied or default is full directory path and system)
creation date (automatic)
notes (default is NULL, can be user-supplied)
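A schematic sketch of that structure (illustrative only; the element names below mirror the description rather than the actual source):
corp <- list(
    attribs = data.frame(texts = c("first document text", "second document text"),
                         year  = c(2013, 2014),
                         stringsAsFactors = FALSE),
    attribs.labels = list(year = "year the speech was delivered"),
    metadata = c(source = "/path/to/texts (user system)",
                 created = date(),
                 notes = "")
)
class(corp) <- "corpus"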
We could think about modifying this to include:
A tutorial on using twitter APIs with R, from getting access tokens to working with tweet text in quanteda. Draft deadline: 15th May
dfm already looks a lot like a sparse Matrix:
> str(dfm(inaugTexts, verbose=FALSE))
Formal class 'dfmSparse' [package "quanteda"] with 9 slots
..@ settings :List of 1
.. ..$ : NULL
..@ weighting: chr "frequency"
..@ smooth : num 0
..@ Dim : int [1:2] 57 9214
..@ Dimnames :List of 2
.. ..$ docs : chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
.. ..$ features: chr [1:9214] "14th" "15th" "18th" "19th" ...
..@ i : int [1:43719] 0 27 30 52 52 45 46 8 52 51 ...
..@ p : int [1:9215] 0 1 3 4 5 7 9 11 12 13 ...
..@ x : num [1:43719] 1 1 1 1 2 1 2 1 3 1 ...
..@ factors : list()
I want to be able to look at the matrix elements and pass it to modeling functions that take Matrix inputs.
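A sketch of doing exactly that, on the assumption that dfmSparse extends the Matrix package's sparse classes (as the str output above suggests):
library(Matrix)
d <- dfm(inaugTexts, verbose = FALSE)
m <- as(d, "dgCMatrix")   # coerce to a plain sparse Matrix class
m[1:3, 1:3]               # inspect individual elements
# m can now be passed to modelling functions that accept Matrix inputs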
Case, punctuation, and digits should be options with defaults