Comments (10)
I would expect that the stopwords are removed first, and then the ngrams are created. E.g. "this is just a test text i'm using" would result in the following bigrams (stopwords: is, a): this just, just test, test text, text using.
from quanteda.
You're right, that's a bug I will fix. What should the behaviour be for ngrams containing stopwords? Removing any whole ngram that contains a stopword seems most appropriate. But we would not want to create non-adjacent ngrams as a result of the removal, e.g. "text using".
This is tricky because, while a unigram model can often ignore stopwords, they can have contrastive effects in ngrams, e.g. 'assistant president' vs. 'assistant to the president', 'war on terror' vs. 'war of terror', 'the bank of england' vs. 'a bank in england'. On the other hand, I have run bigram models without removing stopwords, and many of the top features end up being uninterpretable stopword combinations like 'that_it' or 'to_their'.
I would suggest removing ngrams that contain only stopwords, and possibly also removing ngrams that begin and/or end with stopwords.
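To make the "remove whole ngrams containing stopwords" behaviour concrete, here is a minimal base-R sketch (the helper name is hypothetical, not quanteda's actual implementation): the ngrams are formed from adjacent tokens first, and only then filtered, so a non-adjacent gram like "text_using" can never be created.

```r
# Sketch: form ngrams from adjacent tokens, then drop any ngram whose
# window contains a stopword. Filtering happens after pairing, so no
# non-adjacent ngram is ever produced.
ngrams_no_stopwords <- function(tokens, stopwords, n = 2, concatenator = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts, function(i)
    paste(tokens[i:(i + n - 1)], collapse = concatenator), character(1))
  has_stop <- vapply(starts, function(i)
    any(tokens[i:(i + n - 1)] %in% stopwords), logical(1))
  grams[!has_stop]
}

toks <- c("this", "is", "just", "a", "test", "text", "i'm", "using")
ngrams_no_stopwords(toks, stopwords = c("this", "is", "a", "i'm"))
# keeps only "test_text"; "text_using" is never formed
```

With the stricter rule (drop only ngrams that begin or end with a stopword), the `has_stop` test would check just `tokens[i]` and `tokens[i + n - 1]` instead of the whole window.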
I was going crazy trying to wrap my brain around this with a tokenize-to-dfm route. I thought that if I tokenized after building a corpus, I would lose the other useful features such as the docvars. I would be interested in a solution similar to what pnulty suggested.
Not sure what you mean by that, because there is no other option than to tokenise after building a corpus. But you can always associate the document variables with the documents in the document-feature matrix later, by extracting them using docvars(myCorpus), even after you've tokenized myCorpus or built a dfm from it.
I would create the corpus so that I could assign the "docvars" parameter... something like:
uk2010immigCorpus <- corpus(ukimmigTexts,
    docvars = data.frame(party = names(ukimmigTexts)),
    notes = "Immigration-related sections of 2010 UK party manifestos",
    enc = "UTF-8")
If I then ran this corpus through tokenize() before dfm() (mainly so that my stopwords would not be included in bigrams during dfm()), then I would lose the ability to do this down the road:
train <- amicusDfm[!is.na(docvars(amicusCorpus, "trainclass")), ]
test <- amicusDfm[!is.na(docvars(amicusCorpus, "testclass")), ]
In the end, I just did a simple corpus() followed by two sets of dfm() for training and testing data.
Yes, that's how I would do it. You can always get the subset of docvars for your dfm using:
docvars(subset(amicusCorpus, trainclass == TRUE))
for the training class (assuming you have a logical variable called trainclass), etc. It would be possible to pass docvars through as attributes of the dfm class object, to be extracted with docvars(amicusDfm), but this would a) greatly increase the size of the dfm, and b) not be a determinate operation when groups is not NULL and documents have been aggregated.
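The reason this works is that the rows of the dfm and the rows of the corpus docvars stay aligned by document. That alignment can be illustrated with plain base-R stand-ins (the matrix and data frame below are hypothetical placeholders for amicusDfm and its corpus's docvars, not real data):

```r
# A toy document-feature matrix and a parallel docvars data frame;
# rows correspond to the same documents in the same order, so a logical
# filter computed on the docvars selects the matching matrix rows.
mat <- matrix(c(1, 0, 2, 1,
                0, 3, 1, 0), nrow = 4,
              dimnames = list(paste0("doc", 1:4), c("feat1", "feat2")))
dv <- data.frame(trainclass = c(TRUE, TRUE, NA, NA),
                 testclass  = c(NA, NA, TRUE, FALSE))

train <- mat[!is.na(dv$trainclass), , drop = FALSE]  # doc1, doc2
test  <- mat[!is.na(dv$testclass),  , drop = FALSE]  # doc3, doc4
```

This is exactly the pattern in the train/test subsetting above: the docvars come from the corpus, the rows come from the dfm, and document order is the link between them.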
Solved the original issue (not removing "it's"), now:
> test <- "this is just a test text i'm using"
> dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 1 document
... indexing features: 21 feature types
... removed 16 features, from 174 supplied feature types
... created a 1 x 5 sparse dfm
... complete.
Elapsed time: 0.038 seconds.
Document-feature matrix of: 1 document, 5 features.
1 x 5 sparse Matrix of class "dfmSparse"
       features
docs    just test text using test text
  text1    1    1    1     1         1
Added functionality to remove ngrams that contain stopwords (see the example above). To remove stopwords first and then create ngrams from what is left, see Details in ?dfm, where an example is provided.
I reopened the issue since my "fix" for removing stopwords once ngrams have been formed turns out to be impossibly slow. e.g.
dfm(inaugTexts, ngrams=2, ignoredFeatures=stopwords('english'))
grinds to an ugly halt.
Performance issue fixed. Thanks for the help from the geniuses at Stack Overflow.
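A sketch of why a vectorized approach is so much faster (base R; this illustrates the general technique, and is an assumption rather than the actual quanteda fix): instead of splitting every ngram back into tokens, a single regex anchored on the concatenator tests all ngrams at once with one grepl() call.

```r
stopw <- c("is", "a", "the")
grams <- c("test_text", "is_just", "text_using", "a_test")

# Each stopword must occupy a whole "_"-delimited slot of the ngram.
# (Assumes the stopwords contain no regex metacharacters; otherwise
# they would need escaping before being pasted into the pattern.)
pat <- paste0("(^|_)(", paste(stopw, collapse = "|"), ")(_|$)")
grams[!grepl(pat, grams)]
# "test_text" "text_using"
```

One vectorized grepl() over the full feature set replaces a per-ngram split-and-compare loop, which is the kind of change that turns "grinds to an ugly halt" into something usable on a corpus the size of inaugTexts.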