Comments (10)
I would expect that the stopwords are removed first, and then the ngrams are created. E.g. "this is just a test text i'm using" would result in the following bigrams (stopwords: is, a): this just, just test, test text, text using.
from quanteda.
You're right, that's a bug I will fix. What should the behaviour be for ngrams containing stopwords? Removing any whole ngram that contains a stopword seems most appropriate. But we would not want to create non-adjacent ngrams as a result of the removal, e.g. "text using".
This is tricky because, while a unigram model can often ignore stopwords, they can have contrastive effects in ngrams, e.g. 'assistant president' vs. 'assistant to the president', 'war on terror' vs. 'war of terror', 'the bank of england' vs. 'a bank in england'. On the other hand, I have run bigram models without removing stopwords, and many of the top features end up being uninterpretable stopword combinations like 'that_it' or 'to_their'.
I would suggest removing ngrams that contain only stopwords, and possibly also removing ngrams that begin and/or end with stopwords.
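To make the "remove whole ngrams containing stopwords" behaviour concrete, here is a minimal base-R sketch (the helper name is hypothetical, not quanteda's actual implementation): the ngrams are formed from adjacent tokens first, and only then filtered, so a non-adjacent gram like "text_using" can never be created.

```r
# Sketch: form ngrams from adjacent tokens, then drop any ngram whose
# window contains a stopword. Filtering happens after pairing, so no
# non-adjacent ngram is ever produced.
ngrams_no_stopwords <- function(tokens, stopwords, n = 2, concatenator = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts, function(i)
    paste(tokens[i:(i + n - 1)], collapse = concatenator), character(1))
  has_stop <- vapply(starts, function(i)
    any(tokens[i:(i + n - 1)] %in% stopwords), logical(1))
  grams[!has_stop]
}

toks <- c("this", "is", "just", "a", "test", "text", "i'm", "using")
ngrams_no_stopwords(toks, stopwords = c("this", "is", "a", "i'm"))
# keeps only "test_text"; "text_using" is never formed
```

With the stricter rule (drop only ngrams that begin or end with a stopword), the `has_stop` test would check just `tokens[i]` and `tokens[i + n - 1]` instead of the whole window.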
I was going crazy trying to wrap my brain around this with a tokenize-to-dfm route. I thought that if I tokenized after building a corpus, I would lose the other useful features such as the docvars. I would be interested in a solution similar to what pnulty suggested.
Not sure what you mean by that, because there is no other option than to tokenise after building a corpus. But you can always associate the document variables with the documents in the document-feature matrix later, by extracting them using docvars(myCorpus), even after you've tokenized myCorpus or built a dfm from it.
I would create the corpus so that I could assign the "docvars" parameter... something like:
uk2010immigCorpus <- corpus(ukimmigTexts,
    docvars = data.frame(party = names(ukimmigTexts)),
    notes = "Immigration-related sections of 2010 UK party manifestos",
    enc = "UTF-8")
If I then ran this corpus through tokenize() before dfm() (mainly so that my stopwords would not be included in bigrams during dfm()), then I would lose the ability to do this down the road:
train <- amicusDfm[!is.na(docvars(amicusCorpus, "trainclass")), ]
test <- amicusDfm[!is.na(docvars(amicusCorpus, "testclass")), ]
In the end, I just did a simple corpus() followed by two sets of dfm() for training and testing data.
Yes, that's how I would do it. You can always get the subset of docvars for your dfm using:
docvars(subset(amicusCorpus, trainclass == TRUE))
for the training class (assuming you have a logical variable called trainclass), etc. It would be possible to pass docvars through as attributes of the dfm class object, to be extracted with docvars(amicusDfm), but this would a) greatly increase the size of the dfm, and b) not be a determinate operation when groups is not NULL and documents have been aggregated.
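The reason this works is that the rows of the dfm and the rows of the corpus docvars stay aligned by document. That alignment can be illustrated with plain base-R stand-ins (the matrix and data frame below are hypothetical placeholders for amicusDfm and its corpus's docvars, not real data):

```r
# A toy document-feature matrix and a parallel docvars data frame;
# rows correspond to the same documents in the same order, so a logical
# filter computed on the docvars selects the matching matrix rows.
mat <- matrix(c(1, 0, 2, 1,
                0, 3, 1, 0), nrow = 4,
              dimnames = list(paste0("doc", 1:4), c("feat1", "feat2")))
dv <- data.frame(trainclass = c(TRUE, TRUE, NA, NA),
                 testclass  = c(NA, NA, TRUE, FALSE))

train <- mat[!is.na(dv$trainclass), , drop = FALSE]  # doc1, doc2
test  <- mat[!is.na(dv$testclass),  , drop = FALSE]  # doc3, doc4
```

This is exactly the pattern in the train/test subsetting above: the docvars come from the corpus, the rows come from the dfm, and document order is the link between them.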
Solved the original issue (not removing "it's"), now:
> test <- "this is just a test text i'm using"
> dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 1 document
... indexing features: 21 feature types
... removed 16 features, from 174 supplied feature types
... created a 1 x 5 sparse dfm
... complete.
Elapsed time: 0.038 seconds.
Document-feature matrix of: 1 document, 5 features.
1 x 5 sparse Matrix of class "dfmSparse"
       features
docs    just test text using test text
  text1    1    1    1     1         1
Added functionality to remove ngrams that contain stopwords (see the example above). To remove stopwords first and then create ngrams from what is left, see Details in ?dfm, where an example is provided.
I reopened the issue since my "fix" for removing stopwords once ngrams have been formed turns out to be impossibly slow. e.g.
dfm(inaugTexts, ngrams=2, ignoredFeatures=stopwords('english'))
grinds to an ugly halt.
Performance issue fixed. Thanks for the help from the geniuses at Stack Overflow.
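A sketch of why a vectorized approach is so much faster (base R; this illustrates the general technique, and is an assumption rather than the actual quanteda fix): instead of splitting every ngram back into tokens, a single regex anchored on the concatenator tests all ngrams at once with one grepl() call.

```r
stopw <- c("is", "a", "the")
grams <- c("test_text", "is_just", "text_using", "a_test")

# Each stopword must occupy a whole "_"-delimited slot of the ngram.
# (Assumes the stopwords contain no regex metacharacters; otherwise
# they would need escaping before being pasted into the pattern.)
pat <- paste0("(^|_)(", paste(stopw, collapse = "|"), ")(_|$)")
grams[!grepl(pat, grams)]
# "test_text" "text_using"
```

One vectorized grepl() over the full feature set replaces a per-ngram split-and-compare loop, which is the kind of change that turns "grinds to an ugly halt" into something usable on a corpus the size of inaugTexts.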