
Comments (10)

bfisseler commented on June 4, 2024

I would expect the stopwords to be removed first and the ngrams to be created afterwards. E.g. "this is just a test text i'm using" would result in the following bigrams (stopwords: is, a, i'm): this just, just test, test text, text using.
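The expected behaviour described above can be sketched in base R. This is only an illustration and not quanteda's implementation: `make_ngrams` is a hypothetical helper, and the stopword list includes "i'm" on the assumption that the listed result "text using" implies it was dropped.

```r
# Hypothetical helper: join each run of n adjacent tokens into one n-gram.
make_ngrams <- function(tokens, n = 2, concatenator = " ") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = concatenator),
         character(1))
}

txt   <- "this is just a test text i'm using"
toks  <- strsplit(txt, " ", fixed = TRUE)[[1]]
stops <- c("is", "a", "i'm")          # assumed stopword list for this example

# Step 1: drop stopwords. Step 2: build bigrams from what is left.
kept    <- toks[!toks %in% stops]
bigrams <- make_ngrams(kept, n = 2)
print(bigrams)
```

Because the stopwords are removed before the n-grams are formed, some of the resulting bigrams (such as "this just") join words that were not adjacent in the original text.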

from quanteda.

kbenoit commented on June 4, 2024

You're right, that's a bug I will fix. What should be the behaviour for grams containing stopwords? Removing the whole ngram containing any stopword seems to be most appropriate. But we would not want to create non-adjacent ngrams as a result of this removal, e.g. "text using".
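A minimal base R sketch of the alternative raised here: form the bigrams from adjacent tokens first, then drop every bigram that contains a stopword. `ngram_parts` is a hypothetical helper and the stopword list is an assumption for the example; this is not the code that was committed to quanteda.

```r
# Hypothetical helper: return each n-gram as a vector of its component tokens.
ngram_parts <- function(tokens, n) {
  if (length(tokens) < n) return(list())
  lapply(seq_len(length(tokens) - n + 1),
         function(i) tokens[i:(i + n - 1)])
}

toks  <- strsplit("this is just a test text i'm using", " ", fixed = TRUE)[[1]]
stops <- c("is", "a", "i'm")          # assumed stopword list for this example

parts    <- ngram_parts(toks, 2)
has_stop <- vapply(parts, function(p) any(p %in% stops), logical(1))

# Keep only bigrams made entirely of non-stopwords that were adjacent in the text.
bigrams <- vapply(parts[!has_stop], paste, character(1), collapse = " ")
print(bigrams)
```

With this ordering only "test text" survives, and a non-adjacent pair such as "text using" can never be created.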


pnulty commented on June 4, 2024

This is tricky because while a unigram model can often ignore stopwords, they can have contrastive effects in ngrams, e.g. 'assistant president' vs 'assistant to the president', 'war on terror' vs 'war of terror', 'the bank of england', 'a bank in england'. On the other hand I have run bigram models without removing stopwords and many of the top features end up being uninterpretable stopword combinations like 'that_it' or 'to_their'.

I would suggest removing ngrams that contain only stopwords and possibly removing ngrams that begin and/or end with stopwords.
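The two rules suggested above can be compared on a small example. The sketch below is base R only, with a made-up sentence, an assumed stopword list, and hypothetical helper names; it is meant to show the difference between the rules, not how quanteda implements either.

```r
# Hypothetical helper: return each n-gram as a vector of its component tokens.
ngram_parts <- function(tokens, n) {
  if (length(tokens) < n) return(list())
  lapply(seq_len(length(tokens) - n + 1),
         function(i) tokens[i:(i + n - 1)])
}
join <- function(x) vapply(x, paste, character(1), collapse = " ")

toks  <- strsplit("it is the war on terror", " ", fixed = TRUE)[[1]]
stops <- c("it", "is", "the", "on")   # assumed stopword list for this example
parts <- ngram_parts(toks, 3)

# Rule 1: the n-gram consists only of stopwords.
all_stop  <- vapply(parts, function(p) all(p %in% stops), logical(1))
# Rule 2: the n-gram begins or ends with a stopword.
edge_stop <- vapply(parts,
                    function(p) p[1] %in% stops || p[length(p)] %in% stops,
                    logical(1))

drop_all_stop  <- join(parts[!all_stop])
drop_edge_stop <- join(parts[!(all_stop | edge_stop)])
print(drop_all_stop)
print(drop_edge_stop)
```

Rule 1 alone removes only "it is the", while adding rule 2 leaves just "war on terror", which keeps the contrastive inner stopword "on".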


tyokota commented on June 4, 2024

I was going crazy trying to wrap my brain around this with a tokenize to dfm route. I realized that if I tokenized after building a corpus, I would lose the other useful features such as the docvars. I would be interested in a solution similar to what pnulty suggested.


kbenoit commented on June 4, 2024

Not sure what you mean by that, because there is no other option than to tokenise after building a corpus. But you can always associate the document variables with the documents in the document-feature matrix later, by extracting them using docvars(myCorpus), even after you've tokenized myCorpus or built a dfm from it.


tyokota commented on June 4, 2024

I would create the corpus so that I could assign the "docvars" parameter, something like:

uk2010immigCorpus <- corpus(ukimmigTexts,
                            docvars = data.frame(party = names(ukimmigTexts)),
                            notes = "Immigration-related sections of 2010 UK party manifestos",
                            enc = "UTF-8")

If I then ran this corpus into tokenize() before dfm() (mainly so that my stopwords would not be included in bigrams during dfm()), then I would lose the ability to do this down the road:

train <- amicusDfm[!is.na(docvars(amicusCorpus, "trainclass")), ]
test <- amicusDfm[!is.na(docvars(amicusCorpus, "testclass")), ]

In the end, I just did a simple corpus() followed by two sets of dfm() for training and testing data.


kbenoit commented on June 4, 2024

Yes that's how I would do it. You can always get the subset of docvars for your dfm using:

docvars(subset(amicusCorpus, trainclass == TRUE))

for the training class, assuming you have a logical variable called trainclass, etc. It would be possible to pass through docvars as attributes of the dfm class object, to be extracted through docvars(amicusDfm) but this would a) greatly increase the size of the dfm, and b) not be a determinate operation when groups is not null and documents have been aggregated.
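The train/test split discussed in this exchange relies on the document variables and the rows of the document-feature matrix being in the same document order. The sketch below shows that idea with a plain data frame and a plain matrix standing in for `docvars()` output and a dfm; the variable values and counts are made-up toy data, and none of this is the quanteda API.

```r
# Toy stand-in for docvars(amicusCorpus): one row per document.
doc_vars <- data.frame(
  trainclass = c("P", NA, "R", NA),
  testclass  = c(NA, "P", NA, "R"),
  row.names  = paste0("doc", 1:4),
  stringsAsFactors = FALSE
)

# Toy stand-in for a dfm: rows are documents in the same order as doc_vars.
counts <- matrix(c(1, 0,
                   0, 2,
                   3, 1,
                   0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(rownames(doc_vars), c("court", "appeal")))

# A logical index built from the document variables selects matrix rows.
train <- counts[!is.na(doc_vars$trainclass), , drop = FALSE]
test  <- counts[!is.na(doc_vars$testclass), , drop = FALSE]
print(rownames(train))
print(rownames(test))
```

The same indexing breaks down once documents have been aggregated by group, since the matrix rows then no longer correspond one-to-one to rows of the document variables.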


kbenoit commented on June 4, 2024

Solved the original issue (not removing "it's"), now:

> test <- "this is just a test text i'm using"
> dfm(test, ngrams = 1:3, concatenator = " ", ignoredFeatures = stopwords("english"))

Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 1 document
   ... indexing features: 21 feature types
   ... removed 16 features, from 174 supplied feature types
   ... created a 1 x 5 sparse dfm
   ... complete. 
Elapsed time: 0.038 seconds.
Document-feature matrix of: 1 document, 5 features.
1 x 5 sparse Matrix of class "dfmSparse"
       features
docs    just test text using test text
  text1    1    1    1     1         1

Added functionality to remove ngrams that contain stopwords (see above example). To remove stop words and then create ngrams from what is left, see Details in ?dfm where an example is provided.


kbenoit commented on June 4, 2024

I reopened the issue since my "fix" for removing stopwords once ngrams have been formed turns out to be impossibly slow. For example,

dfm(inaugTexts, ngrams=2, ignoredFeatures=stopwords('english'))

grinds to an ugly halt.


kbenoit commented on June 4, 2024

Performance issue fixed. Thanks for the help from the geniuses at StackOverflow.
