Comments (4)
Makes sense to me!
from quanteda.
We tried to make verbose messages more consistent using message_select(). We could do something similar across methods for corpus, tokens and dfm objects.
Lines 56 to 81 in 84ecce8
However, it is not easy to provide detailed information on the operations that will be performed in C++. For example, the message below reports removing more features than actually exist. The 20,180 in the output is only the number of possible token sequences that the pattern could match (and remove); we know the actual number of tokens removed (4,584) only after the operation.
require(quanteda)
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
length(types(toks))
#> [1] 10090
toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE)
#> removed 20,180 features
#>
length(types(toks2))
#> [1] 9942
sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 4584
Further, the repeated use of types() is not a good idea because it triggers recompilation of tokens_xptr, reducing the new objects' performance gain.
The best approach would be to simplify the message so that it includes only the number of documents (and/or tokens), the type of operation (remove/keep, lookup, ngrams etc.) and, maybe, a main parameter (e.g. pattern, dictionary, n). This would also make it easier to create a one-size-fits-all messaging function.
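The simplified message described above could be sketched roughly as follows. This is only an illustration in base R: format_operation_message() and its arguments are hypothetical names, not part of quanteda, and the counts are taken from the inaugural corpus example in this thread.

```r
# Hypothetical helper, not a quanteda function: builds one uniform message
# from the operation name, two summary counts and an optional main parameter.
format_operation_message <- function(operation, ndoc, ntoken, param = NULL) {
    msg <- sprintf("%s: %s tokens in %s documents",
                   operation,
                   format(ntoken, big.mark = ",", trim = TRUE),
                   format(ndoc, big.mark = ",", trim = TRUE))
    if (!is.null(param)) {
        # `param` is assumed to be a named list with one element, e.g. list(n = 2:3)
        msg <- sprintf("%s (%s = %s)", msg, names(param)[1],
                       paste(as.character(param[[1]]), collapse = ", "))
    }
    msg
}

msg <- format_operation_message("tokens_ngrams()", ndoc = 59L, ntoken = 151442L,
                                param = list(n = 2:3))
message(msg)
#> tokens_ngrams(): 151,442 tokens in 59 documents (n = 2, 3)
```

Because the helper only needs counts that are cheap to obtain, it would not have to call types() at all.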
How about making message_tokens() and message_dfm() that can be used in all the methods?
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
# Summary statistics for a tokens object (padding is not counted)
stats_tokens <- function(x) {
    list(ndoc = ndoc(x),
         ntoken = sum(ntoken(x, remove_padding = TRUE)))
}
# Report how an operation changed the number of tokens and documents
message_tokens <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d tokens (%d documents) to %d tokens (%d documents)",
                   operation, pre$ntoken, pre$ndoc, post$ntoken, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}
# Summary statistics for a dfm (the empty-string feature "" is not counted)
stats_dfm <- function(x) {
    list(ndoc = ndoc(x),
         nfeat = nfeat(dfm_remove(x, "")))
}
# Report how an operation changed the number of features and documents
message_dfm <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d features (%d documents) to %d features (%d documents)",
                   operation, pre$nfeat, pre$ndoc, post$nfeat, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}
before <- stats_tokens(toks)
toks <- tokens_remove(toks, stopwords())
after <- stats_tokens(toks)
message_tokens("tokens_remove()", before, after)
#> tokens_remove(): from 151,442 tokens (59 documents) to 79,535 tokens (59 documents)
before <- stats_tokens(toks)
toks <- tokens_subset(toks, Year > 2000)
after <- stats_tokens(toks)
message_tokens("tokens_subset()", before, after)
#> tokens_subset(): from 79,535 tokens (59 documents) to 7,459 tokens (6 documents)
dfmt <- dfm(toks)
before <- stats_dfm(dfmt)
dfmt <- dfm_trim(dfmt, min_termfreq = 10)
after <- stats_dfm(dfmt)
message_dfm("dfm_trim()", before, after)
#> dfm_trim(): from 2,185 features (6 documents) to 104 features (6 documents)
Created on 2024-01-12 with reprex v2.0.2
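To use the same before/after pattern inside every method, the bookkeeping could be moved into one wrapper that computes the statistics only when verbose = TRUE. The sketch below is an assumption about how that might look, not existing quanteda code: with_stats_message(), toy_stats() and toy_remove() are hypothetical, and a plain named list of character vectors stands in for a tokens object so that the example runs in base R.

```r
# Toy stand-in for a tokens object: a named list of character vectors.
toy <- list(d1 = c("a", "b", "the"), d2 = c("the", "c"))

# Summary statistics for the toy object (similar in spirit to stats_tokens()).
toy_stats <- function(x) {
    list(ndoc = length(x), ntoken = sum(lengths(x)))
}

# Hypothetical wrapper: applies `fun` to `x` and, only when verbose = TRUE,
# computes the before/after statistics and reports them in one message.
with_stats_message <- function(x, operation, fun, stats, ..., verbose = FALSE) {
    before <- if (verbose) stats(x) else NULL
    result <- fun(x, ...)
    if (verbose) {
        after <- stats(result)
        message(sprintf("%s: from %s tokens (%s documents) to %s tokens (%s documents)",
                        operation,
                        format(before$ntoken, big.mark = ",", trim = TRUE),
                        format(before$ndoc, big.mark = ",", trim = TRUE),
                        format(after$ntoken, big.mark = ",", trim = TRUE),
                        format(after$ndoc, big.mark = ",", trim = TRUE)))
    }
    result
}

# Toy operation: drop the listed words from every document.
toy_remove <- function(x, words) {
    lapply(x, function(doc) doc[!doc %in% words])
}

toy2 <- with_stats_message(toy, "toy_remove()", toy_remove, toy_stats,
                           words = "the", verbose = TRUE)
#> toy_remove(): from 5 tokens (2 documents) to 3 tokens (2 documents)
```

With verbose = FALSE the statistics function is never called, so the wrapper adds no counting overhead in the default case.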
These are the functions to upgrade. @kbenoit you are more than welcome to make the changes to the dfm methods. I will do the tokens methods.
tokens
- tokens_chunk
- tokens_compound
- tokens_group
- tokens_lookup
- tokens_ngrams
- tokens_replace
- tokens_restore
- tokens_sample
- tokens_segment
- tokens_select
- tokens_skipgrams
- tokens_split
- tokens_subset
- tokens_tolower
- tokens_toupper
- tokens_wordstem
dfm
- dfm_compress
- dfm_group
- dfm_lookup
- dfm_match
- dfm_replace
- dfm_sample
- dfm_select
- dfm_smooth
- dfm_sort
- dfm_subset
- dfm_tfidf
- dfm_tolower
- dfm_toupper
- dfm_trim
- dfm_weight
- dfm_wordstem