Comments (3)
Makes sense to me!
from quanteda.
We tried to make verbose messages more consistent using message_select(). We could do something similar across methods for corpus, tokens and dfm objects.
Lines 56 to 81 in 84ecce8
However, it is not easy to provide detailed information on operations that are performed in C++. For example, the message below reports removing more features than actually exist. The 20,180 it reports is only the number of possible token sequences that the pattern could match (and so remove); the actual number of tokens removed (4,584) is known only after the operation.
require(quanteda)
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
length(types(toks))
#> [1] 10090
toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE)
#> removed 20,180 features
#>
length(types(toks2))
#> [1] 9942
sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 4584
Further, the repeated use of types() is not a good idea because it triggers recompilation of tokens_xptr, reducing the new objects' performance gain.
The best approach would be to simplify the message so that it includes only the number of documents (and/or tokens), the type of operation (remove/keep, lookup, ngrams etc.) and, maybe, a main parameter (e.g. pattern, dictionary, n). This would also make it easier to create a one-size-fits-all messaging function.
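A minimal sketch of what such a simplified message could look like. The function name message_operation() and its arguments are hypothetical and not part of quanteda; the counts in the call are illustrative values passed in by hand, so the sketch runs in base R without touching types().

```r
# Hypothetical helper, not part of quanteda: builds a one-line summary from
# counts that are cheap to obtain (documents, tokens) plus one main parameter.
message_operation <- function(operation, ndoc, ntoken = NULL, param = NULL) {
    size <- sprintf("%s documents", format(ndoc, big.mark = ","))
    if (!is.null(ntoken))
        size <- sprintf("%s tokens in %s", format(ntoken, big.mark = ","), size)
    msg <- sprintf("%s: %s", operation, size)
    if (!is.null(param))
        msg <- sprintf("%s (%s)", msg, param)
    message(msg)    # goes to stderr, so it can be silenced with suppressMessages()
    invisible(msg)
}

# Illustrative values only
message_operation("tokens_remove()", ndoc = 59L, ntoken = 151442L,
                  param = "pattern = \"a *\"")
```

Formatting each number separately with format() keeps the big marks out of the operation name and the parameter text.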
How about making message_tokens() and message_dfm() that can be used in all the methods?
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
stats_tokens <- function(x) {
    list(ndoc = ndoc(x),
         ntoken = sum(ntoken(x, remove_padding = TRUE)))
}
message_tokens <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d tokens (%d documents) to %d tokens (%d documents)",
                   operation, pre$ntoken, pre$ndoc, post$ntoken, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}
stats_dfm <- function(x) {
    list(ndoc = ndoc(x),
         nfeat = nfeat(dfm_remove(x, "")))
}
message_dfm <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d features (%d documents) to %d features (%d documents)",
                   operation, pre$nfeat, pre$ndoc, post$nfeat, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}
before <- stats_tokens(toks)
toks <- tokens_remove(toks, stopwords())
after <- stats_tokens(toks)
message_tokens("tokens_remove()", before, after)
#> tokens_remove(): from 151,442 tokens (59 documents) to 79,535 tokens (59 documents)
before <- stats_tokens(toks)
toks <- tokens_subset(toks, Year > 2000)
after <- stats_tokens(toks)
message_tokens("tokens_subset()", before, after)
#> tokens_subset(): from 79,535 tokens (59 documents) to 7,459 tokens (6 documents)
dfmt <- dfm(toks)
before <- stats_dfm(dfmt)
dfmt <- dfm_trim(dfmt, min_termfreq = 10)
after <- stats_dfm(dfmt)
message_dfm("dfm_trim()", before, after)
#> dfm_trim(): from 2,185 features (6 documents) to 104 features (6 documents)
Created on 2024-01-12 with reprex v2.0.2
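The before/after pattern repeated above could perhaps be folded into one wrapper. The following is only a sketch: with_message(), stats_list() and drop_words() are hypothetical names, and a plain list of character vectors stands in for a tokens object so that the block runs in base R without quanteda.

```r
# Hypothetical wrapper, not part of quanteda: runs `fun` on `x` and, when
# verbose = TRUE, reports the size of the object before and after.
with_message <- function(x, fun, ..., stats, verbose = TRUE) {
    operation <- paste0(deparse(substitute(fun)), "()")
    if (verbose) before <- stats(x)    # skip the counting when not verbose
    result <- fun(x, ...)
    if (verbose) {
        after <- stats(result)
        message(sprintf("%s: from %s tokens (%s documents) to %s tokens (%s documents)",
                        operation,
                        format(before$ntoken, big.mark = ","),
                        format(before$ndoc, big.mark = ","),
                        format(after$ntoken, big.mark = ","),
                        format(after$ndoc, big.mark = ",")))
    }
    result
}

# Stand-ins (assumptions for illustration): a list of character vectors plays
# the role of a tokens object, and drop_words() the role of tokens_remove().
stats_list <- function(x) list(ndoc = length(x), ntoken = sum(lengths(x)))
drop_words <- function(x, words) lapply(x, function(doc) doc[!doc %in% words])

toks_demo <- list(d1 = c("a", "b", "a"), d2 = c("a", "c"))
toks_demo2 <- with_message(toks_demo, drop_words, "a", stats = stats_list)
#> drop_words(): from 5 tokens (2 documents) to 2 tokens (2 documents)
```

With quanteda loaded, the same wrapper would presumably be called with stats_tokens() as the stats argument and a function such as tokens_remove as fun; inside the package the statistics would more likely be gathered within each method behind its verbose argument.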