
Comments (3)

kbenoit commented on July 1, 2024

Makes sense to me!


koheiw commented on July 1, 2024

We tried to make verbose messages more consistent using message_select(). We could do something similar across methods for corpus, tokens and dfm objects.

quanteda/R/message.R

Lines 56 to 81 in 84ecce8

message_select <- function(selection, nfeats, ndocs, nfeatspad = 0, ndocspad = 0) {
    catm(if (selection == "keep") "kept" else "removed", " ",
         format(nfeats, big.mark = ",", scientific = FALSE),
         " feature", if (nfeats != 1L) "s" else "", sep = "")
    if (ndocs > 0) {
        catm(" and ",
             format(ndocs, big.mark = ",", scientific = FALSE),
             " document", if (ndocs != 1L) "s" else "",
             sep = "")
    }
    if ((nfeatspad + ndocspad) > 0) {
        catm(", padded ", sep = "")
    }
    if (nfeatspad > 0) {
        catm(format(nfeatspad, big.mark = ",", scientific = FALSE),
             " feature", if (nfeatspad != 1L) "s" else "",
             sep = "")
    }
    if (ndocspad > 0) {
        if (nfeatspad > 0) catm(" and ", sep = "")
        catm(format(ndocspad, big.mark = ",", scientific = FALSE),
             " document", if (ndocspad != 1L) "s" else "",
             sep = "")
    }
    catm("", appendLF = TRUE)
}
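
For reference, calling it directly produces messages like the ones below (a quick sketch; catm() is quanteda's internal print helper, so a simple stand-in that behaves like cat() is assumed here):

catm <- function(..., sep = " ", appendLF = FALSE) {
    # stand-in for quanteda's internal catm(); assumed to behave like cat()
    cat(..., sep = sep)
    if (appendLF) cat("\n")
}

message_select("remove", 1234, 2)
#> removed 1,234 features and 2 documents
message_select("keep", 1, 0, nfeatspad = 10)
#> kept 1 feature, padded 10 features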

However, it is not easy to provide detailed information about operations that are performed in C++. For example, the message below reports removing more features than actually exist: 20,180 is only the number of token sequences that the pattern could possibly match (and remove), so the actual number of tokens removed (4,584) is known only after the operation has run.

require(quanteda)
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
length(types(toks))
#> [1] 10090

toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE)
#> removed 20,180 features
#> 
length(types(toks2))
#> [1] 9942

sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 4584

Further, repeatedly calling types() is not a good idea because it triggers recompilation of tokens_xptr objects, which reduces the performance gain of the new objects.

The best approach would be to simplify the message so that it includes only the number of documents (and/or tokens), the type of operation (remove/keep, lookup, ngrams, etc.) and, perhaps, one main parameter (e.g. pattern, dictionary, n). This would also make it easier to write a single fit-for-all messaging function.
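
For example, a simplified, fit-for-all message along those lines might look like this (a sketch only; message_verbose() is a hypothetical name, not an existing quanteda function):

message_verbose <- function(operation, param, ndoc, ntok) {
    # hypothetical helper: report only the operation, one main parameter,
    # and the document/token counts
    message(sprintf("%s(%s): %s documents, %s tokens",
                    operation, param,
                    prettyNum(ndoc, big.mark = ","),
                    prettyNum(ntok, big.mark = ",")))
}

message_verbose("tokens_remove", "pattern = stopwords()", 59, 79535)
#> tokens_remove(pattern = stopwords()): 59 documents, 79,535 tokens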


koheiw commented on July 1, 2024

How about creating message_tokens() and message_dfm() helpers that can be used across all the methods?

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)

stats_tokens <- function(x) {
    list(ndoc = ndoc(x),
         ntoken = sum(ntoken(x, remove_padding = TRUE)))
}

message_tokens <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d tokens (%d documents) to %d tokens (%d documents)",
                   operation, pre$ntoken, pre$ndoc, post$ntoken, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}

stats_dfm <- function(x) {
    list(ndoc = ndoc(x),
         nfeat = nfeat(dfm_remove(x, "")))
}

message_dfm <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d features (%d documents) to %d features (%d documents)",
                   operation, pre$nfeat, pre$ndoc, post$nfeat, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}

before <- stats_tokens(toks)
toks <- tokens_remove(toks, stopwords())
after <- stats_tokens(toks)
message_tokens("tokens_remove()", before, after)
#> tokens_remove(): from 151,442 tokens (59 documents) to 79,535 tokens (59 documents)

before <- stats_tokens(toks)
toks <- tokens_subset(toks, Year > 2000)
after <- stats_tokens(toks)
message_tokens("tokens_subset()", before, after)
#> tokens_subset(): from 79,535 tokens (59 documents) to 7,459 tokens (6 documents)

dfmt <- dfm(toks)

before <- stats_dfm(dfmt)
dfmt <- dfm_trim(dfmt, min_termfreq = 10)
after <- stats_dfm(dfmt)
message_dfm("dfm_trim()", before, after)
#> dfm_trim(): from 2,185 features (6 documents) to 104 features (6 documents)

Created on 2024-01-12 with reprex v2.0.2
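
If this direction works, the pre/post pattern could also be wrapped so that each verbose method needs only one call (a sketch building on the stats_tokens()/message_tokens() helpers above; with_message_tokens() is a hypothetical name, not an existing quanteda function):

with_message_tokens <- function(operation, x, fun, ..., verbose = TRUE) {
    # hypothetical wrapper: record counts before and after an operation
    # using stats_tokens(), then print one summary line via message_tokens()
    before <- stats_tokens(x)
    result <- fun(x, ...)
    if (verbose)
        message_tokens(operation, before, stats_tokens(result))
    result
}

toks2 <- with_message_tokens("tokens_remove()", toks, tokens_remove, stopwords())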

