Comments (4)
Using tokens_compound(), we can achieve this without changing the C++ code! I did not like how tokens_replace() works with a dictionary, but this is useful behavior. We could also substitute tokens_lookup(exclusive = FALSE) with this.
require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)
dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)
tokens_replace2 <- function(x, pattern, replacement = NULL,
                            concatenator = "_", add_key = FALSE) {
    # compound the dictionary matches first, then map each compounded
    # phrase to its dictionary key
    x <- tokens_compound(x, pattern, concatenator = concatenator, join = FALSE)
    fixed <- unlist(object2fixed(pattern, types(x), concatenator = concatenator))
    if (add_key) {
        rep <- paste0(fixed, "/", names(fixed)) # keep the phrase, append its key
    } else {
        rep <- names(fixed) # replace the phrase with its key
    }
    tokens_replace(x, fixed, rep)
}
tokens_replace2(toks, dict2, concatenator = " ")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "Countries" "is" "bordered" "by" "the"
#> [7] "Oceans" "and" "the" "Oceans"
#>
#> d2 :
#> [1] "The" "Supreme" "Court" "of" "the" "Countries"
#> [7] "is" "seldom" "in" "a" "united" "state"
tokens_replace2(toks, dict2, concatenator = " ", add_key = TRUE)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United States/Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic Ocean/Oceans" "and"
#> [9] "the" "Pacific Ocean/Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United States/Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
Created on 2023-12-08 with reprex v2.0.2
What if we just added an argument to tokens_lookup() that kept the matched token and appended the dictionary key? This could allow lookups to function as a way of annotating tokens more generally, and keep it all within one function. For instance:
tokens_lookup(toks, dict, keep_tokens = TRUE, concatenator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United States/Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic Ocean/Oceans" "and"
#> [9] "the" "Pacific Ocean/Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United States/Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
I wanted to simplify tokens_lookup(), but it seems easiest to do it there. I have tentatively added append and separator. If we want to add only one argument, separator = NULL could mean do not append (a sketch of that design follows the examples below).
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)
dict <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                        Oceans = c("* Ocean")), tolower = FALSE)
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United_States/Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic_Ocean/Oceans" "and"
#> [9] "the" "Pacific_Ocean/Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United_States/Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "+")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United_States+Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic_Ocean+Oceans" "and"
#> [9] "the" "Pacific_Ocean+Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United_States+Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
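For comparison, the single-argument design mentioned above might look like this. This is only a hypothetical sketch: separator = NULL is not implemented in the snippet above, and the expected behavior is an assumption.
tokens_lookup(toks, dict, exclusive = FALSE, separator = NULL)
# hypothetical: NULL would mean "do not append", so matched phrases would
# presumably be replaced by their keys alone, as with plain exclusive = FALSE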
The concatenator for phrases is taken from the meta field of the tokens object, so users should specify the concatenator upstream. For this reason, I prefer adding concatenator = "_" to tokens(), so that tokens_compound(), tokens_ngrams(), and tokens_lookup() all use the same concatenator.
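A rough sketch of how that proposal could look in use, assuming tokens() gained a concatenator argument whose value downstream functions read from the object's meta (the argument is hypothetical here):
# hypothetical: set the concatenator once, inherit it downstream
toks2 <- tokens(txt, remove_punct = TRUE, concatenator = "+")
tokens_compound(toks2, dict)
# expected compounds like "United+States"
tokens_lookup(toks2, dict, exclusive = FALSE, append = TRUE, separator = "/")
# expected tokens like "United+States/Countries"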
I can see good reasons not to force the same concatenator. What if we wanted to concatenate tokens that had a POS tag denoted by a "/" separator? This would be something like "capital/ADJ_gains/NOUN_tax/NOUN".
We use concatenator in tokens_compound() and in as.tokens.spacyr_parsed(). Since we are concatenating the dictionary key, I think we should use that argument name instead of separator.
Also, it's conceivable that we would later have functions that append other information to a token; what if we call this append_key rather than append?
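A minimal sketch of that POS-tag scenario, building the tagged tokens by hand rather than via spacyr so it runs standalone (the tag format is an assumption):
require(quanteda)
# tokens whose types already use "/" as the POS-tag separator
toks_pos <- as.tokens(list(d1 = c("the", "capital/ADJ", "gains/NOUN",
                                  "tax/NOUN", "rose/VERB")))
# compounding with "_" keeps "/" free to mean "tag separator"
tokens_compound(toks_pos, phrase("capital/ADJ gains/NOUN tax/NOUN"),
                concatenator = "_")
# expected: "the" "capital/ADJ_gains/NOUN_tax/NOUN" "rose/VERB"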