
Comments (4)

koheiw commented on July 19, 2024

Using tokens_compound(), we can achieve this without changing the C++ code! I did not like how tokens_replace() works with dictionaries, but this is useful behavior. We could also replace tokens_lookup(exclusive = FALSE) with this.

require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_replace2 <- function(x, pattern, replacement = NULL,
                            concatenator = "_", add_key = FALSE) {
    
    # compound matched phrases first, then map each compounded type
    # back to its dictionary key
    x <- tokens_compound(x, pattern, concatenator = concatenator, join = FALSE)
    fixed <- unlist(object2fixed(pattern, types(x), concatenator = concatenator))
    if (add_key) {
        rep <- paste0(fixed, "/", names(fixed))  # keep token, append key
    } else {
        rep <- names(fixed)  # replace token with key only
    }
    tokens_replace(x, fixed, rep)
}

tokens_replace2(toks, dict2, concatenator = " ")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"       "Countries" "is"        "bordered"  "by"        "the"      
#>  [7] "Oceans"    "and"       "the"       "Oceans"   
#> 
#> d2 :
#>  [1] "The"       "Supreme"   "Court"     "of"        "the"       "Countries"
#>  [7] "is"        "seldom"    "in"        "a"         "united"    "state"
tokens_replace2(toks, dict2, concatenator = " ", add_key = TRUE)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

Created on 2023-12-08 with reprex v2.0.2


kbenoit commented on July 19, 2024

What if we just added an argument to tokens_lookup() that kept the matched token and appended the dictionary key?
This would allow lookups to function as a way of annotating tokens more generally, and keep it all within one function.

For instance:

tokens_lookup(toks, dict, keep_tokens = TRUE, concatenator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"


koheiw commented on July 19, 2024

I wanted to simplify tokens_lookup(), but it seems easiest to do it there. I have tentatively added append and separator. If we want to add only one argument, separator = NULL could mean "do not append".

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic_Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific_Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United_States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "+")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States+Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic_Ocean+Oceans"   "and"                    
#>  [9] "the"                     "Pacific_Ocean+Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United_States+Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

The concatenator for phrases is taken from the meta field of the tokens object, which means users should specify the concatenator upstream. For this reason, I would prefer adding concatenator = "_" to tokens() so that tokens_compound(), tokens_ngrams(), and tokens_lookup() all use the same concatenator.

quanteda/R/tokens_lookup.R, lines 164 to 167 in c597cb0:

if (append) {
    # rebuild each matched phrase from its token ids, joined by the
    # concatenator stored in the object's meta field, then append the key
    fixed <- sapply(ids, function(x, y) paste(type[x], collapse = y),
                    field_object(attrs, "concatenator"))
    key <- paste0(fixed, separator, names(fixed))
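
For illustration, a minimal sketch of what that proposal could look like (hypothetical: a concatenator argument to tokens() is only being proposed here; append and separator are the tentative arguments above):

# Hypothetical interface: the concatenator would be set once in tokens()
# and stored in the meta field for all downstream functions to reuse.
toks <- tokens(txt, remove_punct = TRUE, concatenator = "_")
tokens_compound(toks, dict)    # compounds matched phrases with "_"
tokens_ngrams(toks, n = 2)     # forms ngrams such as "United_States"
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")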


kbenoit commented on July 19, 2024

I can see good reasons not to force the same concatenator. What if we wanted to compound tokens that had a POS tag denoted by a "/" separator? That would produce something like "capital/ADJ_gains/NOUN_tax/NOUN".
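
A minimal sketch of that scenario, with hypothetical pre-tagged tokens, where the "/" inside each token must stay distinct from the "_" used for compounding:

# hypothetical POS-tagged tokens in which "/" already separates token and tag
toks_pos <- as.tokens(list(d1 = c("capital/ADJ", "gains/NOUN", "tax/NOUN")))
tokens_compound(toks_pos, phrase("capital/ADJ gains/NOUN tax/NOUN"),
                concatenator = "_")
# should yield a single token "capital/ADJ_gains/NOUN_tax/NOUN"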

We use concatenator in tokens_compound() and in as.tokens.spacyr_parsed(). Since we are concatenating the dictionary key, I think we should use that argument name instead of separator.

Also, it's conceivable that we would later have functions that append other info to a token; what if we called this append_key rather than append?

