Comments (4)
Using tokens_compound(), we can achieve this without changing the C++ code! I did not like how tokens_replace() works with a dictionary, but this is useful behavior. We could also substitute tokens_lookup(exclusive = FALSE) with this.
require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)
dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)
tokens_replace2 <- function(x, pattern, replacement = NULL,
                            concatenator = "_", add_key = FALSE) {
    # compound the dictionary matches first, then map each compounded
    # phrase to its dictionary key
    x <- tokens_compound(x, pattern, concatenator = concatenator, join = FALSE)
    fixed <- unlist(object2fixed(pattern, types(x), concatenator = concatenator))
    if (add_key) {
        rep <- paste0(fixed, "/", names(fixed)) # keep the phrase, append its key
    } else {
        rep <- names(fixed) # replace the phrase with its key
    }
    tokens_replace(x, fixed, rep)
}
tokens_replace2(toks, dict2, concatenator = " ")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "Countries" "is" "bordered" "by" "the"
#> [7] "Oceans" "and" "the" "Oceans"
#>
#> d2 :
#> [1] "The" "Supreme" "Court" "of" "the" "Countries"
#> [7] "is" "seldom" "in" "a" "united" "state"
tokens_replace2(toks, dict2, concatenator = " ", add_key = TRUE)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United States/Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic Ocean/Oceans" "and"
#> [9] "the" "Pacific Ocean/Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United States/Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
Created on 2023-12-08 with reprex v2.0.2
What if we just added an argument to tokens_lookup() that kept the matched token and appended the dictionary key? This could allow lookups to function as a way of annotating tokens more generally, and keep it all within one function. For instance:
tokens_lookup(toks, dict, keep_tokens = TRUE, concatenator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United States/Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic Ocean/Oceans" "and"
#> [9] "the" "Pacific Ocean/Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United States/Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
I wanted to simplify tokens_lookup(), but it seems easiest to do it there. I have tentatively added append and separator. If we want to add only one argument, separator = NULL could mean do not append (a sketch of that design follows the examples below).
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)
dict <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                        Oceans = c("* Ocean")), tolower = FALSE)
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United_States/Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic_Ocean/Oceans" "and"
#> [9] "the" "Pacific_Ocean/Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United_States/Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "+")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "The" "United_States+Countries"
#> [3] "is" "bordered"
#> [5] "by" "the"
#> [7] "Atlantic_Ocean+Oceans" "and"
#> [9] "the" "Pacific_Ocean+Oceans"
#>
#> d2 :
#> [1] "The" "Supreme"
#> [3] "Court" "of"
#> [5] "the" "United_States+Countries"
#> [7] "is" "seldom"
#> [9] "in" "a"
#> [11] "united" "state"
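For comparison, the single-argument design mentioned above might look like this. This is only a hypothetical sketch: separator = NULL is not implemented in the snippet above, and the expected behavior is an assumption.
tokens_lookup(toks, dict, exclusive = FALSE, separator = NULL)
# hypothetical: NULL would mean "do not append", so matched phrases would
# presumably be replaced by their keys alone, as with plain exclusive = FALSE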
The concatenator for phrases is taken from the meta field of the tokens object, so users should specify the concatenator upstream. For this reason, I prefer adding concatenator = "_" to tokens(), so that tokens_compound(), tokens_ngrams(), and tokens_lookup() all use the same concatenator.
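A rough sketch of how that proposal could look in use, assuming tokens() gained a concatenator argument whose value downstream functions read from the object's meta (the argument is hypothetical here):
# hypothetical: set the concatenator once, inherit it downstream
toks2 <- tokens(txt, remove_punct = TRUE, concatenator = "+")
tokens_compound(toks2, dict)
# expected compounds like "United+States"
tokens_lookup(toks2, dict, exclusive = FALSE, append = TRUE, separator = "/")
# expected tokens like "United+States/Countries"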
I can see good reasons not to force the same concatenator. What if we wanted to concatenate tokens that had a POS tag denoted by a "/" separator? This would be something like "capital/ADJ_gains/NOUN_tax/NOUN".
We use concatenator in tokens_compound() and in as.tokens.spacyr_parsed(). Since we are concatenating the dictionary key, I think we should use that argument name instead of separator.
Also, it's conceivable that we would later have functions that append other information to a token; what if we call this append_key rather than append?
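A minimal sketch of that POS-tag scenario, building the tagged tokens by hand rather than via spacyr so it runs standalone (the tag format is an assumption):
require(quanteda)
# tokens whose types already use "/" as the POS-tag separator
toks_pos <- as.tokens(list(d1 = c("the", "capital/ADJ", "gains/NOUN",
                                  "tax/NOUN", "rose/VERB")))
# compounding with "_" keeps "/" free to mean "tag separator"
tokens_compound(toks_pos, phrase("capital/ADJ gains/NOUN tax/NOUN"),
                concatenator = "_")
# expected: "the" "capital/ADJ_gains/NOUN_tax/NOUN" "rose/VERB"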