How about simplifying ntoken.character() and ntoken.corpus() using stri_count_boundaries(), to avoid on-the-fly tokenization? It gives only a rough estimate (which should be enough for text cleaning), but it is about 10 times faster.
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(stringi)
#> Loading required package: stringi
txt <- data_corpus_inaugural[]
ntoken2 <- function(x) {
  structure(
    stri_count_boundaries(x, type = "word", skip_word_none = TRUE),
    names = names(x)
  )
}
ntoken(txt)
#> 1789-Washington 1793-Washington 1797-Adams 1801-Jefferson 1805-Jefferson
#> 1537 147 2577 1923 2380
#> 1809-Madison 1813-Madison 1817-Monroe 1821-Monroe 1825-Adams
#> 1261 1302 3677 4886 3147
#> 1829-Jackson 1833-Jackson 1837-VanBuren 1841-Harrison 1845-Polk
#> 1208 1267 4158 9123 5186
#> 1849-Taylor 1853-Pierce 1857-Buchanan 1861-Lincoln 1865-Lincoln
#> 1178 3636 3083 3999 775
#> 1869-Grant 1873-Grant 1877-Hayes 1881-Garfield 1885-Cleveland
#> 1229 1472 2707 3209 1816
#> 1889-Harrison 1893-Cleveland 1897-McKinley 1901-McKinley 1905-Roosevelt
#> 4721 2125 4353 2437 1079
#> 1909-Taft 1913-Wilson 1917-Wilson 1921-Harding 1925-Coolidge
#> 5821 1882 1652 3719 4440
#> 1929-Hoover 1933-Roosevelt 1937-Roosevelt 1941-Roosevelt 1945-Roosevelt
#> 3860 2057 1989 1519 633
#> 1949-Truman 1953-Eisenhower 1957-Eisenhower 1961-Kennedy 1965-Johnson
#> 2504 2743 1907 1541 1710
#> 1969-Nixon 1973-Nixon 1977-Carter 1981-Reagan 1985-Reagan
#> 2416 1995 1369 2780 2909
#> 1989-Bush 1993-Clinton 1997-Clinton 2001-Bush 2005-Bush
#> 2673 1833 2436 1806 2312
#> 2009-Obama 2013-Obama 2017-Trump 2021-Biden
#> 2689 2317 1660 2766
ntoken2(txt)
#> 1789-Washington 1793-Washington 1797-Adams 1801-Jefferson 1805-Jefferson
#> 1431 135 2321 1730 2166
#> 1809-Madison 1813-Madison 1817-Monroe 1821-Monroe 1825-Adams
#> 1177 1211 3378 4476 2916
#> 1829-Jackson 1833-Jackson 1837-VanBuren 1841-Harrison 1845-Polk
#> 1128 1177 3844 8463 4812
#> 1849-Taylor 1853-Pierce 1857-Buchanan 1861-Lincoln 1865-Lincoln
#> 1091 3341 2834 3639 701
#> 1869-Grant 1873-Grant 1877-Hayes 1881-Garfield 1885-Cleveland
#> 1131 1340 2491 2989 1687
#> 1889-Harrison 1893-Cleveland 1897-McKinley 1901-McKinley 1905-Roosevelt
#> 4397 2019 3976 2218 989
#> 1909-Taft 1913-Wilson 1917-Wilson 1921-Harding 1925-Coolidge
#> 5440 1706 1529 3338 4057
#> 1929-Hoover 1933-Roosevelt 1937-Roosevelt 1941-Roosevelt 1945-Roosevelt
#> 3573 1881 1823 1345 559
#> 1949-Truman 1953-Eisenhower 1957-Eisenhower 1961-Kennedy 1965-Johnson
#> 2281 2458 1660 1367 1490
#> 1969-Nixon 1973-Nixon 1977-Carter 1981-Reagan 1985-Reagan
#> 2124 1805 1226 2436 2571
#> 1989-Bush 1993-Clinton 1997-Clinton 2001-Bush 2005-Bush
#> 2318 1600 2158 1585 2073
#> 2009-Obama 2013-Obama 2017-Trump 2021-Biden
#> 2399 2106 1446 2377
microbenchmark::microbenchmark(
ntoken(txt),
ntoken2(txt)
)
#> Unit: milliseconds
#>          expr      min        lq      mean    median        uq      max neval
#>   ntoken(txt) 103.4246 107.96365 112.80753 109.53810 111.63860 200.1408   100
#>  ntoken2(txt)  10.5555  10.82705  11.55977  11.09695  11.66865  27.2980   100
Created on 2023-12-29 with reprex v2.0.2
from quanteda.
OK, I'm happy to deprecate that function (and make it internal), since it generates tokens on the fly.

The only other char_*() function that generates tokens on the fly is char_trim(), but I'm in favour of keeping that one because it is useful for cleaning up texts that contain very short sentences, often the page cruft left by PDF conversion. I have used it myself and in teaching examples. It is useful for cleaning up a corpus before a definitive version is saved for subsequent tokenisation and downstream processing, and it only has to be run once. We could add a note suggesting that its use be limited to cleaning. What do you think?
That's a GREAT idea: it keeps the functionality intact while cutting execution time by over 90%. And because stringi uses ICU boundary analysis, it also works fine on languages without whitespace delimiters between tokens.
library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
# Three Japanese sentences, with no whitespace between tokens: an ad for a
# step exerciser, a short filler sentence, and a note on its compact size
corp <- corpus("ジョギングやウォーキングよりも優しく足腰をしっかり強化する「健康ステッパー ナイスデイ 」。
この短い文を作成しました。
コンパクト・軽量なので座布団1枚分のスペースがあればどこでもお使い頂けます。")
corpus_reshape(corp) |>
ntoken()
#> text1.1 text1.2 text1.3
#> 19 8 21
# works
corpus_trim(corp, min_ntoken = 10) |>
corpus_reshape() |>
ntoken()
#> text1.1 text1.2
#> 19 21
Created on 2024-01-08 with reprex v2.0.2