
Comments (3)

koheiw commented on July 19, 2024

How about simplifying ntoken.character() and ntoken.corpus() by using stri_count_boundaries() to avoid on-the-fly tokenization? It only gives a rough estimate (which should be enough for text cleaning), but it is about 10 times faster.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(stringi)
#> Loading required package: stringi

txt <- data_corpus_inaugural[]

ntoken2 <- function(x) {
    structure(
        stri_count_boundaries(x, type = "word", skip_word_none = TRUE),
        names = names(x)
    )
}

ntoken(txt)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
#>            1537             147            2577            1923            2380 
#>    1809-Madison    1813-Madison     1817-Monroe     1821-Monroe      1825-Adams 
#>            1261            1302            3677            4886            3147 
#>    1829-Jackson    1833-Jackson   1837-VanBuren   1841-Harrison       1845-Polk 
#>            1208            1267            4158            9123            5186 
#>     1849-Taylor     1853-Pierce   1857-Buchanan    1861-Lincoln    1865-Lincoln 
#>            1178            3636            3083            3999             775 
#>      1869-Grant      1873-Grant      1877-Hayes   1881-Garfield  1885-Cleveland 
#>            1229            1472            2707            3209            1816 
#>   1889-Harrison  1893-Cleveland   1897-McKinley   1901-McKinley  1905-Roosevelt 
#>            4721            2125            4353            2437            1079 
#>       1909-Taft     1913-Wilson     1917-Wilson    1921-Harding   1925-Coolidge 
#>            5821            1882            1652            3719            4440 
#>     1929-Hoover  1933-Roosevelt  1937-Roosevelt  1941-Roosevelt  1945-Roosevelt 
#>            3860            2057            1989            1519             633 
#>     1949-Truman 1953-Eisenhower 1957-Eisenhower    1961-Kennedy    1965-Johnson 
#>            2504            2743            1907            1541            1710 
#>      1969-Nixon      1973-Nixon     1977-Carter     1981-Reagan     1985-Reagan 
#>            2416            1995            1369            2780            2909 
#>       1989-Bush    1993-Clinton    1997-Clinton       2001-Bush       2005-Bush 
#>            2673            1833            2436            1806            2312 
#>      2009-Obama      2013-Obama      2017-Trump      2021-Biden 
#>            2689            2317            1660            2766
ntoken2(txt)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
#>            1431             135            2321            1730            2166 
#>    1809-Madison    1813-Madison     1817-Monroe     1821-Monroe      1825-Adams 
#>            1177            1211            3378            4476            2916 
#>    1829-Jackson    1833-Jackson   1837-VanBuren   1841-Harrison       1845-Polk 
#>            1128            1177            3844            8463            4812 
#>     1849-Taylor     1853-Pierce   1857-Buchanan    1861-Lincoln    1865-Lincoln 
#>            1091            3341            2834            3639             701 
#>      1869-Grant      1873-Grant      1877-Hayes   1881-Garfield  1885-Cleveland 
#>            1131            1340            2491            2989            1687 
#>   1889-Harrison  1893-Cleveland   1897-McKinley   1901-McKinley  1905-Roosevelt 
#>            4397            2019            3976            2218             989 
#>       1909-Taft     1913-Wilson     1917-Wilson    1921-Harding   1925-Coolidge 
#>            5440            1706            1529            3338            4057 
#>     1929-Hoover  1933-Roosevelt  1937-Roosevelt  1941-Roosevelt  1945-Roosevelt 
#>            3573            1881            1823            1345             559 
#>     1949-Truman 1953-Eisenhower 1957-Eisenhower    1961-Kennedy    1965-Johnson 
#>            2281            2458            1660            1367            1490 
#>      1969-Nixon      1973-Nixon     1977-Carter     1981-Reagan     1985-Reagan 
#>            2124            1805            1226            2436            2571 
#>       1989-Bush    1993-Clinton    1997-Clinton       2001-Bush       2005-Bush 
#>            2318            1600            2158            1585            2073 
#>      2009-Obama      2013-Obama      2017-Trump      2021-Biden 
#>            2399            2106            1446            2377

microbenchmark::microbenchmark(
    ntoken(txt),
    ntoken2(txt)
)
#> Unit: milliseconds
#>          expr      min        lq      mean    median        uq      max neval
#>   ntoken(txt) 103.4246 107.96365 112.80753 109.53810 111.63860 200.1408   100
#>  ntoken2(txt)  10.5555  10.82705  11.55977  11.09695  11.66865  27.2980   100

Created on 2023-12-29 with reprex v2.0.2
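
The gap between the two sets of counts above is most likely down to punctuation: ntoken() keeps punctuation tokens by default, whereas skip_word_none = TRUE skips segments that contain no letters or numbers. A minimal sketch of that effect using stringi alone; the input string and variable names are illustrative assumptions, not part of the original reprex:

```r
library(stringi)

# Illustrative input (an assumption, not taken from the benchmark above)
x <- c(a = "Hello, world!")

# Count only word-like segments; punctuation and spaces are skipped
n_words <- stri_count_boundaries(x, type = "word", skip_word_none = TRUE)

# Count every segment between word boundaries, punctuation and spaces included
n_segments <- stri_count_boundaries(x, type = "word", skip_word_none = FALSE)

n_words               # 2
n_segments > n_words  # TRUE
```

So the boundary-based count should track a punctuation-free token count more closely than the default ntoken() output.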

from quanteda.

kbenoit commented on July 19, 2024

OK, I'm happy to deprecate that function (and make it internal) since it generates tokens on the fly.

The only other char_() function that generates tokens on the fly is char_trim(), but I'm in favour of keeping that one because of its usefulness in cleaning up texts that have very short sentences, often the page cruft from pdf conversion. I have used this not only myself but also in teaching examples. It's useful in cleaning up a corpus before a definitive version can be saved for subsequent tokenisation and downstream processing. But it only has to be run once. We could add a note suggesting that its use be limited to cleaning. What do you think?


kbenoit commented on July 19, 2024

That's a GREAT idea that allows us to keep the functionality intact but with a reduction in execution time of over 90%.

And because stringi relies on ICU word boundaries rather than whitespace, it works fine on languages without whitespace delimiters between tokens.

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

corp <- corpus("ジョギングやウォーキングよりも優しく足腰をしっかり強化する「健康ステッパー ナイスデイ 」。
               この短い文を作成しました。
               コンパクト・軽量なので座布団1枚分のスペースがあればどこでもお使い頂けます。")
corpus_reshape(corp) |>
    ntoken()
#> text1.1 text1.2 text1.3 
#>      19       8      21

# works: the sentence with fewer than 10 tokens is dropped
corpus_trim(corp, min_ntoken = 10) |>
    corpus_reshape() |>
    ntoken()
#> text1.1 text1.2 
#>      19      21

Created on 2024-01-08 with reprex v2.0.2
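
The cleaning use case discussed in this thread (dropping very short sentences such as page cruft) can be sketched with boundary counting alone, without tokenizing. This is only a rough illustration under stated assumptions: trim_short_sentences() is a hypothetical helper, not quanteda's implementation of char_trim() or corpus_trim(), and the sample text is made up.

```r
library(stringi)

# Hypothetical helper (not part of quanteda): keep only sentences that have
# at least `min_words` word-like segments, counted with ICU boundaries.
trim_short_sentences <- function(x, min_words = 3L) {
    # Split a single text into sentences using ICU sentence boundaries
    sentences <- stri_split_boundaries(x, type = "sentence")[[1]]
    sentences <- stri_trim_both(sentences)
    # Count word-like segments per sentence; punctuation and spaces are skipped
    n_words <- stri_count_boundaries(sentences, type = "word",
                                     skip_word_none = TRUE)
    # Re-join the sentences that are long enough
    stri_join(sentences[n_words >= min_words], collapse = " ")
}

# Made-up example text with short cruft before and after the real sentence
page <- "Contents. The committee met on Monday to review the budget. See above."
cleaned <- trim_short_sentences(page, min_words = 3L)
cleaned
```

Because it only counts boundaries, a helper like this gives approximate sentence lengths, which matches the "rough estimate is enough for cleaning" reasoning above.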
