To compare v4 to v3 performance, and to compare to a few other packages.

Add a performance vignette about quanteda HOT 14 CLOSED

kbenoit commented on July 20, 2024

Add a performance vignette

from quanteda.

Comments (14)

stefan-mueller commented on July 20, 2024

I like the performance vignette and made some adjustments in the branch. I also have a few suggestions for @koheiw.

Could you add one or two more sentences to the first paragraph describing in more detail what tokens_xptr does behind the scenes?
Can you recommend when users should (or should not) convert their object to a tokens_xptr object? In other words, who should rewrite and update their existing code?
It would be good if you could briefly interpret the results from all microbenchmark() calls.
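
To make the last two points concrete, here is a minimal sketch of the kind of comparison in question. It assumes quanteda 4.0 or later and the microbenchmark package are installed; the choice of data_corpus_inaugural and of the tokens_remove() plus dfm() pipeline is mine for illustration, not necessarily what the vignette uses.

```r
# Assumes quanteda >= 4.0 and microbenchmark are installed.
library(quanteda)
library(microbenchmark)

# Small built-in corpus, chosen only so that the sketch runs quickly.
toks <- tokens(data_corpus_inaugural)

bench <- microbenchmark(
    # Regular tokens object: tokens_remove() returns a modified copy.
    tokens = dfm(tokens_remove(toks, stopwords("en"))),
    # as.tokens_xptr() builds a new external-pointer object on every run,
    # so the in-place modification by tokens_remove() never touches `toks`.
    tokens_xptr = dfm(tokens_remove(as.tokens_xptr(toks), stopwords("en"))),
    times = 10
)
print(bench)
```

On a corpus this small the two rows will probably look about the same, so any interpretation in the vignette should say what corpus size the timings refer to.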

koheiw commented on July 20, 2024

What corpus do you want to use? v4 is up to 6x faster than v3 only on a large corpus (one million unique types and half a billion tokens). What is the standard corpus in data science?

kbenoit commented on July 20, 2024

How long is a piece of string?? Maybe we could take some well-used Kaggle dataset as an example.

I would think however we would want to have some 3x3 table of short documents, small n, up to long documents, long n, and show the differences in each of the 9 cells. The LMRD would be toward the middle of both.
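
A rough sketch of such a 3x3 design, assuming synthetic text is acceptable for a first pass. The vocabulary, the grid sizes, and the helper make_texts() are hypothetical, chosen only to keep the run short; to compare v3 and v4, the same script would be run once under each installed version.

```r
library(quanteda)

set.seed(1)
# Hypothetical vocabulary of made-up word types.
vocab <- paste0("w", seq_len(5000))

# Build n_docs synthetic documents of doc_len words each.
make_texts <- function(n_docs, doc_len) {
    vapply(seq_len(n_docs), function(i) {
        paste(sample(vocab, doc_len, replace = TRUE), collapse = " ")
    }, character(1))
}

# All nine combinations of corpus size and document length.
grid <- expand.grid(n_docs = c(10, 100, 1000), doc_len = c(10, 100, 1000))

# Elapsed time of tokens() for each cell of the grid.
grid$seconds <- mapply(function(n, len) {
    txt <- make_texts(n, len)
    system.time(tokens(txt))[["elapsed"]]
}, grid$n_docs, grid$doc_len)

# Rows: number of documents; columns: document length in words.
print(xtabs(seconds ~ n_docs + doc_len, data = grid))
```

Single system.time() calls are noisy, so a real vignette would repeat each cell several times, for example with microbenchmark().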

koheiw commented on July 20, 2024

With peanut-sized corpora like those in quanteda.corpora, there is no measurable difference between v3 and v4. If the corpus is large, we can subset it. The corpus should contain many double-byte characters. Many short documents are usually tougher than a few longer documents if the total number of words is the same.

kbenoit commented on July 20, 2024

🥜🥜🥜

koheiw commented on July 20, 2024

Thank you. Updated accordingly. Please see benchmarking.Rmd and tokens.Rmd.

stefan-mueller commented on July 20, 2024

Thanks! I made a few changes in #2255. I suggest we adjust the remaining vignettes in this branch + PR (see #2256), and merge it to main and rebuild the website when we're happy with all revisions.
I will get started with revising the remaining vignettes this week.

kbenoit commented on July 20, 2024

@stefan-mueller that's a good idea but to preview it on the website (quanteda.io) means we would have to overwrite the existing v3.3 of the pkgdown site with the 4.0 version. We could make a decision to do that, but then the main documentation site is for the dev rather than the CRAN version.

stefan-mueller commented on July 20, 2024

I'm fine with either option. I can also update the existing vignettes (see #2256) in a separate branch relying on v3.3, and we add the performance and tokens vignettes after v4.0 is on CRAN. What do you prefer, @koheiw?

koheiw commented on July 20, 2024

I think it is OK to update the pkgdown website, because you are updating the code in a way that works with v3.3 too. The new pages will be about v4.0, but they say so clearly. I want people to use v4.0 and find bugs.

A greater problem is that #2251 is not merged to master yet.

kbenoit commented on July 20, 2024

I'm ok with this too.

stefan-mueller commented on July 20, 2024

If you agree to name the object tokens_xptr, it would be good if @koheiw could explain in the two vignettes and the documentation why we called it xptr, where the name comes from, and what it stands for.

As mentioned by @kbenoit, xptr is less intuitive than the names of other objects and functions, and some clarification would be helpful.

koheiw commented on July 20, 2024

I updated tokens.Rmd. Thanks for the suggestion.

AdaemmerP commented on July 20, 2024

Thanks for the great package and exciting new version! Do you also plan to run benchmarks against other R packages (e.g., tidytext, spacyr, ...) and Python packages (e.g., Gensim)?
