Comments (14)

stefan-mueller commented on July 20, 2024

I like the performance vignette and made some adjustments in the branch. I also have a few suggestions for @koheiw.

  1. Could you add one or two sentences to the first paragraph describing in more detail what tokens_xptr does behind the scenes?
  2. Can you recommend when users should (not) convert the object to a tokens_xptr object? In other words, who should rewrite and update their existing code?
  3. It would be good if you could briefly interpret the results from all microbenchmark() calls.

from quanteda.

koheiw commented on July 20, 2024

What corpus do you want to use? v4 is up to 6x faster than v3, but only on a large corpus (one million unique types and half a billion tokens). What is the standard corpus in data science?

kbenoit commented on July 20, 2024

How long is a piece of string?? Maybe we could take some well-used Kaggle dataset as an example.

I would think, however, that we would want a 3x3 table ranging from short documents and small n up to long documents and large n, showing the differences in each of the 9 cells. The LMRD would fall toward the middle on both dimensions.

koheiw commented on July 20, 2024

With peanut-sized corpora like those in quanteda.corpora, there is no measurable difference between v3 and v4. If the corpus is large, we can subset it. The corpus should contain many double-byte characters. Many short documents are usually tougher than a few longer documents if the total number of words is the same.

kbenoit commented on July 20, 2024

🥜🥜🥜

koheiw commented on July 20, 2024

Thank you. Updated accordingly. Please see benchmarking.Rmd and tokens.Rmd.

stefan-mueller commented on July 20, 2024

Thanks! I made a few changes in #2255. I suggest we adjust the remaining vignettes in this branch + PR (see #2256), and merge it to main and rebuild the website when we're happy with all revisions.
I will get started with revising the remaining vignettes this week.

kbenoit commented on July 20, 2024

@stefan-mueller that's a good idea, but previewing it on the website (quanteda.io) means we would have to overwrite the existing v3.3 pkgdown site with the 4.0 version. We could decide to do that, but then the main documentation site would be for the dev version rather than the CRAN version.

stefan-mueller commented on July 20, 2024

I'm fine with either option. I can also update the existing vignettes (see #2256) in a separate branch relying on v3.3, and we can add the performance and tokens vignettes after v4.0 is on CRAN. What do you prefer, @koheiw?

koheiw commented on July 20, 2024

I think it is OK to update the pkgdown website, because you are updating the code in a way that works with v3.3 too. The new pages will be about v4.0, but that is stated clearly. I want people to use v4.0 and find bugs.

A greater problem is that #2251 is not merged to master yet.

kbenoit commented on July 20, 2024

I'm ok with this too.

stefan-mueller commented on July 20, 2024

If you agree to name the object tokens_xptr, it would be good if @koheiw could explain in the two vignettes and the documentation why we called it xptr, where the name comes from, and what it stands for.

As mentioned by @kbenoit, xptr is less intuitive than the names of other objects and functions, and some clarification would be helpful.

koheiw commented on July 20, 2024

I updated tokens.Rmd. Thanks for the suggestion.

AdaemmerP commented on July 20, 2024

Thanks for the great package and exciting new version! Do you also plan to run benchmarks against other R packages (e.g. tidytext, spacyr, ...) and Python packages (e.g. Gensim)?
