Comments (14)
I like the performance vignette and made some adjustments in the branch. I also have a few suggestions for @koheiw.
- Could you add 1-2 more sentences to the first paragraph describing in more detail what `tokens_xptr` does behind the scenes?
- Can you recommend when users should (not) convert the object to a `tokens_xptr` object? In other words, who should rewrite and update their existing code?
- It would be good if you could briefly interpret the results from all `microbenchmark()` calls.
from quanteda.
What corpus do you want to use? v4 is up to 6x faster than v3 only on a large corpus (one million unique types and half a billion tokens). What is the standard corpus in data science?
How long is a piece of string? Maybe we could take some well-used Kaggle dataset as an example.
I would think, however, that we would want some 3x3 table running from short documents, small n, up to long documents, large n, and show the differences in each of the 9 cells. The LMRD would be toward the middle of both.
With peanut-sized corpora like those in quanteda.corpora, there is no measurable difference between v3 and v4. If the corpus is large, we can subset it. The corpus should contain many double-byte characters. Many short documents are usually tougher than a few longer documents if the total number of words is the same.
Thank you. Updated accordingly. Please see `benchmarking.Rmd` and `tokens.Rmd`.
Thanks! I made a few changes in #2255. I suggest we adjust the remaining vignettes in this branch + PR (see #2256), and merge it to main and rebuild the website when we're happy with all revisions.
I will get started with revising the remaining vignettes this week.
@stefan-mueller that's a good idea but to preview it on the website (quanteda.io) means we would have to overwrite the existing v3.3 of the pkgdown site with the 4.0 version. We could make a decision to do that, but then the main documentation site is for the dev rather than the CRAN version.
I'm fine with either option. I can also update the existing vignettes (see #2256) in a separate branch relying on v3.3, and we add the performance and tokens vignettes after v4.0 is on CRAN. What do you prefer, @koheiw?
I think it is OK to update the pkgdown website, because you are updating the code in a way that works with v3.3 too. The new pages will be about v4.0, but they say so clearly. I want people to use v4.0 and find bugs.
A greater problem is that #2251 is not merged to master yet.
I'm ok with this too.
If you agree to name the object `tokens_xptr`, it would be good if @koheiw could explain in the two vignettes and the documentation why we called it `xptr`, where the name comes from, and what it stands for.
As mentioned by @kbenoit, `xptr` is less intuitive than the names of other objects and functions, and some clarification would be helpful.
I updated `tokens.rmd`. Thanks for the suggestion.
Thanks for the great package and exciting new version! Do you also plan to run benchmarks against other R packages (e.g. tidytext, spacyr, ...) and Python packages (e.g. Gensim)?