Comments (7)
This is exactly how pointers (or pass by reference) work. tokens_xptr functions modify the underlying data in place and return the pointer to the data, like "0x116016470" (a shallow copy). This way, they use much less CPU and RAM.
If you want to keep the original object intact, you should make a deep copy first, as in tokens(as.tokens_xptr(xtoks2), remove_punct = TRUE), or not use tokens_xptr at all.
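A minimal sketch of the deep-copy approach suggested above, assuming quanteda 4.x is installed. The example texts are borrowed from later in this thread, and the object names xtoks and xtoks_clean are hypothetical. as.tokens() is used only to convert back to regular tokens objects for inspection.

```r
library("quanteda")

# A tokens_xptr object: the R object holds a pointer to the token data
xtoks <- tokens(c("A a a, b.", "A b c c c?"), xptr = TRUE)

# as.tokens_xptr() applied to a tokens_xptr object makes a deep copy,
# so the punctuation removal below only touches the copy
xtoks_clean <- tokens(as.tokens_xptr(xtoks), remove_punct = TRUE)

# Convert to regular tokens objects to compare token counts
ntoken(as.tokens(xtoks))        # original, punctuation still present
ntoken(as.tokens(xtoks_clean))  # copy, punctuation removed
```

The cost is one extra copy of the data in memory, which is the trade-off for leaving xtoks untouched.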
from quanteda.
Python's list is also passed by reference: https://stackoverflow.com/questions/8744113/python-list-by-value-not-by-reference
from quanteda.
Yes, and this is one of the fundamental differences between R and other languages such as Python, where lists are passed by reference, and names are bound to external pointers. So if we bind a second name to a first name via =, and modify the second object, the first is also modified. I think we explain this pretty well in the vignette.
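The contrast described here can be reproduced in base R alone. The sketch below is an illustration rather than quanteda code: it uses an environment as a stand-in for a reference-type object (an assumption made for the example), and the names a, b, e1 and e2 are hypothetical.

```r
# Ordinary R objects follow copy-on-modify semantics
a <- list(x = 1)
b <- a        # b initially refers to the same data as a
b$x <- 2      # modifying b creates a copy; a is left alone
a$x           # 1
b$x           # 2

# Environments have reference semantics, like pointers
e1 <- new.env()
e1$x <- 1
e2 <- e1      # a second name bound to the same environment
e2$x <- 2     # modifies the one shared environment
e1$x          # 2, because e1 and e2 are the same object
```

The second half is the behaviour that applies to tokens_xptr objects as described in this thread: binding a new name does not protect the original from later modification.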
What's less clear is the modify-in-place functionality, which is very un-R-like, and could confuse users because of how differently the same tokens_*() functions operate in the absence of any assignment. In my example above, tokens(xtoks2, remove_punct = TRUE) modified xtoks2 and any other shallow copies whose object names point to the same memory address.
Had I used xtoks3 <- tokens(xtoks2, remove_punct = TRUE), it would have a) modified xtoks2 and b) made a shallow copy of the modified object called xtoks3.
This is not something we explained in the vignette, and is a bigger departure from how the package works (and R works in almost every other package) than the binding issue of names to the same memory address. If we had a Python version of quanteda, it probably would look like mytokens.tokens(remove_punct = TRUE) to modify mytokens in place. If we had an R6 version of quanteda, it would work that way too. But to make that shift within the current syntax is going to confuse users.
There might be alternatives that achieve the efficiency sought here but also make the semantics clearer. While I am not yet sure what to suggest in our case, let's look for precedent at data.table. In that package, objects are shallow copied by <-, but standard R operations then override this by using the standard R modify-and-copy approach, in the functional paradigm of R. To modify objects in place, data.table requires the use of the := operator, which makes the in-place modification explicit. In the example below, note how the second operation, which assigns a new column using $, uses modify-and-copy.
library(data.table)
DT <- data.table(iris[1:5, ])
DT2 <- DT
address(DT)
#> [1] "0x12d9c4200"
address(DT2)
#> [1] "0x12d9c4200"
DT[, x := 0]
DT2
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> 1: 5.1 3.5 1.4 0.2 setosa 0
#> 2: 4.9 3.0 1.4 0.2 setosa 0
#> 3: 4.7 3.2 1.3 0.2 setosa 0
#> 4: 4.6 3.1 1.5 0.2 setosa 0
#> 5: 5.0 3.6 1.4 0.2 setosa 0
address(DT)
#> [1] "0x12d9c4200"
address(DT2)
#> [1] "0x12d9c4200"
DT$y <- 1
DT2
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> 1: 5.1 3.5 1.4 0.2 setosa 0
#> 2: 4.9 3.0 1.4 0.2 setosa 0
#> 3: 4.7 3.2 1.3 0.2 setosa 0
#> 4: 4.6 3.1 1.5 0.2 setosa 0
#> 5: 5.0 3.6 1.4 0.2 setosa 0
address(DT)
#> [1] "0x12da59400"
address(DT2)
#> [1] "0x12d9c4200"
Created on 2023-05-10 with reprex v2.0.2
from quanteda.
Quanteda users can choose whichever they prefer: := is like tokens_xptr and <- is like the classic tokens. tokens_xptr works best with |>.
from quanteda.
Again, I don't think many people will use tokens_xptr, because their data is too small to benefit from its efficiency.
from quanteda.
tokens_xptr works best with |>
But same effect:
library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
xtoks <- tokens(c("A a a, b.", "A b c c c?"), xptr = TRUE)
xtoks |>
    tokens_remove("b")
#> Tokens consisting of 2 documents (pointer to 0x11dab8748).
#> text1 :
#> [1] "A" "a" "a" "," "."
#>
#> text2 :
#> [1] "A" "c" "c" "c" "?"
# "b" gone from original
xtoks
#> Tokens consisting of 2 documents (pointer to 0x11dab8748).
#> text1 :
#> [1] "A" "a" "a" "," "."
#>
#> text2 :
#> [1] "A" "c" "c" "c" "?"
Created on 2023-05-11 with reprex v2.0.2
I suppose there is no way to avoid modifying the original without making a deep copy of it. So we just need to alert users that the behaviour is very different for xptr objects, which are references, and that they should be fully aware that these are not simply a more efficient version of tokens. That is why they are not the default.
from quanteda.
Exactly! If they don't want to modify the original object, they can use as.tokens_xptr() at the beginning of a pipeline like this:
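The snippet that originally followed is not preserved in this copy of the thread. As a stand-in, here is a hedged sketch of what such a pipeline might look like, assuming quanteda 4.x. It reuses the example texts from the earlier comment, and the name xtoks_nob is hypothetical.

```r
library("quanteda")

xtoks <- tokens(c("A a a, b.", "A b c c c?"), xptr = TRUE)

# Deep copy first, then modify the copy in place
xtoks_nob <- xtoks |>
    as.tokens_xptr() |>
    tokens_remove("b")

# Convert to regular tokens objects to inspect the contents
as.tokens(xtoks)      # "b" is still in the original
as.tokens(xtoks_nob)  # "b" removed from the copy only
```

Every step after as.tokens_xptr() still runs in place on the copy, so only the single deep copy is paid for.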
from quanteda.