This goes beyond the issue of shallow v deep copies. It modifies the object in place,

tokens.tokens_xptr() with remove_ options permanently removes stuff (quanteda, CLOSED)

kbenoit commented on July 2, 2024

tokens.tokens_xptr() with remove_ options permanently removes stuff

from quanteda.

Comments (7)

koheiw commented on July 2, 2024

This is exactly how pointers (or pass by reference) work. tokens_xptr functions modify the underlying data in place and return a pointer to the data, like "0x116016470" (a shallow copy). This way, they use much less CPU and RAM.

If you want to keep the original object intact, you should deep copy it first, e.g. tokens(as.tokens_xptr(xtoks2), remove_punct = TRUE), or not use tokens_xptr at all.

koheiw commented on July 2, 2024

Python's list is also passed by reference: https://stackoverflow.com/questions/8744113/python-list-by-value-not-by-reference

kbenoit commented on July 2, 2024

Yes, and this is one of the fundamental differences between R and other languages such as Python, where lists are passed by reference and names are bound to external pointers. So if we bind a second name to the same object via =, and modify the object through the second name, the first is also modified. I think we explain this pretty well in the vignette.
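The distinction drawn here can be reproduced in base R alone. The following is a minimal sketch, not quanteda code: an ordinary vector follows R's usual copy-on-modify behaviour, while an environment (used here only as a stand-in for an external pointer) is shared by every name bound to it. The names v1, v2, e1 and e2 are made up for illustration.

```r
# Ordinary R objects: modifying through one name leaves the other alone
v1 <- c("A", "a", ",", "b")
v2 <- v1                 # both names refer to the same vector for now
v2 <- v2[v2 != ","]      # v2 now refers to a new vector
v1                       # still contains ","

# Reference objects: an environment stands in for an external pointer
e1 <- new.env()
e1$tokens <- c("A", "a", ",", "b")
e2 <- e1                 # second name bound to the same environment
e2$tokens <- e2$tokens[e2$tokens != ","]
e1$tokens                # "," is gone here too
identical(e1, e2)        # TRUE: one object, two names
```

A tokens_xptr object behaves like the second case, which is why a change made through any one name is visible through all the others.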

What's less clear is the modify-in-place functionality, which is very un-R-like, and could confuse users because of how differently the same tokens_*() functions operate in the absence of any assignment. In my example above,

tokens(xtoks2, remove_punct = TRUE)

modified xtoks2 and any other shallow copies whose object names point to the same memory address.

Had I used

xtoks3 <- tokens(xtoks2, remove_punct = TRUE)

it would have a) modified xtoks2 and b) made a shallow copy of the modified object called xtoks3.

This is not something we explained in the vignette, and is a bigger departure from how the package works (and R works in almost every other package) than the binding issue of names to the same memory address. If we had a Python version of quanteda, it probably would look like mytokens.tokens(remove_punct = TRUE) to modify mytokens in place. If we had an R6 version of quanteda, it would work that way too. But to make that shift within the current syntax is going to confuse users.

There might be alternatives that achieve the efficiency sought here but also make the semantics clearer. While I am not yet sure what to suggest in our case, let's look for precedent at data.table. In that package, objects are shallow copied by <- but standard R operations then override this by using the standard R copy-on-modify approach, in the functional paradigm of R. To modify objects in place, data.table requires the use of the := operator, to make this explicit, but reverts to copy-on-modify for standard R operators. In the example below, note how the second operation, which assigns a new column using $, uses copy-on-modify.

library(data.table)

DT <- data.table(iris[1:5, ])
DT2 <- DT
address(DT)
#> [1] "0x12d9c4200"
address(DT2)
#> [1] "0x12d9c4200"

DT[, x := 0]
DT2
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> 1:          5.1         3.5          1.4         0.2  setosa 0
#> 2:          4.9         3.0          1.4         0.2  setosa 0
#> 3:          4.7         3.2          1.3         0.2  setosa 0
#> 4:          4.6         3.1          1.5         0.2  setosa 0
#> 5:          5.0         3.6          1.4         0.2  setosa 0
address(DT)
#> [1] "0x12d9c4200"
address(DT2)
#> [1] "0x12d9c4200"


DT$y <- 1
DT2
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> 1:          5.1         3.5          1.4         0.2  setosa 0
#> 2:          4.9         3.0          1.4         0.2  setosa 0
#> 3:          4.7         3.2          1.3         0.2  setosa 0
#> 4:          4.6         3.1          1.5         0.2  setosa 0
#> 5:          5.0         3.6          1.4         0.2  setosa 0
address(DT)
#> [1] "0x12da59400"
address(DT2)
#> [1] "0x12d9c4200"

Created on 2023-05-10 with reprex v2.0.2

koheiw commented on July 2, 2024

Quanteda users can choose whichever they prefer: := is like the tokens_xptr and <- is like the classic tokens. tokens_xptr works best with |>.

koheiw commented on July 2, 2024

Again, I don't think many people use tokens_xptr because their data is too small to benefit from its efficiency.

kbenoit commented on July 2, 2024

> tokens_xptr works best with |>

But same effect:

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

xtoks <- tokens(c("A a a, b.", "A b c c c?"), xptr = TRUE)
xtoks |>
    tokens_remove("b")
#> Tokens consisting of 2 documents (pointer to 0x11dab8748).
#> text1 :
#> [1] "A" "a" "a" "," "."
#> 
#> text2 :
#> [1] "A" "c" "c" "c" "?"

# "b" gone from original
xtoks
#> Tokens consisting of 2 documents (pointer to 0x11dab8748).
#> text1 :
#> [1] "A" "a" "a" "," "."
#> 
#> text2 :
#> [1] "A" "c" "c" "c" "?"

Created on 2023-05-11 with reprex v2.0.2

I suppose there is no way to avoid modifying the original without making a deep copy of it. So we just need to alert users that the behaviour is very different for reference (xptr) objects, and that they should be fully aware that these are not simply a more efficient version of tokens. Which is why they are not the default.

koheiw commented on July 2, 2024

Exactly! If they don't want to modify the original object, they can use as.tokens_xptr() at the beginning of a pipeline like this:

http://quanteda.io/articles/pkgdown/tokens_xptr.html#creating-a-document-feature-matrix-from-a-tokens_xptr-object
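As a rough sketch of that suggestion, reusing the toy texts from the reprex above: starting the pipeline with as.tokens_xptr() should deep copy the object, so the removal is applied to the copy and not to xtoks. This assumes quanteda 4.x is installed and behaves as described in this thread; the name xtoks_nob is made up for illustration.

```r
library("quanteda")

xtoks <- tokens(c("A a a, b.", "A b c c c?"), xptr = TRUE)

# Deep copy first, then let tokens_remove() modify the copy in place
xtoks_nob <- as.tokens_xptr(xtoks) |>
    tokens_remove("b")

# Convert to ordinary tokens to compare the contents
as.list(as.tokens(xtoks))      # "b" should still be present
as.list(as.tokens(xtoks_nob))  # "b" should be gone
```

The deep copy costs the memory and time that tokens_xptr is meant to save, so it probably only makes sense once, at the point where the original object still needs to be kept.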
