
Comments (7)

koheiw avatar koheiw commented on July 2, 2024

This is exactly how pointers (or pass by reference) work. tokens_xptr functions modify the underlying data in place and return a pointer to the data, such as "0x116016470" (a shallow copy). This way, they use much less CPU and RAM.

If you want to keep the original object intact, you should make a deep copy first, e.g. tokens(as.tokens_xptr(xtoks2), remove_punct = TRUE), or not use tokens_xptr at all.

from quanteda.

koheiw avatar koheiw commented on July 2, 2024

Python's list is also passed by reference: https://stackoverflow.com/questions/8744113/python-list-by-value-not-by-reference
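
To make the Python comparison concrete, here is a minimal sketch using only the standard library. The variable names are illustrative and the token list is made up; it is an analogy for the aliasing behaviour, not a model of how tokens_xptr is implemented.

```python
import copy

tokens = ["A", "a", "a", ",", "b", "."]

alias = tokens                     # no copy: both names refer to one list object
alias.remove(",")                  # mutating through the second name

snapshot = copy.deepcopy(tokens)   # independent copy taken before further edits
tokens.remove("b")                 # later mutation leaves the snapshot alone

print(alias is tokens)   # True
print(tokens)            # ['A', 'a', 'a', '.']
print(snapshot)          # ['A', 'a', 'a', 'b', '.']
```

The comma disappears from both names because there is only one list; the snapshot keeps "b" because it was copied before the second removal.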


kbenoit avatar kbenoit commented on July 2, 2024

Yes, and this is one of the fundamental differences between R and other languages such as Python, where lists are passed by reference and names are bound to external pointers. So if we bind a second name to the same object via = and modify the object through the second name, the first name sees the modification too. I think we explain this pretty well in the vignette.

What's less clear is the modify-in-place functionality, which is very un-R-like, and could confuse users because of how differently the same tokens_*() functions operate in the absence of any assignment. In my example above,

tokens(xtoks2, remove_punct = TRUE)

modified xtoks2 and any other shallow copies whose object names point to the same memory address.

Had I used

xtoks3 <- tokens(xtoks2, remove_punct = TRUE)

it would have a) modified xtoks2 and b) made a shallow copy of the modified object called xtoks3.

This is not something we explained in the vignette, and it is a bigger departure from how the package works (and from how R works in almost every other package) than the issue of binding names to the same memory address. If we had a Python version of quanteda, it would probably look like mytokens.tokens(remove_punct = TRUE) to modify mytokens in place. If we had an R6 version of quanteda, it would work that way too. But making that shift within the current syntax is going to confuse users.
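
As a rough illustration of what such a method-style, modify-in-place interface could look like, here is a sketch in plain Python. The Tokens class, its remove_punct() and copy() methods, and the sample data are all hypothetical; no Python version of quanteda is assumed to exist, and the punctuation rule is deliberately simplistic.

```python
import string

# Single-character ASCII punctuation only; a real tokenizer would do much more.
PUNCT = set(string.punctuation)


class Tokens:
    """Hypothetical container; not part of quanteda or any real library."""

    def __init__(self, docs):
        # Copy the nested lists so the object owns its own data.
        self.docs = [list(doc) for doc in docs]

    def remove_punct(self):
        # Mutates self.docs in place and returns self so calls can be chained.
        for doc in self.docs:
            doc[:] = [tok for tok in doc if tok not in PUNCT]
        return self

    def copy(self):
        # Explicit deep copy for callers who want to keep the original.
        return Tokens(self.docs)


mytokens = Tokens([["A", "a", "a", ",", "b", "."]])
kept = mytokens.copy()             # taken before the in-place call
result = mytokens.remove_punct()   # modifies mytokens itself

print(result is mytokens)  # True
print(mytokens.docs)       # [['A', 'a', 'a', 'b']]
print(kept.docs)           # [['A', 'a', 'a', ',', 'b', '.']]
```

With method syntax the receiver is visibly the thing being changed, which is the cue that function-call syntax does not give.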

There might be alternatives that achieve the efficiency sought here but also make the semantics clearer. While I am not yet sure what to suggest in our case, let's look to data.table for precedent. In that package, <- makes only a shallow copy, so both names point to the same object. To modify an object in place, data.table requires the := operator, which makes the intent explicit. Standard R operators, by contrast, keep R's usual functional modify-and-copy behaviour. In the example below, note how the second operation, which assigns a new column using $, uses modify-and-copy.

library(data.table)

DT <- data.table(iris[1:5, ])
DT2 <- DT
address(DT)
#> [1] "0x12d9c4200"
address(DT2)
#> [1] "0x12d9c4200"

DT[, x := 0]
DT2
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> 1:          5.1         3.5          1.4         0.2  setosa 0
#> 2:          4.9         3.0          1.4         0.2  setosa 0
#> 3:          4.7         3.2          1.3         0.2  setosa 0
#> 4:          4.6         3.1          1.5         0.2  setosa 0
#> 5:          5.0         3.6          1.4         0.2  setosa 0
address(DT)
#> [1] "0x12d9c4200"
address(DT2)
#> [1] "0x12d9c4200"


DT$y <- 1
DT2
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> 1:          5.1         3.5          1.4         0.2  setosa 0
#> 2:          4.9         3.0          1.4         0.2  setosa 0
#> 3:          4.7         3.2          1.3         0.2  setosa 0
#> 4:          4.6         3.1          1.5         0.2  setosa 0
#> 5:          5.0         3.6          1.4         0.2  setosa 0
address(DT)
#> [1] "0x12da59400"
address(DT2)
#> [1] "0x12d9c4200"

Created on 2023-05-10 with reprex v2.0.2
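
For comparison with the Python case raised earlier in the thread, Python's built-in list API draws a similar line between an explicit in-place operation and a copying one. This is only an analogy to := versus $<-, sketched with made-up data:

```python
features = ["c", "a", "b"]
alias = features                 # second name for the same list

ordered = sorted(features)       # copying form: returns a new list
print(features)                  # ['c', 'a', 'b']

returned = features.sort()       # in-place form: mutates and returns None
print(returned)                  # None
print(alias)                     # ['a', 'b', 'c']
print(ordered is features)       # False
```

Because the in-place form returns None, it cannot be mistaken for the copying form in an assignment.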


koheiw avatar koheiw commented on July 2, 2024

Quanteda users can choose whichever they prefer: := is like the tokens_xptr and <- is like the classic tokens. tokens_xptr works best with |>.


koheiw avatar koheiw commented on July 2, 2024

Again, I don't think many people use tokens_xptr, because their data is too small to benefit from its efficiency.


kbenoit avatar kbenoit commented on July 2, 2024

"tokens_xptr works best with |>"

But same effect:

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

xtoks <- tokens(c("A a a, b.", "A b c c c?"), xptr = TRUE)
xtoks |>
    tokens_remove("b")
#> Tokens consisting of 2 documents (pointer to 0x11dab8748).
#> text1 :
#> [1] "A" "a" "a" "," "."
#> 
#> text2 :
#> [1] "A" "c" "c" "c" "?"

# "b" gone from original
xtoks
#> Tokens consisting of 2 documents (pointer to 0x11dab8748).
#> text1 :
#> [1] "A" "a" "a" "," "."
#> 
#> text2 :
#> [1] "A" "c" "c" "c" "?"

Created on 2023-05-11 with reprex v2.0.2

I suppose there is no way to avoid modifying the original without making a deep copy of it. So we just need to alert users that the behaviour is very different for xptr objects, which are references, and that they should be fully aware that these are not simply a more efficient version of tokens, which is why they are not the default.


koheiw avatar koheiw commented on July 2, 2024

Exactly! If they don't want to modify the original object, they can use as.tokens_xptr() at the beginning of a pipeline, like this:

http://quanteda.io/articles/pkgdown/tokens_xptr.html#creating-a-document-feature-matrix-from-a-tokens_xptr-object

