
yutannihilation commented on September 27, 2024

Thanks for the invitation! Honestly, I rarely arrange() text data, so I had no idea about issues specific to the Japanese locale. But I posted the question on our community Slack and they pointed me to this article (in Japanese, but I hope the code is useful at least): https://qiita.com/nozma/items/4aea36022ce18a6aa5ca

The article says the problem is that ICU's locale C doesn't behave the same as base R's locale C (and it seems ICU has no option corresponding to R's locale C?)

x <- c("あ", "イ", "う", "エ", "お")

# my locale
Sys.getlocale("LC_COLLATE")
#> [1] "ja_JP.UTF-8"

# default
sort(x)
#> [1] "あ" "イ" "う" "エ" "お"

# locale C with base R
withr::with_collate("C", sort(x))
#> [1] "あ" "う" "お" "イ" "エ"

# locale C with stringi
stringi::stri_sort(x, locale = "C")
#> [1] "あ" "イ" "う" "エ" "お"

Created on 2021-07-29 by the reprex package (v2.0.0)

I basically think the proposal is great because specifying the locale makes the code more portable and reproducible, but this difference might surprise the users a bit (edit: I meant this "surprise" is only about the consistency and it's not that any of the above orderings is "wrong". In CJK locale, there's no obvious "correct order" in most cases simply because we have too many letters to think in one's head. So, I think the ordering isn't that serious to us, compared to locales that share some letters with English alphabets).

Btw, the article shows more detailed ICU options. Could these also be made available? If so, this looks like great news to me!

x <- c("A1", "A2", "A12")

sort(x)
#> [1] "A1"  "A12" "A2"

stringi::stri_sort(x, locale = "en@colNumeric=yes") 
#> [1] "A1"  "A2"  "A12"

Created on 2021-07-29 by the reprex package (v2.0.0)

from tidyups.

DavisVaughan commented on September 27, 2024

> Will group_by()/summarize() use the same sorting algorithm as arrange()?

@markfairbanks good question! As you probably know, group_by() currently computes groupings in two steps. It first locates the groups and then orders them internally using order() (which respects LC_COLLATE, like arrange() currently does). So when you do a summarize() afterwards you get results ordered by group.

In a separate tidyup (but probably in the same dplyr release), we are likely going to switch group_by() to a vec_order()-backed algorithm that simultaneously computes the groupings and orders them. This should produce nice performance improvements when grouping by any data type (not just character).

That said, it also means that LC_COLLATE will no longer be respected. The current thinking is that group_by() would always compute ordered groups using the C locale, because that is so fast, and this would propagate to the order of the groups returned by summarize(). If you really need the character groups returned from summarize() to follow a specific ordering, then you'd have to explicitly arrange() the results after summarizing (I believe this is similar to SQL, where order isn't guaranteed without an ORDER BY step; see https://stackoverflow.com/questions/10064532/the-order-of-a-sql-select-statement-without-order-by-clause).

So if you were to do an arrange(.locale = "C") to physically reorder the rows before a series of steps that might involve multiple group_by() + summarize() combinations, you'd probably see a pretty nice performance improvement since the groups would already be in sequential order.
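
The explicit reordering step described above might look like the following. This is a minimal sketch, not the proposal's reference code: it assumes dplyr >= 1.1.0 (where arrange() has a `.locale` argument), R >= 4.1 for the `|>` pipe, and stringi installed for the non-"C" locale. The data frame `df` and its columns `g` and `v` are hypothetical.

```r
library(dplyr)

# Hypothetical data: mixed-case group labels
df <- tibble(g = c("b", "A", "a", "B", "a"), v = 1:5)

# The group order returned by summarise() is whatever group_by() computed
totals <- df |>
  group_by(g) |>
  summarise(total = sum(v))

# Explicit ordering for presentation; "en" goes through stringi/ICU
totals_en <- arrange(totals, g, .locale = "en")

# C-locale ordering: upper case sorts before lower case
totals_c <- arrange(totals, g, .locale = "C")
totals_c$g
#> [1] "A" "B" "a" "b"
```

Stating the locale in the final arrange() keeps the presented order independent of both the machine's LC_COLLATE and whatever order group_by() happens to use internally.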


mgirlich commented on September 27, 2024

Working with text is often quite tricky and sorting is no exception. For example

  • natural sort order vs alphabetical sort order (as in the example of @yutannihilation)
x <- c("1", "2", "10")

stringi::stri_sort(x) 
#> [1] "1"  "10" "2"

stringi::stri_sort(x, locale = "en@colNumeric=yes") 
#> [1] "1"  "2"  "10"
  • Unicode... e.g. the "a" character in the Cyrillic alphabet
sort(c("A", "B", "\u0410"))
#> [1] "A" "B" "А"

So, I'm already used to being "surprised" when working with text. Specifying the locale just makes sense to at least reduce the amount of surprise. Usually, text has to be normalised in one way or the other anyway before being useful to work with.

I didn't encounter a situation where I had to sort by a text column for analytical reasons. Rather, I want the same text to be next to each other (so basically the text column defines a group) and because it is nicer to glance over and look at. In both cases I would be fine with sorting in the C locale and actually prefer the performance benefits of sorting in the C locale.

The "nicer" order of the English locale only becomes relevant when I have to export data for other people. Having to explicitly change the locale in these cases feels reasonable to me.


DavisVaughan commented on September 27, 2024

Thanks all for your feedback! I'll be closing this issue in #17 today.

We have ultimately decided not to incorporate stri_opts_collator() at this time. While we recognize that numeric is a useful argument, we think that the rest of them are pretty special cases that don't offer enough for us to officially support them in arrange(). If you really need to be able to order with these special options, you can always call vec_order() directly and supply a custom sort key generator like:

library(vctrs)
library(stringi)

# `df` is your data frame and `cols` the columns to order by
order <- vec_order(
  df[cols],
  chr_transform = ~stri_sort_key(.x, opts_collator = stri_opts_collator())
)

vec_slice(df, order)


DavisVaughan commented on September 27, 2024

@yutannihilation thanks so much for re-sharing this! That feedback is extremely helpful.

"C" is actually an invalid locale in ICU, so the behavior you are seeing with "C" and stri_sort() is actually just stringi silently swallowing the locale argument and falling back to your system default. The closest ICU locale to "C" is "en_US_POSIX" (from what I can tell), but that isn't quite right either.

However, in arrange() we special case "C" to mean "do not use stringi at all", so it just passes through to vec_order() which uses the C locale by default (which matches base R).

So I imagine this behavior of arrange() will be exactly what is expected:

library(dplyr)
library(vctrs)

x <- c("あ", "イ", "う", "エ", "お")

withr::with_collate("C", sort(x))
#> [1] "あ" "う" "お" "イ" "エ"

# you are seeing `stri_sort()` silently swallow an "invalid" locale
# and use your default instead
stringi::stri_sort(x, locale = "C")
#> [1] "あ" "イ" "う" "エ" "お"
stringi::stri_sort(x, locale = "foobar")
#> [1] "あ" "イ" "う" "エ" "お"

# Special case to not use stringi. This matches base R.
arrange(data.frame(x = x), .locale = "C")
#>    x
#> 1 あ
#> 2 う
#> 3 お
#> 4 イ
#> 5 エ

# i.e. it basically just does:
x[vec_order(x)]
#> [1] "あ" "う" "お" "イ" "エ"

> I think the ordering isn't that serious to us, compared to locales that share some letters with English alphabets

This is where we landed as well, but it is nice to hear confirmation of this. We phrased this as "non-English Latin script" languages (e.g. French or Danish) - these are the ones that we figured might be a little surprised.


Regarding the additional options to ICU, the official way to do this in stringi is to create a list of collator options with stri_opts_collator() and pass that on to your sorting functions (I didn't know putting it in the locale string would even work). We briefly considered allowing .locale = <list> in arrange(), which would allow this to work, but decided against it for now because we wanted to keep the scope narrow. Maybe we should reconsider that if it seems useful.

library(vctrs)
library(stringi)

x <- c("A12", "A1", "A2")

# C locale
x[vec_order(x)]
#> [1] "A1"  "A12" "A2"

# "en" locale with numeric support
opts <- stri_opts_collator(locale = "en", numeric = TRUE)
x[vec_order(x, chr_transform = ~stri_sort_key(.x, opts_collator = opts))]
#> [1] "A1"  "A2"  "A12"

# We could support:
# arrange(df, .locale = stri_opts_collator(locale = "en", numeric = TRUE))

@mgirlich

> I want the same text to be next to each other (so basically the text column defines a group)

I definitely think more advanced users will appreciate being able to do arrange(.locale = "C"), which will be very fast when they just want to physically sort by the groups. That additionally makes a followup group_by() + summarise() extremely fast - with the data already being physically sorted by group, it basically becomes sequential access rather than random access at the C level (see also tidyverse/dplyr#4406).


yutannihilation commented on September 27, 2024

> However, in arrange() we special case "C" to mean "do not use stringi at all", so it just passes through to vec_order() which uses the C locale by default (which matches base R).

Ah, sorry, the proposal clearly explains this, but it seems I was a bit confused. Then I think there's no problem. (Off topic: I think "C" isn't an invalid locale in ICU, since the documentation lists it explicitly: "Special case: C => en_US_POSIX", though the conclusion is the same anyway.)

Regarding the additional options to ICU, I understand your decision. Thanks for clarifying!


DavisVaughan commented on September 27, 2024

Something may be wrong with stringi then, because it definitely doesn't remap C->en_US_POSIX:

library(stringi)

x <- c("A", "a", "B", "b")

# C locale puts upper case first, then lower case
withr::with_collate("C", sort(x))
#> [1] "A" "B" "a" "b"

# Doesn't recognize C and just uses "en" ordering (my system default)
stri_sort(x, locale = "C")
#> [1] "a" "A" "b" "B"

# Looks like C
stri_sort(x, locale = "en_US_POSIX")
#> [1] "A" "B" "a" "b"
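
For comparison, base R can produce the same C-locale ordering without changing LC_COLLATE and without stringi. The sketch below relies on the documented behaviour of `method = "radix"` in sort() and order(), which always collates character vectors in the C locale. It only illustrates what "C ordering" means here; it is not how arrange() or vec_order() are implemented, and the data frame `df` with its `id` column is hypothetical.

```r
x <- c("A", "a", "B", "b")

# method = "radix" collates character vectors in the C locale,
# whatever LC_COLLATE is set to
sort(x, method = "radix")
#> [1] "A" "B" "a" "b"

# order() accepts the same argument, which is handy for data frames
df <- data.frame(x = x, id = seq_along(x))
df_c <- df[order(df$x, method = "radix"), ]
df_c$id
#> [1] 1 3 2 4
```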


yutannihilation commented on September 27, 2024

Oh, curious...


markfairbanks commented on September 27, 2024

> I didn't encounter a situation where I had to sort by a text column for analytical reasons. Rather, I want the same text to be next to each other (so basically the text column defines a group) and because it is nicer to glance over and look at. In both cases I would be fine with sorting in the C locale and actually prefer the performance benefits of sorting in the C locale.
>
> The "nicer" order of the english locale only gets relevant when I have to export data for other people. Having to explicitly change the locale in these cases feels reasonable to me.

Just want to second everything @mgirlich mentioned here - it lines up with my thoughts. Overall I think the proposal is a good one. I think it's becoming more common for R users to work on larger datasets, so the performance improvements will be extremely useful.

@DavisVaughan

> That additionally makes a followup group_by() + summarise() extremely fast

Will group_by()/summarize() use the same sorting algorithm as arrange()?

