Comments (9)
Thanks for the invite! Honestly, I rarely arrange() text data, so I had no idea about issues specific to the Japanese locale. But I posted the question on our community Slack and they pointed me to this article (in Japanese, but I hope the code is useful at least): https://qiita.com/nozma/items/4aea36022ce18a6aa5ca
The article says the problem is that ICU's locale C doesn't behave the same as base R's locale C (and it seems ICU has no option corresponding to R's locale C?).
x <- c("あ", "う", "お", "イ", "エ")
# my locale
Sys.getlocale("LC_COLLATE")
#> [1] "ja_JP.UTF-8"
# default
sort(x)
#> [1] "あ" "イ" "う" "エ" "お"
# locale C with base R
withr::with_collate("C", sort(x))
#> [1] "あ" "う" "お" "イ" "エ"
# locale C with stringi
stringi::stri_sort(x, locale = "C")
#> [1] "あ" "イ" "う" "エ" "お"
Created on 2021-07-29 by the reprex package (v2.0.0)
I basically think the proposal is great because specifying the locale makes the code more portable and reproducible, but this difference might surprise users a bit (edit: I meant this "surprise" is only about consistency; it's not that any of the above orderings is "wrong". In CJK locales, there's no obvious "correct order" in most cases, simply because we have too many characters to keep in one's head. So, I think the ordering isn't that serious to us, compared to locales that share some letters with English alphabets).
Btw, the article shows more detailed ICU options. Could these also be made available? If so, that would be great news to me!
x <- c("A1", "A2", "A12")
sort(x)
#> [1] "A1" "A12" "A2"
stringi::stri_sort(x, locale = "en@colNumeric=yes")
#> [1] "A1" "A2" "A12"
Created on 2021-07-29 by the reprex package (v2.0.0)
from tidyups.
Will group_by()/summarize() use the same sorting algorithm as arrange()?
@markfairbanks good question! As you probably know, group_by() currently computes groupings in two steps. It first locates the groups, and then orders those groups internally using order() (which respects LC_COLLATE like arrange() currently does). So when you do a summarize() afterwards you get results ordered by group.
In a separate tidyup (but probably in the same dplyr release), we are likely going to switch group_by() to a vec_order()-backed algorithm that simultaneously computes the groupings and orders them. This should produce nice performance improvements when grouping by any data type (not just character).
That said, it also means that LC_COLLATE will no longer be respected. The current thinking is that group_by() would always compute ordered groups using the C locale, because that is so fast, which would propagate to the order of the groups returned by summarize(). If you really need the character groups returned from summarize() to follow a specific ordering, then you'd have to explicitly arrange() the results after summarizing (I believe that this is similar to SQL, where order isn't guaranteed without an ORDER BY step, see e.g. https://stackoverflow.com/questions/10064532/the-order-of-a-sql-select-statement-without-order-by-clause).
So if you were to do an arrange(.locale = "C") to physically reorder the rows before a series of steps that might involve multiple group_by() + summarize() combinations, you'd probably see a pretty nice performance improvement since the groups would already be in sequential order.
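To make the "arrange() explicitly after summarizing" advice concrete, here is a minimal sketch. It assumes a dplyr version in which the .locale argument to arrange() proposed in this tidyup is available; the data frame and its column names are made up for illustration.

```r
library(dplyr)

# Made-up data: `g` is the grouping column, `v` the values to total.
df <- data.frame(
  g = c("b", "A", "a", "B", "a"),
  v = 1:5
)

out <- df %>%
  group_by(g) %>%
  summarise(total = sum(v)) %>%
  # Don't rely on the group order coming out of summarise(); request the
  # ordering you need explicitly. Assumes `.locale` exists in your dplyr.
  arrange(g, .locale = "C")

out$g
#> [1] "A" "B" "a" "b"
```

With "C", upper-case ASCII letters sort before lower-case ones. As described in this thread, any other locale would go through stringi, so that package would need to be installed.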
from tidyups.
Working with text is often quite tricky and sorting is no exception. For example
- natural sort order vs alphabetical sort order (as in the example of @yutannihilation)
x <- c("1", "2", "10")
stringi::stri_sort(x)
#> [1] "1" "10" "2"
stringi::stri_sort(x, locale = "en@colNumeric=yes")
#> [1] "1" "2" "10"
- Unicode, e.g. the "A" character in the Cyrillic alphabet
sort(c("A", "B", "\u0410"))
#> [1] "A" "B" "А"
So, I'm already used to being "surprised" when working with text. Specifying the locale just makes sense to at least reduce the amount of surprise. Usually, text has to be normalised in one way or the other anyway before being useful to work with.
I didn't encounter a situation where I had to sort by a text column for analytical reasons. Rather, I want the same text to be next to each other (so basically the text column defines a group) and because it is nicer to glance over and look at. In both cases I would be fine with sorting in the C locale and actually prefer the performance benefits of sorting in the C locale.
The "nicer" order of the English locale only becomes relevant when I have to export data for other people. Having to explicitly change the locale in these cases feels reasonable to me.
from tidyups.
Thanks all for your feedback! I'll be closing this issue in #17 today.
We have ultimately decided not to incorporate stri_opts_collator() at this time. While we recognize that numeric is a useful argument, we think that the rest of them are pretty special cases that don't offer enough to officially support them in arrange(). If you really need to be able to order with these special options, you can always call vec_order() directly and supply a custom sort key generator like:
library(vctrs)
library(stringi)
# `df` is your data frame and `cols` the columns to order by
order <- vec_order(
  df[cols],
  chr_transform = ~stri_sort_key(.x, opts_collator = stri_opts_collator())
)
vec_slice(df, order)
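If your installed vctrs does not expose a chr_transform argument, the same sort-key idea can be sketched with base R's order(). This is only an illustration under assumptions: the data frame df and its columns are made up, and it relies on stri_sort_key() producing keys intended for byte-wise comparison, which is why order() is called with method = "radix" (C-locale comparison for character vectors).

```r
library(stringi)

# Made-up example data; `df`, `id` and `value` are illustrative names only.
df <- data.frame(id = c("A12", "A1", "A2"), value = c(3, 1, 2))

# Collator options: English rules, with digit runs compared as numbers.
opts <- stri_opts_collator(locale = "en", numeric = TRUE)

# Sort keys are meant to be compared byte-wise, so order them in the C locale.
keys <- stri_sort_key(df$id, opts_collator = opts)
sorted <- df[order(keys, method = "radix"), ]

sorted$id
#> [1] "A1"  "A2"  "A12"
```

Any other collator option can be swapped in through stri_opts_collator() without changing the ordering step.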
from tidyups.
@yutannihilation thanks so much for re-sharing this! That feedback is extremely helpful.
"C" is actually an invalid locale in ICU, so the behavior you are seeing with "C" and stri_sort() is actually just stringi silently swallowing the locale argument and falling back to your system default. The closest ICU locale to "C" is "en_US_POSIX" (from what I can tell), but that isn't quite right either.
However, in arrange() we special case "C" to mean "do not use stringi at all", so it just passes through to vec_order() which uses the C locale by default (which matches base R).
So I imagine this behavior of arrange() will be exactly what is expected:
library(dplyr)
library(vctrs)
x <- c("あ", "う", "お", "イ", "エ")
withr::with_collate("C", sort(x))
#> [1] "あ" "う" "お" "イ" "エ"
# you are seeing `stri_sort()` silently swallow an "invalid" locale
# and use your default instead
stringi::stri_sort(x, locale = "C")
#> [1] "あ" "イ" "う" "エ" "お"
stringi::stri_sort(x, locale = "foobar")
#> [1] "あ" "イ" "う" "エ" "お"
# Special case to not use stringi. This matches base R.
arrange(data.frame(x = x), .locale = "C")
#> x
#> 1 あ
#> 2 う
#> 3 お
#> 4 イ
#> 5 エ
# i.e. it basically just does:
x[vec_order(x)]
#> [1] "あ" "う" "お" "イ" "エ"
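Because an unrecognised locale is silently replaced by the system default, it can help to validate the locale before sorting. The helper below is hypothetical (check_locale() is not part of stringi or dplyr); it is a rough sketch that only compares the name against the identifiers reported by stri_locale_list().

```r
library(stringi)

# Hypothetical helper (not part of stringi or dplyr): fail loudly instead of
# letting an unrecognised locale fall back to the system default.
check_locale <- function(locale) {
  if (!locale %in% stri_locale_list()) {
    stop("Unrecognised ICU locale: ", locale, call. = FALSE)
  }
  invisible(locale)
}

x <- c("A", "a", "B", "b")

# Passes the check, then sorts with the "en" collation rules.
stri_sort(x, locale = check_locale("en"))
```

This check is stricter than ICU itself, which also accepts identifiers carrying keywords such as "en@colNumeric=yes", so treat it as a guard for plain locale names only.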
I think the ordering isn't that serious to us, compared to locales that share some letters with English alphabets
This is where we landed as well, but it is nice to hear confirmation of this. We phrased this as "non-English Latin script" languages (i.e. French or Danish) - these are the ones that we figured might be a little surprised.
Regarding the additional options to ICU, the official way to do this in stringi is to create a list of collator options with stri_opts_collator() and pass that on to your sorting functions (I didn't know putting it in the locale string would even work). We briefly considered allowing .locale = <list> in arrange(), which would allow this to work, but decided against it for now because we wanted to keep the scope narrow. Maybe we should reconsider that if it seems useful.
library(vctrs)
library(stringi)
x <- c("A12", "A1", "A2")
# C locale
x[vec_order(x)]
#> [1] "A1" "A12" "A2"
# "en" locale with numeric support
opts <- stri_opts_collator(locale = "en", numeric = TRUE)
x[vec_order(x, chr_transform = ~stri_sort_key(.x, opts_collator = opts))]
#> [1] "A1" "A2" "A12"
# We could support:
# arrange(df, .locale = stri_opts_collator(locale = "en", numeric = TRUE))
I want the same text to be next to each other (so basically the text column defines a group)
I definitely think more advanced users will appreciate being able to do arrange(.locale = "C"), which will be very fast when they just want to physically sort by the groups. That additionally makes a followup group_by() + summarise() extremely fast - with the data already being physically sorted by group, it basically becomes sequential access rather than random access at the C level (see also tidyverse/dplyr#4406).
from tidyups.
However, in arrange() we special case "C" to mean "do not use stringi at all", so it just passes through to vec_order() which uses the C locale by default (which matches base R).
Ah, sorry, the proposal clearly explains this, but it seems I was a bit confused. Then, I think there's no problem. (As an off-topic aside, I think "C" isn't an invalid locale in ICU, as the documentation lists it explicitly: "Special case: C => en_US_POSIX", though the conclusion is the same anyway.)
Regarding the additional options to ICU, I understand your decision. Thanks for clarifying!
from tidyups.
Something may be wrong with stringi then, because it definitely doesn't remap C -> en_US_POSIX:
library(stringi)
x <- c("A", "a", "B", "b")
# C locale puts upper case first, then lower case
withr::with_collate("C", sort(x))
#> [1] "A" "B" "a" "b"
# Doesn't recognize C and just uses "en" ordering (my system default)
stri_sort(x, locale = "C")
#> [1] "a" "A" "b" "B"
# Looks like C
stri_sort(x, locale = "en_US_POSIX")
#> [1] "A" "B" "a" "b"
from tidyups.
Oh, curious...
from tidyups.
I didn't encounter a situation where I had to sort by a text column for analytical reasons. Rather, I want the same text to be next to each other (so basically the text column defines a group) and because it is nicer to glance over and look at. In both cases I would be fine with sorting in the C locale and actually prefer the performance benefits of sorting in the C locale.
The "nicer" order of the English locale only becomes relevant when I have to export data for other people. Having to explicitly change the locale in these cases feels reasonable to me.
Just want to second everything @mgirlich mentioned here - it lines up with my thoughts. Overall I think the proposal is a good one. I think it's becoming more common for R users to work on larger datasets, so the performance improvements will be extremely useful.
That additionally makes a followup group_by() + summarise() extremely fast
Will group_by()/summarize() use the same sorting algorithm as arrange()?
from tidyups.