I updated to arrow v11.0.0.2 yesterday and noticed that identical workflows are loading files 50-60x slower.
The benchmark selectively reads 200 columns from a 3400-row x 16000-column dataset, where each column carries two text attributes (full repro below):
~300 ms with v9.0.0
~17,000 ms with v11.0.0.2
If I re-save the data with v11 and then re-read it with v11, it's the same: ~17 seconds.
On the Mac (M1, 10-core) it's about 12 seconds.
I understand there were some significant changes under the hood in the C++ library with the v10 updates, but based on the release notes no action should be required, and I'm not doing anything fancy on Ubuntu (I have gcc v9.4) or macOS (clang v13).
Any idea what's going on? We store thousands of Parquet files, so I'm wondering whether we should avoid >= v11.
Sorry if this is mislabeled as a bug, but it seems like one to me, as I haven't changed anything else and both systems meet the requirements.
library(arrow)
library(microbenchmark)
# fake data, 3500x16000 with attributes
test_data <- as.data.frame(
  lapply(setNames(1:16000, paste0("col", 1:16000)), function(x) {
    col <- sample(c(1:5, NA), 3500, replace = TRUE)
    levels(col) <- paste("level", 1:4)
    attr(col, "text") <- "test question text"
    col
  })
)
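# quick sanity check (not part of the benchmark): confirm each column carries
# the "levels" and "text" attributes that arrow round-trips through the file metadata
str(attributes(test_data$col1))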
# write out the file somewhere
file <- "~/Desktop/test_data_v9.parquet"
### using arrow 9.0.0
packageVersion('arrow')
arrow::write_parquet(test_data, file)
### benchmark
microbenchmark::microbenchmark(
  arrow::read_parquet(file, col_select = 1:200), times = 10
)
### NOW SWITCH to arrow 11.0.0.2 and re-read the file written by 9.0.0
packageVersion('arrow')
microbenchmark::microbenchmark(
  arrow::read_parquet(file, col_select = 1:200), times = 10
)
### now re-create, save, re-read using all v11
packageVersion('arrow')
file2 <- "~/Desktop/test_data_v11.parquet"
arrow::write_parquet(test_data, file2)
### benchmark
microbenchmark::microbenchmark(
  arrow::read_parquet(file2, col_select = 1:200), times = 10
)
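For what it's worth, here's a rough extra check to separate the file read from the conversion back to a data.frame (just a sketch; it assumes as_data_frame = FALSE skips the attribute restoration step the same way in both versions). Comparing the two timings should show whether the slowdown is in reading the file itself or in converting the result back to a data.frame.
### optional: time the Arrow Table read vs. the full data.frame conversion
microbenchmark::microbenchmark(
  table_only = arrow::read_parquet(file2, col_select = 1:200, as_data_frame = FALSE),
  data_frame = arrow::read_parquet(file2, col_select = 1:200),
  times = 10
)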