I updated to arrow v11.0.0.2 yesterday and noticed that identical workflows are loading files 50-60x slower.
The benchmark selectively reads 200 columns from a 3400-row x 16000-column dataset, where each column carries two text attributes (full repro below):
~300 ms with v9.0.0
~17,000 ms with v11.0.0.2
If I re-save the data with v11 and then re-read it with v11, it's the same: ~17 seconds.
On the Mac (M1, 10-core) it's about 12 seconds.
I understand there were some significant changes under the hood in the C++ library with the v10 updates, but based on the release notes no action should be required, and I'm not doing anything fancy on Ubuntu (I have gcc v9.4) or macOS (clang v13).
Any idea what's going on? We store thousands of Parquet files, so I'm wondering whether we should avoid >= v11.
Sorry if this is mislabeled as a bug, but it seems like one to me, as I haven't changed anything else and both systems meet the requirements.
library(arrow)
library(microbenchmark)
# fake data, 3500x16000 with attributes
test_data <- as.data.frame(
  lapply(setNames(1:16000, paste0("col", 1:16000)), function(x) {
    col <- sample(c(1:5, NA), 3500, replace = TRUE)
    levels(col) <- paste("level", 1:4)
    attr(col, "text") <- "test question text"
    col
  })
)
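# quick sanity check (not part of the benchmark): confirm each column carries
# the "levels" and "text" attributes that arrow round-trips through the file metadata
str(attributes(test_data$col1))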
# write out the file somewhere
file <- "~/Desktop/test_data_v9.parquet"
### using arrow 9.0.0
packageVersion('arrow')
arrow::write_parquet(test_data, file)
### benchmark
microbenchmark::microbenchmark(
  arrow::read_parquet(file, col_select = 1:200), times = 10
)
### NOW SWITCH to arrow 11.0.0.2 and re-read the file written by 9.0.0
packageVersion('arrow')
microbenchmark::microbenchmark(
  arrow::read_parquet(file, col_select = 1:200), times = 10
)
### now re-create, save, re-read using all v11
packageVersion('arrow')
file2 <- "~/Desktop/test_data_v11.parquet"
arrow::write_parquet(test_data, file2)
### benchmark
microbenchmark::microbenchmark(
  arrow::read_parquet(file2, col_select = 1:200), times = 10
)
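For what it's worth, here's a rough extra check to separate the file read from the conversion back to a data.frame (just a sketch; it assumes as_data_frame = FALSE skips the attribute restoration step the same way in both versions). Comparing the two timings should show whether the slowdown is in reading the file itself or in converting the result back to a data.frame.
### optional: time the Arrow Table read vs. the full data.frame conversion
microbenchmark::microbenchmark(
  table_only = arrow::read_parquet(file2, col_select = 1:200, as_data_frame = FALSE),
  data_frame = arrow::read_parquet(file2, col_select = 1:200),
  times = 10
)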