assignuser / arrow

This project forked from apache/arrow

Size: 178.84 MB

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Home Page: https://arrow.apache.org/

License: Apache License 2.0

Languages: C++ 53.16%, Java 14.70%, Go 11.08%, Python 6.25%, Ruby 3.44%, R 3.26%, C 2.89%, TypeScript 2.13%, CMake 1.41%, Shell 0.79%, JavaScript 0.27%, Dockerfile 0.26%, Meson 0.12%, Thrift 0.07%, Makefile 0.06%, Batchfile 0.06%, Lua 0.02%, Awk 0.01%, FreeMarker 0.01%, HTML 0.01%

arrow's People

Contributors

alamb, alenkaf, andygrove, assignuser, bkietz, cyb70289, dependabot[bot], domoritz, emkornfield, fsaintjacques, jonkeane, jorgecarleitao, jorisvandenbossche, kou, kszucs, lidavidm, liyafan82, nealrichardson, nevi-me, paleolimbot, pcmoritz, pitrou, raulcd, thisisnic, tianchen92, wesm, westonpace, wjones127, xhochy, zeroshade

Forkers

assigneduser

arrow's Issues

v11.0.0.2 extremely slow with parquet files written in v9.0.0

Describe the bug, including details regarding any error messages, version, and platform.

I use read_parquet in a pretty straightforward way in several workflows at work. It has been a tremendous upgrade over rds files with the same content, for all the reasons arrow exists. Some of our datasets are wider than they are long, which isn't great for parquet, but we usually only need a small subset of columns.

I updated to arrow v11.0.0.2 yesterday and noticed that the identical workflows are loading files 50-60x slower.

A benchmark selectively reading 200 columns from a 3,400-row × 16,000-column dataset (each column carrying two text attributes):
~300 ms with v9.0.0
~17,000 ms with v11.0.0.2

Ubuntu, 8-core Intel i7.

If I read the v9 file, re-save it with v11, and then re-read it with v11, the result is the same: ~17 seconds.

On a Mac (10-core M1) it's about 12 seconds.

I understand there were some significant changes under the hood in the C++ library with the v10 updates, but based on the release notes no action should be required, since I'm not doing anything fancy, on either Ubuntu (gcc 9.4) or Mac (clang 13).

Any idea what's going on? We store thousands of parquet files, so I'm wondering whether we should avoid >= v11.

Sorry if this is mislabeled as a bug, but it seems like one to me, as I haven't changed anything else and both systems meet the requirements.
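
One way to see where the time goes is a minimal profiling sketch with base R's Rprof, run against the same file as the reprex below (the output file name read_profile.out is arbitrary):

library(arrow)

# Profile one slow read under v11 to see which R-level calls dominate
utils::Rprof("read_profile.out")
x <- arrow::read_parquet("~/Desktop/test_data_v9.parquet", col_select = 1:200)
utils::Rprof(NULL)

# Top functions by self time
head(utils::summaryRprof("read_profile.out")$by.self)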

Here's a reprex:

library(arrow)
library(microbenchmark)

# fake data: 3500 rows x 16000 columns, each column with extra attributes
test_data <- as.data.frame(
  lapply(setNames(1:16000, paste0("col", 1:16000)), function(x) {
    col <- sample(c(1:5, NA), 3500, replace = TRUE)
    levels(col) <- paste("level", 1:4)         # stored as a "levels" attribute
    attr(col, "text") <- "test question text"  # extra text attribute
    col
  })
)

# write out the file somewhere
file <- "~/Desktop/test_data_v9.parquet"

### using arrow 9.0.0
packageVersion('arrow')
arrow::write_parquet(test_data, file)

### benchmark
microbenchmark::microbenchmark(
  arrow::read_parquet(file, col_select = 1:200), times = 10
)

### NOW SWITCH to arrow 11.0.0.2 and re-read the same 9.0.0 file
packageVersion('arrow')
microbenchmark::microbenchmark(
  arrow::read_parquet(file, col_select = 1:200), times = 10
)

### now re-create, save, re-read using all v11
packageVersion('arrow')
file2 <- "~/Desktop/test_data_v11.parquet"
arrow::write_parquet(test_data, file2)

### benchmark
microbenchmark::microbenchmark(
  arrow::read_parquet(file2, col_select = 1:200), times = 10
)
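
As a further check, here's a sketch using read_parquet's as_data_frame argument to separate the Parquet decode from the conversion back to an R data frame (reading to an Arrow Table skips that conversion):

### check: Parquet decode only vs. decode + conversion to R data frame
microbenchmark::microbenchmark(
  table_only = arrow::read_parquet(file, col_select = 1:200, as_data_frame = FALSE),
  data_frame = arrow::read_parquet(file, col_select = 1:200),
  times = 10
)

If table_only stays fast under v11, the regression is likely in the Arrow-to-R conversion (where the stored column attributes are restored) rather than in the Parquet scan itself.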

Component(s)

R

feature request

Describe the enhancement requested

zxczxc

Component(s)

Java

another test

Describe the bug, including details regarding any error messages, version, and platform.

aczcx asasda
asdasd

Component(s)

Archery, C#, C++ - Gandiva

[CI] A test issue

Describe the bug, including details regarding any error messages, version, and platform.

What a problem!

Component(s)

Archery, Benchmarking, C++

[CI] some issue

Describe the bug, including details regarding any error messages, version, and platform.

nightlies fail so hard

Component(s)

Archery, C#, FlightRPC
