Comments (9)
I have reproduced the issue. The root cause is that the reader tries to read col3 (columnId = 4), which has no stream data: both its PRESENT and DATA streams have zero length, as listed below. The parent of col3 is col2 (columnId = 3), whose values are all null, so the reader should stop at col2 without touching col3.
Rows: 50000
Compression: ZLIB
Compression size: 65536
Calendar: Julian/Gregorian
Type: struct<col0:struct<col1:int>,col2:struct<col3:int>>
Stripe Statistics:
Stripe 1:
Column 0: count: 50000 hasNull: false
Column 1: count: 50000 hasNull: false
Column 2: count: 50000 hasNull: false min: 0 max: 149992 sum: 3752012883
Column 3: count: 0 hasNull: true
Column 4: count: 0 hasNull: true sum: 0
File Statistics:
Column 0: count: 50000 hasNull: false
Column 1: count: 50000 hasNull: false
Column 2: count: 50000 hasNull: false min: 0 max: 149992 sum: 3752012883
Column 3: count: 0 hasNull: true
Column 4: count: 0 hasNull: true sum: 0
Stripes:
Stripe: offset: 3 data: 129019 rows: 50000 tail: 68 index: 216
Stream: column 0 section ROW_INDEX start: 3 length 17
Stream: column 1 section ROW_INDEX start: 20 length 17
Stream: column 2 section ROW_INDEX start: 37 length 122
Stream: column 3 section ROW_INDEX start: 159 length 35
Stream: column 4 section ROW_INDEX start: 194 length 25
Stream: column 2 section DATA start: 219 length 129007
Stream: column 3 section PRESENT start: 129226 length 12
Stream: column 4 section PRESENT start: 129238 length 0
Stream: column 4 section DATA start: 129238 length 0
Encoding column 0: DIRECT
Encoding column 1: DIRECT
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT
Encoding column 4: DIRECT_V2
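For readers mapping the column numbers in the dump back to field names: ORC numbers columns by a pre-order walk of the type tree, with the root struct as column 0. The sketch below reproduces that numbering for the type shown above. The nested-list schema representation and the `assign_column_ids` helper are illustrative assumptions, not part of any ORC API.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical schema model: a primitive is a type name, a struct is a
# list of (field name, field type) pairs.
OrcType = Union[str, List[Tuple[str, "OrcType"]]]


def assign_column_ids(root: OrcType) -> Dict[int, str]:
    """Number columns in pre-order, starting with the root struct at 0."""
    ids: Dict[int, str] = {}

    def visit(name: str, node: OrcType) -> None:
        ids[len(ids)] = name  # a parent is numbered before its children
        if isinstance(node, list):
            for child_name, child in node:
                visit(child_name, child)

    visit("<root>", root)
    return ids


# struct<col0:struct<col1:int>,col2:struct<col3:int>>
schema = [("col0", [("col1", "int")]), ("col2", [("col3", "int")])]
print(assign_column_ids(schema))
# {0: '<root>', 1: 'col0', 2: 'col1', 3: 'col2', 4: 'col3'}
```

This matches the dump: column 3 (col2) reports count 0, and column 4 (col3) is the child whose PRESENT and DATA streams are empty.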
from orc.
I will file a JIRA and fix it shortly.
Thank you for reporting, @jnwan .
@wgtmac has explained the root cause well. I just want to reemphasize that the same issue happens with other complex column types such as map: an empty map will also produce the "bad read in nextBuffer" error.
This issue has been fixed in the main branch. Please give it a try and let us know if there are any problems. Thanks @jnwan !
ColumnReader needs to fix this bug by handling the case where the data stream does not exist.
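A toy model of the idea, assuming nothing about the actual patch: a struct reader that asks its child only for values belonging to non-null rows never touches a child whose DATA stream is empty. All class names here (`IntColumn`, `StructColumn`, `BadReadError`) and the `skip_null_rows` flag are hypothetical stand-ins, not the ORC C++ classes.

```python
from dataclasses import dataclass
from typing import List, Optional


class BadReadError(Exception):
    """Stand-in for the "bad read in nextBuffer" failure seen in this issue."""


@dataclass
class IntColumn:
    """Toy leaf reader over an already-decoded DATA stream."""
    column_id: int
    data: List[int]  # an empty list models a zero-length DATA stream
    pos: int = 0

    def read(self, count: int) -> List[int]:
        if self.pos + count > len(self.data):
            raise BadReadError(f"column {self.column_id}: bad read in nextBuffer")
        values = self.data[self.pos:self.pos + count]
        self.pos += count
        return values


@dataclass
class StructColumn:
    """Toy struct reader: present[i] is False where row i is null."""
    column_id: int
    present: List[bool]
    child: IntColumn
    pos: int = 0

    def read(self, count: int, skip_null_rows: bool = True) -> List[Optional[int]]:
        mask = self.present[self.pos:self.pos + count]
        self.pos += count
        if skip_null_rows:
            # Only non-null struct rows have child values stored.
            needed = sum(mask)
            values = iter(self.child.read(needed)) if needed else iter(())
        else:
            # Buggy variant: requests one child value per row, null or not.
            values = iter(self.child.read(count))
        return [next(values) if is_set else None for is_set in mask]


# col2 (column 3) is all null, col3 (column 4) has no data at all.
col2 = StructColumn(3, [False] * 5, IntColumn(4, []))
print(col2.read(5))  # [None, None, None, None, None]
```

Calling `read(5, skip_null_rows=False)` on a fresh instance of the same all-null column raises `BadReadError`, which mirrors the failure reported here.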
Thank you so much, @wgtmac and @coderex2522 !
I have created a new issue in JIRA.
@wgtmac has explained the root cause well. I just want to reemphasize that the same issue happens with other complex column types such as map: an empty map will also produce the "bad read in nextBuffer" error.
Verified the issue got fixed! Thank you!