Comments (4)
@jiantao-vungle do you have a small reproducible example? Without that it's quite difficult to reproduce this
from iceberg.
It seemed that some mistake was made when writing the Parquet file.
To add to what @nastra said: if you're allowed to share the file that led to the failing query, that would be very helpful for determining whether the file was written correctly. Specifically, I'm looking to check whether the dictionary encoding of this particular column is correct; based on the exception, there is some invalid offset.
If you're not able to share the file, it would be helpful to use parquet-cli or some other Parquet inspection tool to do the dictionary-encoding verification I mentioned.
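For reference, the verification described here boils down to one invariant: every dictionary index stored in the data pages must fall within the bounds of the dictionary page. A minimal stdlib-only sketch of that check (the `dictionary` values come from the output below in this thread; the `indices` lists are made up for illustration, not read from the real file):

```python
def verify_dictionary_indices(dictionary, indices):
    """Return positions of any out-of-bounds dictionary references.

    A dictionary-encoded Parquet column stores each value as an integer
    index into the dictionary page; decoding fails unless every index
    satisfies 0 <= index < len(dictionary).
    """
    return [pos for pos, idx in enumerate(indices)
            if not 0 <= idx < len(dictionary)]

# Hypothetical example: a 3-entry dictionary like the one in this issue.
dictionary = ["CPM", "REVENUE_SHARE", "FLAT_CPM"]

print(verify_dictionary_indices(dictionary, [0, 1, 2, 1, 0]))  # [] — all valid
print(verify_dictionary_indices(dictionary, [0, 1, 3]))        # [2] — index 3 is invalid
```

An empty result means every stored index resolves to a dictionary entry; any reported position marks a corrupt reference of the kind the exceptions below complain about.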
Thanks @nastra and @amogh-jahagirdar. We need to evaluate whether we can share the Parquet file. In the meantime, I did some exploring, like:
➜ parquet dictionary 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet -c publisher_payout_type_at_delivery
the result is:
Row group 0 dictionary for "publisher_payout_type_at_delivery":
0: "CPM"
1: "REVENUE_SHARE"
2: "FLAT_CPM"
➜ parquet check-stats 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet
the result is:
00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet has no corrupt stats
➜ parquet pages 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet -c publisher_payout_type_at_delivery
Column: publisher_payout_type_at_delivery
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict G _ 3 12.00 B 36 B
0-1 data G R 20000 0.14 B 2.729 kB
0-2 data G R 20000 0.13 B 2.621 kB
0-3 data G R 20000 0.17 B 3.289 kB
0-4 data G R 20000 0.13 B 2.615 kB
0-5 data G R 20000 0.14 B 2.674 kB
0-6 data G R 20000 0.14 B 2.665 kB
0-7 data G R 20000 0.14 B 2.669 kB
0-8 data G R 20000 0.14 B 2.726 kB
0-9 data G R 20000 0.13 B 2.602 kB
0-10 data G R 20000 0.14 B 2.818 kB
0-11 data G R 20000 0.13 B 2.585 kB
0-12 data G R 20000 0.15 B 2.834 kB
0-13 data G R 20000 0.13 B 2.585 kB
0-14 data G R 14872 0.14 B 2.022 kB
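As a sanity check on the page sizes above: with only three dictionary entries, each bit-packed index needs 2 bits, so a 20,000-value page should occupy at most about 5 kB before RLE run compression shrinks it further. A back-of-envelope sketch (not a format-exact calculation) shows the observed 2.6–3.3 kB pages are below that bound, so the sizes themselves look plausible:

```python
import math

# Figures taken from the `parquet pages` output above.
dictionary_size = 3       # CPM, REVENUE_SHARE, FLAT_CPM
values_per_page = 20000

# Indices 0..2 need ceil(log2(3)) = 2 bits each when bit-packed.
bit_width = max(1, math.ceil(math.log2(dictionary_size)))

# Upper bound in bytes, ignoring page headers and RLE run compression.
packed_upper_bound = values_per_page * bit_width / 8

print(bit_width)           # 2
print(packed_upper_bound)  # 5000.0 — the ~2.6-3.3 kB pages are smaller,
                           # consistent with RLE runs compressing repeats.
```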
➜ parquet scan 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet -c publisher_payout_type_at_delivery
Unknown error
java.lang.RuntimeException: Failed on record 52315
at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:69)
at org.apache.parquet.cli.Main.run(Main.java:163)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:193)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 52317 in block 0 in file file:/Users/Env/parquet/00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:357)
at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:64)
... 3 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
at org.apache.parquet.avro.AvroConverters$BinaryConverter.addValueFromDictionary(AvroConverters.java:87)
at org.apache.parquet.column.impl.ColumnReaderBase$1.writeValue(ColumnReaderBase.java:186)
at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)
... 7 more
Alternatively, could you share other possible methods to check the dictionary encoding, @amogh-jahagirdar?
And yesterday we encountered a similar problem again:
Py4JJavaError: An error occurred while calling o140.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 5.0 failed 4 times, most recent failure: Lost task 2.3 in stage 5.0 (TID 15) (172.26.33.207 executor 1): org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:942)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:696)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$IntegerUpdater.readValues(ParquetVectorUpdaterFactory.java:256)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatchInternal(VectorizedRleValuesReader.java:244)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:193)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:204)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:316)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:212)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:933)
... 27 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:942)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:696)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$IntegerUpdater.readValues(ParquetVectorUpdaterFactory.java:256)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatchInternal(VectorizedRleValuesReader.java:244)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:193)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:204)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:316)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:212)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:933)
... 27 more
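The `EOFException` from `SingleBufferInputStream.slice` suggests the encoded data for the column ended before the number of values the page promised, so the vectorized reader ran off the end of the buffer. A stdlib-only sketch of that failure mode, using a deliberately simplified run-length format (a toy stand-in, not Parquet's actual RLE/bit-packed hybrid wire format):

```python
import io
import struct

def read_rle_runs(buf, expected_values):
    """Decode simplified (count, value) runs until expected_values are read.

    Each toy run is a 4-byte little-endian count followed by a 4-byte
    value. A buffer that ends mid-run raises EOFError, analogous to the
    EOFException thrown when slicing past the end of a short page buffer.
    """
    out = []
    while len(out) < expected_values:
        header = buf.read(8)
        if len(header) < 8:
            raise EOFError("stream ended before all promised values were read")
        count, value = struct.unpack("<ii", header)
        out.extend([value] * count)
    return out[:expected_values]

# A complete buffer decodes fine...
full = struct.pack("<iiii", 3, 7, 2, 9)
print(read_rle_runs(io.BytesIO(full), 5))  # [7, 7, 7, 9, 9]

# ...but the same data truncated mid-run fails, as the reader here did.
try:
    read_rle_runs(io.BytesIO(full[:12]), 5)
except EOFError as e:
    print(e)
```

If two independently written files hit this, it points at the write path producing short or mis-sized pages rather than at a reader bug, which is why sharing a failing file (or its page-level dump) would help narrow it down.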