nastra commented on June 14, 2024

@jiantao-vungle do you have a small reproducible example? Without that, it's quite difficult to reproduce this.

amogh-jahagirdar commented on June 14, 2024

> It seemed that it made some mistake when writing the parquet file.

To add to what @nastra said: if you're allowed to share the file that led to the failing query, that would be very helpful for determining whether the file was written correctly. Specifically, I'm looking to check whether the dictionary encoding of this particular column is correct; based on the exception, there's some invalid offset.

If you're not able to share the file, then it would be helpful to use parquet-cli or some other Parquet inspection tool to do the dictionary-encoding verification I mentioned.
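
For anyone following along, here is a minimal sketch of one such check using PyArrow rather than parquet-cli (a tooling assumption; the file and column names are taken from the report, and the sketch assumes a locally readable file with a top-level string column):

```python
# Hedged sketch: inspect the recorded encodings for the column and try to
# decode it one row group at a time. Assumes PyArrow is installed and the
# file has been copied locally.
import pyarrow.parquet as pq

PATH = "00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet"
COLUMN = "publisher_payout_type_at_delivery"

pf = pq.ParquetFile(PATH)

# 1. Print the encodings the writer recorded for this column chunk in each
#    row group; a dictionary-encoded chunk typically lists PLAIN_DICTIONARY
#    or RLE_DICTIONARY here.
for rg in range(pf.metadata.num_row_groups):
    for c in range(pf.metadata.num_columns):
        col = pf.metadata.row_group(rg).column(c)
        if col.path_in_schema == COLUMN:
            print(f"row group {rg}: encodings={col.encodings}, values={col.num_values}")

# 2. Decode the column one row group at a time; a failure here narrows the
#    corruption down to a specific row group.
for rg in range(pf.metadata.num_row_groups):
    try:
        pf.read_row_group(rg, columns=[COLUMN])
        print(f"row group {rg}: decoded OK")
    except Exception as exc:
        print(f"row group {rg}: FAILED ({exc})")
```

If PyArrow decodes the column cleanly while parquet-mr fails, that in itself would be a useful data point about where the bug sits.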

jiantao-vungle commented on June 14, 2024

Thanks @nastra and @amogh-jahagirdar. We need to evaluate whether we can share the Parquet file; in the meantime, I did some exploration, like:

➜ parquet dictionary 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet -c publisher_payout_type_at_delivery

the result is:

Row group 0 dictionary for "publisher_payout_type_at_delivery":
     0: "CPM"
     1: "REVENUE_SHARE"
     2: "FLAT_CPM"

➜ parquet check-stats 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet

the result is:

00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet has no corrupt stats

➜ parquet pages 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet -c publisher_payout_type_at_delivery

Column: publisher_payout_type_at_delivery
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  G _  3       12.00 B    36 B
  0-1    data  G R  20000   0.14 B     2.729 kB
  0-2    data  G R  20000   0.13 B     2.621 kB
  0-3    data  G R  20000   0.17 B     3.289 kB
  0-4    data  G R  20000   0.13 B     2.615 kB
  0-5    data  G R  20000   0.14 B     2.674 kB
  0-6    data  G R  20000   0.14 B     2.665 kB
  0-7    data  G R  20000   0.14 B     2.669 kB
  0-8    data  G R  20000   0.14 B     2.726 kB
  0-9    data  G R  20000   0.13 B     2.602 kB
  0-10   data  G R  20000   0.14 B     2.818 kB
  0-11   data  G R  20000   0.13 B     2.585 kB
  0-12   data  G R  20000   0.15 B     2.834 kB
  0-13   data  G R  20000   0.13 B     2.585 kB
  0-14   data  G R  14872   0.14 B     2.022 kB

➜ parquet scan 00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet -c publisher_payout_type_at_delivery

Unknown error
java.lang.RuntimeException: Failed on record 52315
	at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:69)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:193)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 52317 in block 0 in file file:/Users/Env/parquet/00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
	at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:357)
	at org.apache.parquet.cli.commands.ScanCommand.run(ScanCommand.java:64)
	... 3 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
	at org.apache.parquet.avro.AvroConverters$BinaryConverter.addValueFromDictionary(AvroConverters.java:87)
	at org.apache.parquet.column.impl.ColumnReaderBase$1.writeValue(ColumnReaderBase.java:186)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)
	... 7 more
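
Worth noting for later readers: the `ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3` says a data page referenced dictionary index 3, while the dictionary shown above holds only three entries (indices 0-2), which points at a mis-written page rather than a reader bug. Below is a hedged PyArrow sketch for surfacing such out-of-range indices directly (passing `read_dictionary` keeps the column dictionary-encoded so the raw indices stay visible):

```python
# Hedged sketch: read the column as a dictionary array and compare the
# largest index in each batch against the dictionary length. An index of 3
# against a 3-entry dictionary would match the exception above. PyArrow may
# itself fail on a corrupt page; if so, the failing batch number still
# narrows down the location.
import pyarrow.compute as pc
import pyarrow.parquet as pq

PATH = "00339-809-290052b9-087e-4bda-b9a6-716fb7ef841c-00001.parquet"
COLUMN = "publisher_payout_type_at_delivery"

pf = pq.ParquetFile(PATH, read_dictionary=[COLUMN])
for i, batch in enumerate(pf.iter_batches(columns=[COLUMN])):
    arr = batch.column(0)  # pyarrow.DictionaryArray
    max_idx = pc.max(arr.indices).as_py()
    if max_idx is not None and max_idx >= len(arr.dictionary):
        print(f"batch {i}: index {max_idx} >= dictionary size {len(arr.dictionary)}")
```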

Alternatively, could you share other possible methods to check the dictionary encoding, @amogh-jahagirdar?

jiantao-vungle commented on June 14, 2024

And yesterday we encountered a similar problem again:

Py4JJavaError: An error occurred while calling o140.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 5.0 failed 4 times, most recent failure: Lost task 2.3 in stage 5.0 (TID 15) (172.26.33.207 executor 1): org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:942)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:696)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$IntegerUpdater.readValues(ParquetVectorUpdaterFactory.java:256)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatchInternal(VectorizedRleValuesReader.java:244)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:193)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:204)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:316)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:212)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:933)
	... 27 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:942)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:696)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$IntegerUpdater.readValues(ParquetVectorUpdaterFactory.java:256)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBatchInternal(VectorizedRleValuesReader.java:244)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:193)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:204)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:316)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:212)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:933)
	... 27 more
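
Reading this second trace the same way: the `EOFException` thrown by `SingleBufferInputStream.slice` inside `readNextGroup` suggests an RLE/bit-packed run declared more values than the page buffer actually holds, i.e. again a page that appears mis-written rather than a reader-side bug. If that's the case, a sweep like the following hedged sketch (paths are hypothetical) could help gauge how many data files are affected before deciding to rewrite them:

```python
# Hedged sketch: scan every data file under a hypothetical local copy of the
# table and report each row group whose suspect column fails to decode.
import glob
import pyarrow.parquet as pq

COLUMN = "publisher_payout_type_at_delivery"

for path in sorted(glob.glob("table/data/**/*.parquet", recursive=True)):
    pf = pq.ParquetFile(path)
    for rg in range(pf.metadata.num_row_groups):
        try:
            pf.read_row_group(rg, columns=[COLUMN])
        except Exception as exc:
            print(f"{path} row group {rg}: {exc}")
```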
