Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.8.1
Component/s: None
Description
Parquet will silently ignore a column if the combined number of values in that column's pages exceeds the capacity of a 32-bit int (Integer.MAX_VALUE).
The issue is that when the column reader is initialized, the repetition and definition levels are set up per column; the value count overflows the int that holds it, so these levels are not set properly. Then, during the read, the stored level does not match the reader's current level, and a null value is returned instead of the actual data. Since there is no overflow check, no exception is thrown, and the data simply appears to be corrupted.
This happened to us with a fairly complex schema: an array of maps, whose values in turn contained arrays. One row group held over 4 billion values across all of its column pages, which is what triggered the overflow.
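The arithmetic can be reproduced in isolation. The sketch below uses hypothetical names (it is not the actual parquet-mr code) to show how accumulating per-page value counts into an int wraps silently past Integer.MAX_VALUE, after which a bounds check against the wrapped total gives the wrong answer:

{code:java}
// Minimal sketch of the overflow, using hypothetical names rather than
// the actual parquet-mr internals.
public class ValueCountOverflow {
    public static void main(String[] args) {
        long pageValueCount = 500_000_000L; // values per column page
        int pages = 9;                      // 4.5 billion values in the row group

        int totalValueCount = 0;            // 32-bit accumulator, as in the bug
        for (int i = 0; i < pages; i++) {
            totalValueCount += pageValueCount; // narrows and wraps silently
        }

        System.out.println("correct total: " + (pageValueCount * pages)); // 4500000000
        System.out.println("int total:     " + totalValueCount);          // 205032704

        // A reader that compares "values read so far" against the wrapped
        // total concludes the column is exhausted long before it really is.
        long valuesRead = 300_000_000L;
        System.out.println("exhausted? " + (valuesRead >= totalValueCount)); // true, wrongly
    }
}
{code}

Note that the compound assignment totalValueCount += pageValueCount narrows the long implicitly, which is exactly why nothing surfaces at the point of the overflow.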
Relevant stack trace
org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 in block 0 in file <redacted>
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
...
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: <redacted> INT64 at value 95584934 out of 95530352, 130598 out of 130598 in currentPage. repetition level: 0, definition level: 2
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
... 18 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
... 21 more
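Since the root cause is the missing overflow check noted in the description, a natural remedy is to do the counting in 64-bit arithmetic and fail fast on any narrowing. The sketch below is a hypothetical illustration of that style of guard, not the patch that actually resolved this issue:

{code:java}
import java.util.List;

// Hypothetical guard, shown only to illustrate the general remedy;
// not the actual fix applied in parquet-mr.
public final class SafeValueCounts {
    private SafeValueCounts() {}

    // Sum per-page value counts in 64-bit arithmetic, failing fast
    // instead of wrapping if even the long overflows.
    public static long totalValueCount(List<Long> pageValueCounts) {
        long total = 0L;
        for (long count : pageValueCounts) {
            total = Math.addExact(total, count); // throws ArithmeticException on overflow
        }
        return total;
    }

    // Where an int is unavoidable, throw instead of truncating.
    public static int toIntChecked(long count) {
        return Math.toIntExact(count); // throws if count > Integer.MAX_VALUE
    }
}
{code}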