PARQUET-511

Integer overflow on counting values in column


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.1
    • Fix Version/s: 1.9.0, 1.8.2
    • Component/s: parquet-mr
    • Labels: None

    Description

      Parquet will ignore a column if the total number of values in the column is larger than a signed 32-bit int can hold (Integer.MAX_VALUE, i.e. 2,147,483,647).

      The issue is that when the column reader is initialized and the repetition and definition levels are set up per column, the integer value count overflows, so these levels are not recorded properly. During the read, the stored level then no longer matches the reader's current level, and a null value is returned instead of the real one. Since there is no overflow check, no exception is thrown and the data simply appears to be corrupted.

      This happened to us with a fairly complex schema: an array of maps, which in turn contained arrays. There were over 4 billion values across all column pages in one row group, which is what triggered the overflow.

      Relevant stack trace:
      org.apache.parquet.io.ParquetDecodingException: Can not read value at 172310 in block 0 in file <redacted>
      at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
      at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
      ...
      at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
      at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
      at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
      at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
      at org.apache.spark.scheduler.Task.run(Task.scala:70)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: <redacted> INT64 at value 95584934 out of 95530352, 130598 out of 130598 in currentPage. repetition level: 0, definition level: 2
      at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:484)
      at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
      at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
      at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
      ... 18 more
      Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
      at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
      at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
      at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
      at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong(DictionaryValuesReader.java:121)
      at org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:263)
      at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
      ... 21 more


People

  Assignee: goreckim (Michal Gorecki)
  Reporter: goreckim (Michal Gorecki)
  Votes: 0
  Watchers: 3
