Parquet / PARQUET-1633

Integer overflow in ParquetFileReader.ConsecutiveChunkList


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      When reading a large Parquet file (2.8GB), I encounter the following exception:

      Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
      at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
      at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
      at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
      ... 14 more
      Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
      at java.util.ArrayList.<init>(ArrayList.java:157)
      at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169)

       

      The file metadata is:

      • block 1 (3 columns)
        • rowCount: 110,100
        • totalByteSize: 348,492,072
        • compressedSize: 165,689,649
      • block 2 (3 columns)
        • rowCount: 90,054
        • totalByteSize: 3,243,165,541
        • compressedSize: 2,509,579,966
      • block 3 (3 columns)
        • rowCount: 105,119
        • totalByteSize: 350,901,693
        • compressedSize: 144,952,177
      • block 4 (3 columns)
        • rowCount: 48,741
        • totalByteSize: 1,275,995
        • compressedSize: 914,205

      Unfortunately, I don't have the code to reproduce the issue; however, I looked at the source and it appears that the integer length field in ConsecutiveChunkList overflows, which results in a negative capacity for the ArrayList in the readAll method:

      int fullAllocations = length / options.getMaxAllocationSize();
      int lastAllocationSize = length % options.getMaxAllocationSize();
      	
      int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
      List<ByteBuffer> buffers = new ArrayList<>(numAllocations);
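
      For a concrete illustration, here is a minimal, self-contained sketch (not parquet-mr code) of the arithmetic above, assuming the length has already wrapped to the negative value produced by casting block 2's compressed size (see the cast discussed below) and assuming parquet-mr's default 8 MB max allocation size; under those assumptions the result matches the "Illegal Capacity: -212" in the stack trace:

      import java.nio.ByteBuffer;
      import java.util.ArrayList;
      import java.util.List;

      public class NegativeCapacityDemo {
        public static void main(String[] args) {
          // Assumption: block 2's compressed size (2,509,579,966 bytes) after the int cast.
          int length = (int) 2_509_579_966L;                    // wraps to -1_785_387_330
          // Assumption: parquet-mr's default max allocation size of 8 MB.
          int maxAllocationSize = 8 * 1024 * 1024;

          // Same arithmetic as the readAll snippet above, applied to a negative length.
          int fullAllocations = length / maxAllocationSize;     // -212
          int lastAllocationSize = length % maxAllocationSize;  // negative, so no extra allocation
          int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);

          System.out.println(numAllocations);                   // prints -212
          // Throws java.lang.IllegalArgumentException: Illegal Capacity: -212
          List<ByteBuffer> buffers = new ArrayList<>(numAllocations);
        }
      }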

       

      This is caused by the cast to int in the readNextRowGroup method of ParquetFileReader:

      currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, (int)mc.getTotalSize()));
      

      which overflows when the total size of the column is larger than Integer.MAX_VALUE.
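
      To show the narrowing itself, here is a tiny standalone sketch (again not parquet-mr code, just the Java cast semantics): the cast silently keeps only the low 32 bits of the long value, whereas a checked conversion such as Math.toIntExact would fail fast instead of handing a negative size downstream:

      public class CastOverflowDemo {
        public static void main(String[] args) {
          long totalSize = 2_509_579_966L;        // block 2's compressed size, > Integer.MAX_VALUE
          int truncated = (int) totalSize;        // silently wraps to -1_785_387_330
          System.out.println(truncated);
          // Math.toIntExact throws ArithmeticException("integer overflow") here,
          // surfacing the problem at the conversion instead of inside ArrayList.
          System.out.println(Math.toIntExact(totalSize));
        }
      }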

      I would appreciate your help addressing this issue. Thanks!

       


          People

            Assignee: Edward Wright (edw_vtxa)
            Reporter: Ivan Sadikov (sadikovi)
            Votes: 3
            Watchers: 9
