Parquet / PARQUET-82

ColumnChunkPageWriteStore assumes pages are smaller than Integer.MAX_VALUE


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: parquet-mr
    • Labels: None

      Description

      The ColumnChunkPageWriteStore casts both the compressed size and the uncompressed size of a page from a long to an int. If the uncompressed size of a page exceeds Integer.MAX_VALUE, the write does not fail; instead it silently records bad metadata:

      chunk1: BINARY GZIP DO:0 FPO:4 SZ:267184096/-2143335445/-8.02 VC:41 ENC:BIT_PACKED,PLAIN
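The second number in SZ (the uncompressed size) is negative because a narrowing cast keeps only the low 32 bits of the long. A minimal sketch of that truncation; the 2,151,631,851-byte size is an assumed value, chosen only because it truncates to the value shown above:

```java
public class IntCastOverflow {
    public static void main(String[] args) {
        // Assumed uncompressed page size just past Integer.MAX_VALUE;
        // any long congruent to it modulo 2^32 truncates the same way.
        long uncompressedSize = 2_151_631_851L;

        // The narrowing cast discards the high 32 bits, so a size larger
        // than Integer.MAX_VALUE silently becomes negative.
        int truncated = (int) uncompressedSize;

        System.out.println(truncated); // prints -2143335445, as in SZ above
    }
}
```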
      

      At read time, BytesInput tries to allocate a byte array for the uncompressed data and fails:

      Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://nameservice1/OUTPUT/part-m-00000.gz.parquet 
      at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177) 
      at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130) 
      at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95) 
      at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66) 
      at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51) 
      at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65) 
      ... 16 more 
      Caused by: java.lang.NegativeArraySizeException 
      at parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:183) 
      at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521) 
      at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493) 
      at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:544) 
      at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339) 
      at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63) 
      at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58) 
      at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265) 
      at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:59) 
      at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:73) 
      at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110) 
      at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172) 
      ... 21 more
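The NegativeArraySizeException follows directly: the reader takes the stored (now negative) uncompressed size and uses it as an array length. A sketch of the read-time failure and of a fail-fast alternative at write time; checkedSize is a hypothetical guard for illustration, not the actual change that shipped in 1.6.0:

```java
public class PageSizeSketch {
    // Hypothetical guard: Math.toIntExact throws ArithmeticException on
    // overflow instead of silently narrowing the long as a plain cast does.
    static int checkedSize(long size) {
        return Math.toIntExact(size);
    }

    public static void main(String[] args) {
        int storedSize = -2143335445; // the truncated size from the metadata

        try {
            // What the reader effectively does with the bad metadata.
            byte[] page = new byte[storedSize];
        } catch (NegativeArraySizeException e) {
            System.out.println("read fails: " + e);
        }

        try {
            // The same overflow caught at write time instead of read time.
            checkedSize(2_151_631_851L);
        } catch (ArithmeticException e) {
            System.out.println("write could fail fast instead");
        }
    }
}
```

Failing at write time is strictly better here: the bad file is never produced, rather than being discovered only when a downstream reader tries to allocate the page.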
      

            People

            • Assignee: rdblue (Ryan Blue)
            • Reporter: rdblue (Ryan Blue)
            • Votes: 0
            • Watchers: 1
