Parquet / PARQUET-2060

Parquet corruption can cause infinite loop with Snappy


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: parquet-mr
    • Labels: None

    Description

      I am attaching a valid and a corrupt Parquet file (DataPageV2) that differ in a single byte.

      We hit an infinite loop when trying to read the corrupt file in https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java#L698, specifically in the `page.getData().toInputStream()` call.

      Stack trace of the infinite loop:

      java.io.DataInputStream.readFully(DataInputStream.java:195)
      java.io.DataInputStream.readFully(DataInputStream.java:169)
      org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:287)
      org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
      org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
      org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:698)
      org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:57)
      org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:628)
      org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
      org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192)
      org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
      org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)


      The call to `readFully` goes through `NonBlockedDecompressorStream`, which always hits this path: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressorStream.java#L45. As a result, `setInput` is never called on the decompressor, and the subsequent calls to `decompress` always hit this condition: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyDecompressor.java#L54. The `read` method therefore returns 0, which causes an infinite loop in https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/io/DataInputStream.java#L198.

      This originates from the corruption, which causes the input stream of the data page to be of size 0, so `getCompressedData` always returns -1. The sketch below illustrates the resulting loop.
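      For illustration, here is a minimal, self-contained sketch (the class names are hypothetical, not parquet-mr code) of why a `read` that keeps returning 0 makes `DataInputStream.readFully` spin: the loop only exits on a negative (EOF) return or once the buffer is full, so a constant 0 makes no progress. The iteration cap exists only so the demo terminates.

```java
import java.io.IOException;
import java.io.InputStream;

public class ZeroReadLoopDemo {
    // Stands in for the decompressor stream on the corrupt page:
    // read(byte[], int, int) returns 0 ("no data right now"), never -1 (EOF).
    static class ZeroReturningStream extends InputStream {
        @Override
        public int read() {
            return -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            return 0;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[6];
        InputStream in = new ZeroReturningStream();
        int n = 0;
        // DataInputStream.readFully is essentially this loop, minus the cap:
        // with a constant 0 return, n never grows and the loop never exits.
        for (int i = 0; i < 5 && n < buf.length; i++) {
            int count = in.read(buf, n, buf.length - n);
            if (count < 0) {
                break; // a negative return (EOF) would end the loop; 0 does not
            }
            n += count;
            System.out.println("iteration " + i + ": " + n + " of " + buf.length + " bytes");
        }
    }
}
```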

      I am wondering whether this can be caught earlier so that the read fails in the case of such corruption.

      Since this happens in `BytesInput.toInputStream`, I don't think it is relevant only to DataPageV2.


      In https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111, if we call `bytes.toByteArray` and log its length, it is 0 for the corrupt file and 6 for the valid file.

      A potential fix is to check the array size there and fail early, but I am not sure whether a zero-length byte array can ever be expected for valid files. A sketch of such a guard follows.
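      For illustration only, here is a sketch of that early-failure guard, assuming the `BytesInput`/`uncompressedSize` shape of the decompress path linked above (the helper class is hypothetical, not the actual parquet-mr fix):

```java
import java.io.IOException;

import org.apache.parquet.bytes.BytesInput;

// Hypothetical helper: fail fast when the compressed buffer is empty but
// the page header still promises uncompressed bytes, instead of letting
// readFully spin on a stream that never produces data.
final class EmptyPageGuard {
    static void check(BytesInput bytes, int uncompressedSize) throws IOException {
        byte[] compressed = bytes.toByteArray();
        if (compressed.length == 0 && uncompressedSize > 0) {
            throw new IOException("Corrupt page: expected " + uncompressedSize
                + " uncompressed bytes but the compressed buffer is empty");
        }
    }
}
```

      The guard only fires when `uncompressedSize > 0`, precisely because it is unclear whether an empty buffer with an expected size of 0 can legitimately occur in valid files.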


      Attached:

      Valid file: `datapage_v2.snappy.parquet`

      Corrupt file: `datapage_v2.snappy.parquet1383`

      Attachments

        1. datapage_v2.snappy.parquet1383 (1 kB, uploaded by Marios Meimaris)
        2. datapage_v2.snappy.parquet (1 kB, uploaded by Marios Meimaris)


      People

        Assignee: Rathin Bhargava (mindjolt)
        Reporter: Marios Meimaris (mmeimaris)
        Votes: 0
        Watchers: 3
