  1. Parquet
  2. PARQUET-1575

Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet file with null values

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.12.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      Recently moved from Parquet 1.8.x to 1.12.

      The dataset has more than 20k null values to be written to a complex type. With 1.8.x this produced a single page, but with 1.12 it produces 20 pages (PARQUET-1414). Writing nulls to complex types has been optimised: the nulls are cached (the null cache) and flushed on the next non-null value or on an explicit flush/close. With 1.8, the writer reached the explicit close, flushed the null cache, and wrote the page. With 1.12, the page is written prematurely after 20k values have been encountered, before the null cache has been flushed.
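      The behaviour described above can be illustrated with a toy model. This is only a sketch of the mechanism as the report describes it, not parquet-mr code: the classes ToyColumnWriter and ToyRecordWriter, the function page_value_counts, and the flush_cache_before_page switch are all hypothetical names introduced for illustration. The switch shows one conceivable alternative behaviour and is not a claim about how parquet-mr does or should handle this.

```python
class ToyColumnWriter:
    """Toy stand-in for a column writer (hypothetical, not parquet-mr code)."""

    def __init__(self):
        self.buffered_values = 0  # values that reached the writer since the last page
        self.pages = []           # value count (VC) of each page written so far

    def write_null(self):
        self.buffered_values += 1

    def write_page(self):
        self.pages.append(self.buffered_values)
        self.buffered_values = 0


class ToyRecordWriter:
    """Caches nulls for a complex type and cuts pages on a row-count limit."""

    def __init__(self, column, page_row_limit, flush_cache_before_page):
        self.column = column
        self.page_row_limit = page_row_limit
        self.flush_cache_before_page = flush_cache_before_page
        self.cached_nulls = 0
        self.rows_in_page = 0

    def write_null_record(self):
        self.cached_nulls += 1  # the null is only cached, not yet written
        self.rows_in_page += 1
        if self.rows_in_page >= self.page_row_limit:
            self._end_page()

    def flush_null_cache(self):
        for _ in range(self.cached_nulls):
            self.column.write_null()
        self.cached_nulls = 0

    def _end_page(self):
        if self.flush_cache_before_page:
            self.flush_null_cache()
        self.column.write_page()
        self.rows_in_page = 0

    def close(self):
        self.flush_null_cache()  # an explicit close always flushes the cache
        self.column.write_page()


def page_value_counts(total_nulls, page_row_limit, flush_cache_before_page):
    """Return the value count of every page produced for an all-null column."""
    column = ToyColumnWriter()
    writer = ToyRecordWriter(column, page_row_limit, flush_cache_before_page)
    for _ in range(total_nulls):
        writer.write_null_record()
    writer.close()
    return column.pages
```

      With 10 null records and a row limit of 3, page_value_counts(10, 3, False) returns [0, 0, 0, 10]: every page cut by the limit is empty and the final page carries all the values, which has the same shape as the 1.12 dump in this report. With a limit larger than the record count, page_value_counts(10, 100, False) returns [10], a single page as in 1.8.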

       

      Below is the metadata dump in both cases.

      1.8 :

      index._id TV=111396 RL=0 DL=2
      ----------------------------------------------------------------------------
      page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396

       

      1.12 :

      index._index TV=111396 RL=0 DL=2
      ----------------------------------------------------------------------------
      page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
      ......
      page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396

      In 1.12, all pages except the last have the same metadata as page 0. The problem appears when the Parquet reader processes such a page: it sees that the RLE is bit-packed and reads 8 bytes, which goes past the end of the stream because the page size is only 4 bytes. This produces the error "Reading past RLE/BitPacking stream".
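      The page sizes in the dumps (SZ:8 and SZ:4) can be checked with a small calculation. This is a sketch under stated assumptions, not something the report itself states: it assumes a v1 data page that holds only definition levels (RL=0 and all values null), that the levels are written as a 4-byte length prefix followed by a single RLE run, and that the run consists of a varint header (run length shifted left by one bit) plus the repeated value rounded up to whole bytes. The function names are hypothetical.

```python
def varint_len(n):
    """Number of bytes in the unsigned varint (LEB128) encoding of n."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def rle_levels_size(value_count, max_level):
    """Bytes needed for value_count identical levels in one RLE run.

    Assumes a 4-byte length prefix, then (if there are any values) a
    varint run header followed by the repeated value in whole bytes.
    """
    length_prefix = 4
    if value_count == 0:
        return length_prefix  # only the prefix, no run at all
    bit_width = max_level.bit_length()   # bits needed for levels 0..max_level
    value_bytes = (bit_width + 7) // 8   # repeated value, rounded up to bytes
    run_header = varint_len(value_count << 1)
    return length_prefix + run_header + value_bytes
```

      Under these assumptions, rle_levels_size(111396, 2) gives 8 and rle_levels_size(0, 2) gives 4, matching SZ:8 for the page with VC:111396 and SZ:4 for the pages with VC:0. Read this way, a 4-byte page contains only the length prefix and no level data, so any attempt to decode levels from it would run past the end of the stream.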

          People

            Assignee: Unassigned
            Reporter: shyam narayan singh (shyamsingh)
            Votes: 0
            Watchers: 3
