PARQUET-2124

Bad DCHECK For Intermixed Dictionary Encoding

Details

    Description

      Parquet CPP has a DCHECK that fires when a dictionary-encoded page comes after a non-dictionary-encoded page. This is bad because the DCHECK can be triggered by Parquet files in which a column has a dictionary page, then a non-dictionary-encoded page, then a page of dictionary-encoded values (indices). Fuzzing found such a file. While the DCHECK could be turned into an exception, I don't see anything in the Parquet specification that prohibits such a sequence of pages.

      This situation has been brought up on the mailing list before (https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos), and it seems that this is valid but nobody is doing it.

      In the PR that added this check (https://github.com/apache/parquet-cpp/pull/73), it was noted that the check is probably not needed.
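
      To illustrate the page sequence described above, here is a minimal sketch of a reader that chooses the decoder per page instead of asserting on page order. It is not the parquet-cpp code: `Encoding`, `DataPage`, `ColumnChunkReader`, and `DecodeIntermixedExample` are hypothetical names made up for this example, and values are modeled as plain strings. The only assumption it encodes is the one from the description, namely that a dictionary page, once read, can stay available for later dictionary-encoded pages even if a plain page sits in between.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified page model (NOT the parquet-cpp types).
enum class Encoding { kPlain, kDictionary };

struct DataPage {
  Encoding encoding;
  std::vector<std::string> plain_values;  // used when encoding == kPlain
  std::vector<int32_t> dict_indices;      // used when encoding == kDictionary
};

// Hypothetical reader for one column chunk. The dictionary is kept for the
// whole chunk, so the order of plain and dictionary-encoded pages is free.
class ColumnChunkReader {
 public:
  void SetDictionary(std::vector<std::string> dict) {
    dictionary_ = std::move(dict);
    has_dictionary_ = true;
  }

  // Appends the decoded values of `page` to `out`.
  void ReadPage(const DataPage& page, std::vector<std::string>* out) const {
    if (page.encoding == Encoding::kPlain) {
      out->insert(out->end(), page.plain_values.begin(),
                  page.plain_values.end());
      return;
    }
    // Malformed input is reported as an exception rather than a debug-only
    // assertion, since the bytes come from an untrusted file.
    if (!has_dictionary_) {
      throw std::runtime_error("dictionary-encoded page without a dictionary");
    }
    for (int32_t idx : page.dict_indices) {
      if (idx < 0 || static_cast<std::size_t>(idx) >= dictionary_.size()) {
        throw std::out_of_range("dictionary index out of range");
      }
      out->push_back(dictionary_[static_cast<std::size_t>(idx)]);
    }
  }

 private:
  std::vector<std::string> dictionary_;
  bool has_dictionary_ = false;
};

// Decodes the sequence from the description: dictionary page, then a plain
// page, then a page of dictionary indices.
std::vector<std::string> DecodeIntermixedExample() {
  ColumnChunkReader reader;
  reader.SetDictionary({"a", "b"});

  std::vector<std::string> out;
  reader.ReadPage(DataPage{Encoding::kPlain, {"x", "y"}, {}}, &out);
  reader.ReadPage(DataPage{Encoding::kDictionary, {}, {1, 0}}, &out);
  return out;
}
```

      In this sketch the only hard error is a dictionary-encoded page with no dictionary available (or with an out-of-range index); the relative order of plain and dictionary-encoded pages is not checked at all.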

          People

            willb_google William Butler
            Votes: 0
            Watchers: 2

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 10m