Parquet / PARQUET-1977

Invalid data_page_offset

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.12.0
    • Component/s: parquet-mr
    • Labels: None

    Description

      The Thrift field data_page_offset is filled with an incorrect value. Currently, it always points to the beginning of the column chunk, which is not correct according to the spec when there is a dictionary page. This is not a regression, as the field has been written incorrectly since the beginning of parquet-mr.
      Meanwhile, PARQUET-1850 fixed the problem that we never wrote the field dictionary_page_offset. After that fix, we correctly write this field if there is a dictionary page. The problem is that we use the same value to fill both fields. So there are two possibilities:

      • There is no dictionary page in the column chunk, so data_page_offset is filled with the correct value while dictionary_page_offset is not filled, which is still correct. We are good.
      • There is a dictionary page at the beginning of the column chunk, so data_page_offset and dictionary_page_offset both contain the same value. This is not only a regression; it also causes issues in other implementations (e.g. Impala) where footer validation is stricter than in parquet-mr, because dictionary_page_offset must be less than data_page_offset at all times if it is filled.

      So, we need to fill data_page_offset correctly.
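
      The two cases above can be illustrated with a minimal sketch. It is not the parquet-mr implementation: the class ColumnChunkOffsets, its methods, and the parameter dictionaryPageSize are hypothetical names chosen for illustration, and the sketch assumes that a dictionary page, when present, sits at the very start of the column chunk and is immediately followed by the first data page.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; these names do not come from parquet-mr. */
final class ColumnChunkOffsets {
    /** Empty when the column chunk has no dictionary page. */
    final OptionalLong dictionaryPageOffset;
    /** File offset of the first data page in the column chunk. */
    final long dataPageOffset;

    private ColumnChunkOffsets(OptionalLong dictionaryPageOffset, long dataPageOffset) {
        this.dictionaryPageOffset = dictionaryPageOffset;
        this.dataPageOffset = dataPageOffset;
    }

    /**
     * Computes both offsets for one column chunk.
     *
     * @param chunkStart         file offset of the first byte of the column chunk
     * @param dictionaryPageSize total size in bytes (header plus payload) of the
     *                           dictionary page, or 0 if there is none (assumed input)
     */
    static ColumnChunkOffsets of(long chunkStart, long dictionaryPageSize) {
        if (chunkStart < 0 || dictionaryPageSize < 0) {
            throw new IllegalArgumentException("offsets and sizes must be non-negative");
        }
        if (dictionaryPageSize == 0) {
            // Case 1: no dictionary page, so the first data page starts the chunk.
            return new ColumnChunkOffsets(OptionalLong.empty(), chunkStart);
        }
        // Case 2: the dictionary page starts the chunk and the data page follows it.
        return new ColumnChunkOffsets(OptionalLong.of(chunkStart), chunkStart + dictionaryPageSize);
    }

    /** Strict footer check: a filled dictionary_page_offset must be less than data_page_offset. */
    static boolean isValid(OptionalLong dictionaryPageOffset, long dataPageOffset) {
        return !dictionaryPageOffset.isPresent()
                || dictionaryPageOffset.getAsLong() < dataPageOffset;
    }

    public static void main(String[] args) {
        ColumnChunkOffsets fixed = ColumnChunkOffsets.of(4L, 100L);
        System.out.println(isValid(fixed.dictionaryPageOffset, fixed.dataPageOffset));

        // The behavior described in this issue: both fields hold the chunk start.
        System.out.println(isValid(OptionalLong.of(4L), 4L));
    }
}
```

      With a chunk starting at offset 4 and a 100-byte dictionary page, the sketch yields dictionary_page_offset = 4 and data_page_offset = 104, which passes the strict check. Writing 4 into both fields fails it.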

            People

              Assignee: Gabor Szadovszky (gszadovszky)
              Reporter: Gabor Szadovszky (gszadovszky)