PARQUET-1977: Invalid data_page_offset

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.12.0
    • Component/s: parquet-mr
    • Labels: None

      Description

      The Thrift field data_page_offset is filled with an incorrect value. Currently, it always points to the beginning of the column chunk, which is not correct according to the spec when the chunk contains a dictionary page. This is not a regression: the field has been written incorrectly since the beginning of parquet-mr.
      Meanwhile, PARQUET-1850 fixed the problem that the field dictionary_page_offset was never written. After that fix, we correctly write this field if there is a dictionary page. The problem is that we use the same value to fill both fields. So there are two possibilities:

      • There is no dictionary page in the column chunk, so data_page_offset is filled with the correct value and dictionary_page_offset is not filled, which is also correct. We are good.
      • There is a dictionary page at the beginning of the column chunk, so data_page_offset and dictionary_page_offset both contain the same value. This is not only a regression, it also causes issues in other implementations (e.g. Impala) where footer validation is stricter than in parquet-mr, because dictionary_page_offset must be less than data_page_offset at all times if it is filled.

      So, we need to fill data_page_offset correctly.
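
      The two cases above can be illustrated with a minimal, self-contained sketch. It does not use the parquet-mr API; the class DataPageOffsetSketch, the holder ChunkOffsets, and both methods are hypothetical names made up for this illustration. It also assumes that the dictionary page, when present, sits at the very start of the column chunk and that the first data page follows it directly, so the data page offset is the chunk start plus the total size of the dictionary page (header and data).

```java
import java.util.OptionalLong;

// Illustration only: hypothetical names, not the parquet-mr implementation.
public class DataPageOffsetSketch {

    /** Hypothetical holder for the two footer offsets of one column chunk. */
    public static final class ChunkOffsets {
        public final OptionalLong dictionaryPageOffset;
        public final long dataPageOffset;

        public ChunkOffsets(OptionalLong dictionaryPageOffset, long dataPageOffset) {
            this.dictionaryPageOffset = dictionaryPageOffset;
            this.dataPageOffset = dataPageOffset;
        }
    }

    /**
     * Computes both offsets for a column chunk starting at chunkStart.
     * dictionaryPageTotalSize is the assumed on-disk size of the dictionary
     * page (header plus data), or 0 if the chunk has no dictionary page.
     */
    public static ChunkOffsets computeOffsets(long chunkStart, long dictionaryPageTotalSize) {
        if (dictionaryPageTotalSize > 0) {
            // Dictionary page first, first data page right after it.
            return new ChunkOffsets(OptionalLong.of(chunkStart),
                                    chunkStart + dictionaryPageTotalSize);
        }
        // No dictionary page: the first data page starts the chunk.
        return new ChunkOffsets(OptionalLong.empty(), chunkStart);
    }

    /** Mimics the strict check described above: dictionary offset < data offset when set. */
    public static boolean passesStrictValidation(ChunkOffsets offsets) {
        return !offsets.dictionaryPageOffset.isPresent()
            || offsets.dictionaryPageOffset.getAsLong() < offsets.dataPageOffset;
    }

    public static void main(String[] args) {
        // The problematic footer: both fields carry the chunk start.
        ChunkOffsets sameValue = new ChunkOffsets(OptionalLong.of(4L), 4L);
        // The intended footer for a chunk at offset 4 with a 120-byte dictionary page.
        ChunkOffsets distinct = computeOffsets(4L, 120L);

        System.out.println(passesStrictValidation(sameValue)); // false
        System.out.println(passesStrictValidation(distinct));  // true
        System.out.println(distinct.dataPageOffset);           // 124
    }
}
```

      A reader with strict footer validation would reject the first footer and accept the second, while a chunk without a dictionary page passes either way because dictionary_page_offset is left unset.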

              People

              • Assignee: Gabor Szadovszky (gszadovszky)
              • Reporter: Gabor Szadovszky (gszadovszky)