The Thrift field data_page_offset is filled with an incorrect value. Currently, it always points to the beginning of the column chunk, which is not correct according to the spec when the chunk contains a dictionary page. This is not a regression, as the field has been written incorrectly since the beginning of parquet-mr.
PARQUET-1850 fixed the problem that we never wrote the field dictionary_page_offset. After that fix we correctly write this field if there is a dictionary page. The problem is that we use the same value to fill both fields. So there are two possibilities:
- There is no dictionary page in the column chunk, so data_page_offset is filled with the correct value and dictionary_page_offset is not filled, which is also correct. We are good.
- There is a dictionary page at the beginning of the column chunk, so data_page_offset and dictionary_page_offset both contain the same value. This is not only a regression; it also causes issues in other implementations (e.g. Impala) where footer validation is stricter than in parquet-mr, because dictionary_page_offset must always be less than data_page_offset if it is filled.
So, we need to fill data_page_offset correctly.
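
The intended relationship between the two offsets can be sketched as follows. This is only an illustration and not the parquet-mr code: the class ChunkOffsets, its fields, and the parameter dictionaryPageLength (assumed to be the total on-disk size of the dictionary page, header included, or 0 if there is none) are hypothetical names.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of parquet-mr: models the two footer offsets
// of a single column chunk.
final class ChunkOffsets {
    final OptionalLong dictionaryPageOffset; // left unset when there is no dictionary page
    final long dataPageOffset;               // offset of the first data page

    private ChunkOffsets(OptionalLong dictionaryPageOffset, long dataPageOffset) {
        this.dictionaryPageOffset = dictionaryPageOffset;
        this.dataPageOffset = dataPageOffset;
    }

    // chunkStart: file offset where the column chunk begins.
    // dictionaryPageLength: assumed total size in bytes of the dictionary page
    // (header plus payload), or 0 when the chunk has no dictionary page.
    static ChunkOffsets of(long chunkStart, long dictionaryPageLength) {
        if (dictionaryPageLength < 0) {
            throw new IllegalArgumentException("dictionaryPageLength must be >= 0");
        }
        if (dictionaryPageLength == 0) {
            // No dictionary page: the first data page starts the chunk.
            return new ChunkOffsets(OptionalLong.empty(), chunkStart);
        }
        // Dictionary page first, then the data pages right after it.
        return new ChunkOffsets(OptionalLong.of(chunkStart), chunkStart + dictionaryPageLength);
    }

    // The check a strict reader applies: if dictionary_page_offset is filled,
    // it must be strictly less than data_page_offset.
    boolean passesStrictValidation() {
        return !dictionaryPageOffset.isPresent()
                || dictionaryPageOffset.getAsLong() < dataPageOffset;
    }
}
```

With the current behaviour, the second case would instead produce two equal offsets, which is exactly what the strict check rejects.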