Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Invalid
Description
Currently, the Parquet writer sets file_offset as follows (from metadata.cc):
    if (dictionary_page_offset > 0) {
      column_chunk_->meta_data.__set_dictionary_page_offset(dictionary_page_offset);
      column_chunk_->__set_file_offset(dictionary_page_offset + compressed_size);
    } else {
      column_chunk_->__set_file_offset(data_page_offset + compressed_size);
    }
This doesn't look correct: file_offset shouldn't take compressed_size into account.
The file_offset is used when filtering row groups, so the computation above could cause correctness issues. See SPARK-36696.
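To make the difference concrete, here is a minimal, self-contained sketch of the two computations being compared. It is not the Arrow code: ChunkOffsets, FileOffsetAsWritten and FileOffsetWithoutSize are hypothetical names used only for illustration, and the sketch does not claim which value the Parquet format actually requires.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the writer state (not an Arrow type).
struct ChunkOffsets {
  int64_t dictionary_page_offset;  // 0 when the chunk has no dictionary page
  int64_t data_page_offset;        // offset of the first data page
  int64_t compressed_size;         // total compressed size of the column chunk
};

// Mirrors the computation quoted above: offset of the first page plus
// compressed_size.
int64_t FileOffsetAsWritten(const ChunkOffsets& c) {
  const int64_t first_page = c.dictionary_page_offset > 0
                                 ? c.dictionary_page_offset
                                 : c.data_page_offset;
  return first_page + c.compressed_size;
}

// What this report appears to expect: the same offset without
// compressed_size (an assumption about the intended behaviour).
int64_t FileOffsetWithoutSize(const ChunkOffsets& c) {
  return c.dictionary_page_offset > 0 ? c.dictionary_page_offset
                                      : c.data_page_offset;
}
```

For a chunk whose first page is at offset 4 with compressed_size 500, the first function returns 504 and the second returns 4, so a reader that compares file_offset against a byte range could select different row groups depending on which convention the writer used.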