  1. Parquet
  2. PARQUET-2090

[C++] Parquet writes incorrect file_offset

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Component/s: parquet-cpp

    Description

      Currently the Parquet C++ writer sets file_offset as follows (from metadata.cc):

          if (dictionary_page_offset > 0) {
            column_chunk_->meta_data.__set_dictionary_page_offset(dictionary_page_offset);
            column_chunk_->__set_file_offset(dictionary_page_offset + compressed_size);
          } else {
            column_chunk_->__set_file_offset(data_page_offset + compressed_size);
          }

      This doesn't look correct: file_offset shouldn't depend on compressed_size.

      The file_offset is used when filtering row groups, so the above could cause correctness issues. See SPARK-36696.
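
      To make the concern concrete, here is a minimal sketch of how a reader that treats file_offset as the chunk's starting position could assign a row group to the wrong split. ChunkLayout, StartOffset, WrittenFileOffset and InSplit are hypothetical names for illustration, not parquet-cpp or Spark APIs, and the half-open split check is an assumed filtering rule rather than the one any particular reader uses.

```cpp
#include <cstdint>

// Hypothetical, simplified view of where a column chunk sits in the file.
struct ChunkLayout {
  int64_t dictionary_page_offset;  // 0 when the chunk has no dictionary page
  int64_t data_page_offset;
  int64_t compressed_size;
};

// Offset of the first byte of the chunk: the dictionary page if present,
// otherwise the first data page.
int64_t StartOffset(const ChunkLayout& c) {
  return c.dictionary_page_offset > 0 ? c.dictionary_page_offset
                                      : c.data_page_offset;
}

// Value the writer snippet quoted above stores in file_offset:
// the chunk start plus its compressed size.
int64_t WrittenFileOffset(const ChunkLayout& c) {
  return StartOffset(c) + c.compressed_size;
}

// Assumed filtering rule: a row group belongs to a split when the offset
// falls inside the half-open byte range [split_start, split_end).
bool InSplit(int64_t offset, int64_t split_start, int64_t split_end) {
  return offset >= split_start && offset < split_end;
}
```

      With illustrative values (dictionary page at byte 4, compressed size 100, splits [0, 64) and [64, 128)), the chunk starts in the first split, but the written file_offset of 104 falls in the second, so the two interpretations disagree on which split owns the row group.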

          People

            Assignee: emkornfield Micah Kornfield
            Reporter: csun Chao Sun