  1. Apache Arrow
  2. ARROW-10353

[C++] Parquet decompresses DataPageV2 pages even if is_compressed==0


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: C++

      Description

      According to the parquet-format specification, DataPageV2 pages carry an is_compressed flag. Even if the column chunk has a compression codec set, a page is compressed only if this flag is true (this presumably allows individual pages to stay uncompressed when compression wouldn't save space).

      Here is the relevant excerpt from parquet.thrift describing the semantics of the is_compressed flag in a DataPageV2:

      /** whether the values are compressed.
      Which means the section of the page between
      definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)
      is compressed with the compression_codec.
      If missing it is considered compressed */
      7: optional bool is_compressed = 1;
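
      To make the layout concrete, here is a minimal sketch of how the values section of a DataPageV2 body can be located. In a V2 page the repetition and definition levels are stored first and are never compressed, so only the bytes after them are subject to is_compressed. PageV2Layout, ByteRange and ValuesSection are hypothetical names used for illustration; they are not the generated Thrift types or any Arrow API.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical mirror of the DataPageHeaderV2 fields that matter here.
// This is an illustration, not the Thrift-generated struct.
struct PageV2Layout {
  int32_t compressed_page_size;           // size of the page body as stored
  int32_t repetition_levels_byte_length;  // stored uncompressed
  int32_t definition_levels_byte_length;  // stored uncompressed
  bool is_compressed;                     // spec: treated as true if missing
};

// Half-open byte range [offset, offset + length) within the page body.
struct ByteRange {
  int32_t offset;
  int32_t length;
};

// Returns the range of the values section, i.e. the only part of the page
// body that is compressed when is_compressed is true.
inline ByteRange ValuesSection(const PageV2Layout& page) {
  if (page.repetition_levels_byte_length < 0 ||
      page.definition_levels_byte_length < 0) {
    throw std::invalid_argument("negative levels byte length");
  }
  const int64_t levels =
      static_cast<int64_t>(page.repetition_levels_byte_length) +
      static_cast<int64_t>(page.definition_levels_byte_length);
  if (levels > page.compressed_page_size) {
    throw std::invalid_argument("levels exceed page size");
  }
  const int32_t offset = static_cast<int32_t>(levels);
  return {offset, page.compressed_page_size - offset};
}
```

      For a 100-byte page body with 10 bytes of repetition levels and 20 bytes of definition levels, the values section starts at offset 30 and spans 70 bytes, regardless of whether it is compressed.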

       

      It seems that the Apache Parquet C++ library disregards this flag entirely and decompresses the page in all cases whenever a decompressor is set for the column chunk. (I haven't checked the implementations in other languages, but they might have the same bug.)

      The erroneous code is in column_reader.cc: 

      std::shared_ptr<Page> SerializedPageReader::NextPage() 

      This method first decompresses the page if a decompressor is set, and only afterwards distinguishes whether the page is a DataPageV2 and whether it has the is_compressed flag. Thus, even if the flag is set to 0, the page is decompressed anyway.

      The method that should use the is_compressed flag but doesn't is:

      std::shared_ptr<Buffer> SerializedPageReader::DecompressPage

      This method doesn't look at the is_compressed flag at all.
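
      The behavior the reader should have can be sketched as follows. This is not the code in column_reader.cc and not the eventual fix; PageKind, PageInfo, Codec, NeedsDecompression and MaybeDecompress are hypothetical names, and the codec is modelled as a plain callable so that the sketch stays self-contained. The point is only the order of the checks: the is_compressed flag is consulted before any decompression happens, and for V2 pages only the values section goes through the codec.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<uint8_t>;
// Hypothetical codec interface; an empty std::function means "no codec set".
using Codec = std::function<Bytes(const Bytes&)>;

enum class PageKind { kDataV1, kDataV2, kDictionary };

// Hypothetical summary of the page header fields needed for the decision.
struct PageInfo {
  PageKind kind;
  std::size_t levels_byte_length;  // rep + def level bytes (V2 only)
  bool is_compressed;              // V2 only; true if missing in the header
};

// Decide first, decompress second: the flag is checked before any work.
inline bool NeedsDecompression(const PageInfo& page, bool has_codec) {
  if (!has_codec) return false;
  if (page.kind == PageKind::kDataV2) return page.is_compressed;
  return true;  // other page kinds follow the column chunk's codec
}

inline Bytes MaybeDecompress(const PageInfo& page, const Bytes& raw,
                             const Codec& decompress) {
  if (!NeedsDecompression(page, static_cast<bool>(decompress))) return raw;
  if (page.kind != PageKind::kDataV2) return decompress(raw);
  if (page.levels_byte_length > raw.size()) {
    throw std::invalid_argument("levels exceed page size");
  }
  // V2: the levels prefix is stored uncompressed, so copy it through
  // unchanged and run only the values section through the codec.
  const auto split =
      raw.begin() + static_cast<std::ptrdiff_t>(page.levels_byte_length);
  Bytes out(raw.begin(), split);
  const Bytes values = decompress(Bytes(split, raw.end()));
  out.insert(out.end(), values.begin(), values.end());
  return out;
}
```

      With this ordering, a DataPageV2 whose flag is false is returned untouched even though the column chunk has a codec, which is the case the current code gets wrong.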

       

      The reason this bug probably doesn't show up in any unit test is that the write implementation seems to make the same mistake: it always compresses the page, even if the page's is_compressed flag is set to false.
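
      A toy model shows why symmetric mistakes on both sides hide the problem from round-trip tests. Everything here is hypothetical: ToyCompress and ToyDecompress stand in for a real codec (the toy one doubles the data instead of shrinking it), and BuggyWrite, BuggyRead, SpecRead and SpecWriteUncompressed only model the behavior described above, not the actual Arrow code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Simplified stand-in for a stored DataPageV2 (values section only).
struct StoredPage {
  bool is_compressed;
  Bytes body;
};

// Toy codec: "compression" duplicates every byte, "decompression" keeps
// every second byte. Only the round-trip property matters here.
inline Bytes ToyCompress(const Bytes& in) {
  Bytes out;
  for (uint8_t b : in) {
    out.push_back(b);
    out.push_back(b);
  }
  return out;
}

inline Bytes ToyDecompress(const Bytes& in) {
  Bytes out;
  for (std::size_t i = 0; i + 1 < in.size(); i += 2) out.push_back(in[i]);
  return out;
}

// Models the described writer: flag says "uncompressed", body is compressed.
inline StoredPage BuggyWrite(const Bytes& values) {
  return {false, ToyCompress(values)};
}

// Models the described reader: ignores the flag and always decompresses.
inline Bytes BuggyRead(const StoredPage& page) {
  return ToyDecompress(page.body);
}

// Spec-conforming counterparts for comparison.
inline StoredPage SpecWriteUncompressed(const Bytes& values) {
  return {false, values};
}

inline Bytes SpecRead(const StoredPage& page) {
  return page.is_compressed ? ToyDecompress(page.body) : page.body;
}
```

      In this model BuggyRead(BuggyWrite(v)) returns v, so a test that only round-trips through the same library passes. The mismatch appears only when one side follows the specification: SpecRead(BuggyWrite(v)) and BuggyRead(SpecWriteUncompressed(v)) both return something other than v. This suggests that a regression test needs a page produced by a conforming writer, or a hand-built one.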

       

              People

              • Assignee:
                apitrou Antoine Pitrou
                Reporter:
                jfinis Jan Finis

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m