Apache Arrow / ARROW-8127

[C++] [Parquet] Incorrect column chunk metadata for multipage batch writes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: C++

    Description

      When writing to a buffered column writer using PLAIN encoding, if the size of the batch supplied for writing exceeds the writer's page size, the resulting file has an incorrect data_page_offset in its column chunk metadata. Reading the file then throws an exception, because the file appears too short to the reader.

      For example, the attached code writes a batch of 262145 Int32 values (1048580 bytes = 1048576 + 4) using the default page size of 1048576 bytes (buffered writer, PLAIN encoding). Reading the file back fails with the error: "Tried reading 1048678 bytes starting at position 1048633 from file but only got 333".
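
      The arithmetic in this example can be checked with a small standalone sketch. It assumes only that PLAIN encoding stores each Int32 as 4 bytes; the helper names (PlainInt32Bytes, OverflowBytes, kDefaultPageSizeBytes) are hypothetical and are not part of the Parquet C++ API or of the attached reproduction.

```cpp
#include <cassert>
#include <cstdint>

// Default data page size quoted in the report, in bytes.
constexpr int64_t kDefaultPageSizeBytes = 1048576;

// Bytes needed to PLAIN-encode `num_values` Int32 values (4 bytes each).
// Hypothetical helper, for illustration only.
inline int64_t PlainInt32Bytes(int64_t num_values) {
  assert(num_values >= 0);
  return num_values * static_cast<int64_t>(sizeof(int32_t));
}

// Bytes by which the encoded batch exceeds one page; 0 if it fits.
inline int64_t OverflowBytes(int64_t num_values, int64_t page_size_bytes) {
  const int64_t encoded = PlainInt32Bytes(num_values);
  return encoded > page_size_bytes ? encoded - page_size_bytes : 0;
}
```

      With 262145 values the batch overshoots one page by 4 bytes, so the writer has to emit a second data page; with 262144 values it would fit exactly.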

      The error is caused by the second page write tripping the conditional at https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302 in the serialized in-memory writer wrapped by the buffered writer.

      The fix builds the metadata with offsets from the terminal sink rather than from the in-memory buffered sink. A PR is coming.
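
      The failure mode can be illustrated with a toy model. This is a sketch under a stated assumption, not the code in column_writer.cc or the eventual patch: it assumes the conditional treats a recorded offset of 0 as "not yet set". The function name and the page sizes are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Toy model (ASSUMPTION: offset 0 is used as the "not yet recorded" test).
// Returns the offset that would be recorded for the first data page when
// pages of the given sizes are written to a sink starting at `sink_start`.
inline int64_t RecordedFirstPageOffset(int64_t sink_start,
                                       const std::vector<int64_t>& page_sizes) {
  int64_t pos = sink_start;      // current write position in the sink
  int64_t data_page_offset = 0;  // 0 doubles as "unset" in this model
  for (int64_t size : page_sizes) {
    if (data_page_offset == 0) {
      // Fires again on the second page if the first page started at 0.
      data_page_offset = pos;
    }
    pos += size;
  }
  return data_page_offset;
}
```

      In this model an in-memory sink starts at position 0, so with two pages of 100 and 50 bytes the recorded offset ends up as 100, the start of the second page. Positions taken from the terminal sink are nonzero for data pages, since a Parquet file begins with the 4-byte "PAR1" magic, so the same logic records the first page's position there.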

      Attachments

        1. multipage-batch-write.cc
          3 kB
          TP Boudreau

            People

              Assignee: TP Boudreau (tpboudreau)
              Reporter: TP Boudreau (tpboudreau)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h