  1. Parquet
  2. PARQUET-1716

[C++] Add support for BYTE_STREAM_SPLIT encoding


    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-cpp
    • Flags:
      Patch

      Description

      From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):

      Apache Parquet does not have any encodings suitable for floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.

      It is possible to apply a simple data transformation called "stream splitting". One such transformation is "byte stream splitting", which creates K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence. Byte k of every value is written to stream k, so bytes of the same significance end up next to each other.
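
      The transformation described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the parquet-cpp implementation: the function names ByteStreamSplitEncode and ByteStreamSplitDecode are hypothetical, the K streams are assumed to be stored back to back in one buffer, and bytes are taken in the host's in-memory order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: scatter byte k of every value into stream k.
// Assumed output layout: [stream 0 | stream 1 | ... | stream K-1],
// where each stream holds N bytes (one per input value).
template <typename T>
std::vector<std::uint8_t> ByteStreamSplitEncode(const std::vector<T>& values) {
  constexpr std::size_t K = sizeof(T);   // 4 for float, 8 for double
  const std::size_t N = values.size();   // number of elements
  std::vector<std::uint8_t> out(K * N);
  for (std::size_t i = 0; i < N; ++i) {
    std::uint8_t raw[K];
    std::memcpy(raw, &values[i], K);     // view the value as raw bytes
    for (std::size_t k = 0; k < K; ++k) {
      out[k * N + i] = raw[k];           // byte k goes to stream k, slot i
    }
  }
  return out;
}

// Hypothetical inverse: gather one byte from each stream to rebuild a value.
template <typename T>
std::vector<T> ByteStreamSplitDecode(const std::vector<std::uint8_t>& encoded) {
  constexpr std::size_t K = sizeof(T);
  assert(encoded.size() % K == 0);       // buffer must hold K equal streams
  const std::size_t N = encoded.size() / K;
  std::vector<T> values(N);
  for (std::size_t i = 0; i < N; ++i) {
    std::uint8_t raw[K];
    for (std::size_t k = 0; k < K; ++k) {
      raw[k] = encoded[k * N + i];
    }
    std::memcpy(&values[i], raw, K);
  }
  return values;
}
```

      The transform itself does not shrink the data: the output is exactly K * N bytes. The intent is that the rearranged buffer is then passed to a compressor such as zstd or gzip.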

      On average, the transformed data compresses significantly better than the original data, and in some cases compression and decompression are also faster.

      You can read a more detailed report here:
      https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

      Apache Arrow can benefit from the reduced storage requirements for FP Parquet column data and from the improvements in decompression speed.


              People

              • Assignee: Unassigned
              • Reporter: Martin Radev (martinradev)
              • Votes: 2
              • Watchers: 3

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated: 72h
                  Remaining: 71h 10m
                  Logged: 50m