PARQUET-1622

Add BYTE_STREAM_SPLIT encoding

Details

    • Patch

    Description

      Apache Parquet does not have any encodings suited to floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.

      A simple data transformation called "stream splitting" can help. One variant, "byte stream splitting", creates K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence. The i-th byte of each value goes into the i-th stream.
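
      As a rough illustration of byte stream splitting, here is a minimal Python sketch. It assumes little-endian IEEE 754 values and that the K streams are stored back to back. The function names byte_stream_split and byte_stream_merge are hypothetical and are not part of any Parquet API.

```python
import struct


def byte_stream_split(values, fmt="<f"):
    """Split fixed-width values into K byte streams of length N each.

    fmt is a struct format: "<f" for 4-byte floats, "<d" for 8-byte doubles.
    (Hypothetical helper for illustration only.)
    """
    k = struct.calcsize(fmt)  # K: number of bytes per value
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream i collects byte i of every value; the K streams are concatenated.
    return b"".join(raw[i::k] for i in range(k))


def byte_stream_merge(data, fmt="<f"):
    """Inverse of byte_stream_split: re-interleave the K streams."""
    k = struct.calcsize(fmt)
    n = len(data) // k  # N: number of values
    # Byte i of value j sits at offset i * n + j in the split layout.
    raw = bytes(data[i * n + j] for j in range(n) for i in range(k))
    return [v[0] for v in struct.iter_unpack(fmt, raw)]


values = [1.0, 2.0, 3.5]
encoded = byte_stream_split(values)    # 4 streams of 3 bytes each
decoded = byte_stream_merge(encoded)   # recovers the original values
```

      The split output has the same size as the input. Any gain comes from a later compressor seeing similar bytes (for example, sign and exponent bytes) grouped together.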

      On average, the transformed data compresses significantly better than the original data, and in some cases compression and decompression are also faster.

      You can read a more detailed report here:
      https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

          People

            martinradev Martin Radev
            martinradev Martin Radev
            Votes: 0
            Watchers: 7

              Time Tracking

                Estimated: 48h
                Remaining: 48h
                Logged: Not Specified