Uploaded image for project: 'Parquet'
  1. Parquet
  2. PARQUET-2414

[Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

    XMLWordPrintableJSON

Details

    Description

      In PARQUET-1622 we added the BYTE_STREAM_SPLIT encoding which, while simple to implement, allows to significantly improve compression efficiency on FLOAT and DOUBLE columns.

      In PARQUET-758 we added the FLOAT16 logical type which annotates a 2-byte-wide FIXED_LEN_BYTE_ARRAY column to denote that it contains 16-bit IEEE binary floating-point (colloquially called "half float").

      This issue proposes to widen the types supported by the BYTE_STREAM_SPLIT encoding. By allowing the BYTE_STREAM_SPLIT encoding on any FIXED_LEN_BYTE_ARRAY column, we can automatically improve compression efficiency on various column types including:

      • half-float data
      • fixed-width decimal data

      Also, by allowing the BYTE_STREAM_SPLIT encoding on any INT32 or INT64 column, we can improve compression efficiency on further column types such as timestamps.

      I've run compression measurements on various pieces of sample data which I detail below.

      Float16 data

      I've downloaded the sample datasets from
      https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPsingle/ , uncompressed them and converted them to half-float using NumPy. Two files had to be discarded because of overflow when converting to half-float.

      I've then run three different compression algorithms (lz4, zstd, snappy), optionally preceded by a BYTE_STREAM_SPLIT encoding with 2 streams (corresponding to the byte width of the FLBA columns. Here are the results:

      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | name           |   uncompressed |    lz4 |   bss_lz4 |   snappy |   bss_snappy |   zstd |   bss_zstd |   bss_ratio_lz4 |   bss_ratio_snappy |   bss_ratio_zstd |
      +================+================+========+===========+==========+==============+========+============+=================+====================+==================+
      | msg_sp.sp      |    72526464.00 |   1.42 |      1.94 |     1.38 |         1.78 |   2.28 |       2.71 |            1.37 |               1.30 |             1.18 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | msg_sppm.sp    |    69748966.00 |  18.90 |     29.05 |    11.38 |        14.39 |  45.81 |      71.49 |            1.54 |               1.26 |             1.56 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | msg_sweep3d.sp |    31432806.00 |   2.06 |      3.20 |     1.03 |         1.94 |  11.77 |      17.00 |            1.55 |               1.89 |             1.44 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_brain.sp   |    35460000.00 |   1.02 |      1.51 |     1.01 |         1.49 |   1.26 |       1.81 |            1.49 |               1.48 |             1.44 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_comet.sp   |    26836992.00 |   1.45 |      1.74 |     1.42 |         1.69 |   1.64 |       2.07 |            1.20 |               1.19 |             1.26 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_control.sp |    39876186.00 |   1.35 |      1.49 |     1.37 |         1.53 |   1.70 |       1.93 |            1.11 |               1.12 |             1.14 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_plasma.sp  |     8772400.00 | 123.88 |    152.12 |     1.00 |         1.80 | 259.58 |     405.96 |            1.23 |               1.80 |             1.56 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_error.sp   |    15540204.00 |   1.05 |      1.51 |     1.02 |         1.46 |   2.06 |       3.55 |            1.44 |               1.43 |             1.72 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_info.sp    |     4732632.00 |   1.08 |      1.74 |     1.00 |         1.61 |   2.60 |       3.63 |            1.62 |               1.61 |             1.40 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_spitzer.sp |    49545216.00 |   1.00 |      1.01 |     1.00 |         1.01 |   1.22 |       1.35 |            1.01 |               1.01 |             1.11 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_temp.sp    |     9983568.00 |   1.00 |      1.00 |     1.00 |         1.00 |   1.08 |       1.17 |            1.00 |               1.00 |             1.08 |
      +----------------+----------------+--------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      

      Explanation:

      • the columns "lz4", "snappy", "zstd" show the compression ratio achieved with the respective compressors (i.e. uncompressed size divided by compressed size)
      • the columns "bss_lz4", "bss_snappy", "bss_zstd" are similar, but with a BYTE_STREAM_SPLIT encoding applied first
      • the columns "bss_ratio_lz4", "bss_ratio_snappy", "bss_ratio_zstd" show the additional compression ratio achieved by prepending the BYTE_STREAM_SPLIT encoding step (i.e. PLAIN-encoded compressed size divided by BYTE_STREAM_SPLIT-encoded compressed size).

      (reference) Float32 data

      For reference, here are the measurements for the original single-precision floating-point data.

      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | name           |   uncompressed |   lz4 |   bss_lz4 |   snappy |   bss_snappy |   zstd |   bss_zstd |   bss_ratio_lz4 |   bss_ratio_snappy |   bss_ratio_zstd |
      +================+================+=======+===========+==========+==============+========+============+=================+====================+==================+
      | msg_sp.sp      |   145052928.00 |  1.00 |      1.45 |     1.00 |         1.39 |   1.12 |       1.66 |            1.46 |               1.39 |             1.48 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | msg_sppm.sp    |   139497932.00 |  8.56 |      8.66 |     5.64 |         5.90 |  12.51 |      11.16 |            1.01 |               1.05 |             0.89 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | msg_sweep3d.sp |    62865612.00 |  1.01 |      2.80 |     1.02 |         1.68 |   5.50 |       9.41 |            2.76 |               1.66 |             1.71 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_brain.sp   |    70920000.00 |  1.00 |      1.31 |     1.00 |         1.30 |   1.13 |       1.43 |            1.31 |               1.30 |             1.27 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_comet.sp   |    53673984.00 |  1.08 |      1.27 |     1.08 |         1.27 |   1.15 |       1.36 |            1.17 |               1.18 |             1.18 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_control.sp |    79752372.00 |  1.01 |      1.12 |     1.01 |         1.13 |   1.08 |       1.21 |            1.11 |               1.12 |             1.12 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | num_plasma.sp  |    17544800.00 |  1.00 |    140.74 |     1.01 |         1.30 | 279.49 |     310.68 |          141.29 |               1.30 |             1.11 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_error.sp   |    31080408.00 |  1.12 |      1.37 |     1.16 |         1.29 |   1.73 |       3.10 |            1.22 |               1.11 |             1.80 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_info.sp    |     9465264.00 |  1.07 |      1.42 |     1.00 |         1.29 |   2.25 |       3.04 |            1.33 |               1.29 |             1.35 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_spitzer.sp |    99090432.00 |  1.02 |      1.11 |     1.01 |         1.12 |   1.20 |       1.31 |            1.09 |               1.10 |             1.09 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | obs_temp.sp    |    19967136.00 |  1.00 |      1.12 |     1.00 |         1.13 |   1.08 |       1.19 |            1.12 |               1.13 |             1.10 |
      +----------------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      

      Comments

      The additional efficiency of the BYTE_STREAM_SPLIT encoding step is very significant on most files (except obs_temp.sp which generally doesn't compress at all), with additional gains usually around 30%.

      The BYTE_STREAM_SPLIT encoding is, perhaps surprisingly, on average as beneficial on Float16 data as it is on Float32 data.

      Decimal data from OpenStreetMap changesets

      I've downloaded one of the recent OSM changesets file changesets-231030.orc, and loaded the four decimal columns from the first stripe of that file. Those columns look like:

      pyarrow.RecordBatch
      min_lat: decimal128(9, 7)
      max_lat: decimal128(9, 7)
      min_lon: decimal128(10, 7)
      max_lon: decimal128(10, 7)
      ----
      min_lat: [51.5288506,51.0025063,51.5326805,51.5248871,51.5266800,51.5261841,51.5264130,51.5238914,59.9463692,59.9513092,...,50.8238277,52.1707376,44.2701598,53.1589748,43.5988333,37.7867167,45.5448822,null,50.7998334,50.5653478]
      max_lat: [51.5288620,51.0047760,51.5333176,51.5289383,51.5291901,51.5300598,51.5264130,51.5238914,59.9525642,59.9561501,...,50.8480772,52.1714300,44.3790161,53.1616817,43.6001496,37.7867913,45.5532716,null,51.0188961,50.5691352]
      min_lon: [-0.1465242,-1.0052705,-0.1566335,-0.1485492,-0.1418076,-0.1550623,-0.1539768,-0.1432930,10.7782278,10.7719727,...,10.6863813,13.2218676,19.8840738,8.9128186,1.4030591,-122.4212761,18.6789571,null,-4.2085209,8.6851671]
      max_lon: [-0.1464925,-0.9943439,-0.1541054,-0.1413791,-0.1411505,-0.1453212,-0.1539768,-0.1432930,10.7898550,10.7994537,...,10.7393494,13.2298706,20.2262343,8.9183611,1.4159345,-122.4212503,18.6961594,null,-4.0496079,8.6879264]
      

      Here are the compression measurements using the same methodology as above. The number of BYTE_STREAM_SPLIT streams is the respective byte width of each FLBA column (i.e., 4 for latitudes and 5 for longitudes).

      +---------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | name    |   uncompressed |   lz4 |   bss_lz4 |   snappy |   bss_snappy |   zstd |   bss_zstd |   bss_ratio_lz4 |   bss_ratio_snappy |   bss_ratio_zstd |
      +=========+================+=======+===========+==========+==============+========+============+=================+====================+==================+
      | min_lat |     4996652.00 |  1.00 |      1.01 |     1.00 |         1.03 |   1.05 |       1.12 |            1.01 |               1.03 |             1.07 |
      +---------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | max_lat |     4996652.00 |  1.00 |      1.01 |     1.00 |         1.03 |   1.05 |       1.13 |            1.01 |               1.03 |             1.07 |
      +---------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | min_lon |     6245825.00 |  1.00 |      1.14 |     1.00 |         1.16 |   1.15 |       1.31 |            1.14 |               1.16 |             1.14 |
      +---------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | max_lon |     6245825.00 |  1.00 |      1.14 |     1.00 |         1.16 |   1.15 |       1.31 |            1.14 |               1.16 |             1.14 |
      +---------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      

      Comments

      On this dataset, compression efficiency is generally quite poor and BYTE_STREAM_SPLIT encoding brings almost no additional efficiency to the table. It can be assumed that OSM changeset entries have geographical coordinates all over the place (literally!) and therefore do not offer many opportunities for compression.

      Decimal data from an OpenStreetMap region

      I've chosen a small region of the world (Belgium) whose geographical coordinates presumably allow for better compression by being much more clustered. The file belgium-latest.osm.pbf was converted to ORC for easier handling, resulting in a 745 MB ORC file.

      I've then loaded the decimal columns from the first stripe in that file:

      pyarrow.RecordBatch
      lat: decimal128(9, 7)
      lon: decimal128(10, 7)
      ----
      lat: [50.4443865,50.4469017,50.4487890,50.4499558,50.4523446,50.4536530,50.4571053,50.4601436,50.4631197,50.4678563,...,51.1055899,51.1106197,51.1049620,51.1047010,51.1104755,51.0997955,51.1058101,51.1010664,51.1014336,51.1055106]
      lon: [3.6857362,3.6965046,3.7074481,3.7173626,3.8126033,3.9033178,3.9193678,3.9253319,3.9292409,3.9332670,...,4.6663214,4.6699997,4.6720536,4.6655159,4.6666372,4.6680394,4.6747172,4.6684242,4.6713693,4.6644899]
      

      Here are the compression measurements for these columns. As in the previous dataset, the number of BYTE_STREAM_SPLIT streams is the respective byte width of each FLBA column (i.e., 4 for latitudes and 5 for longitudes).

      +--------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | name   |   uncompressed |   lz4 |   bss_lz4 |   snappy |   bss_snappy |   zstd |   bss_zstd |   bss_ratio_lz4 |   bss_ratio_snappy |   bss_ratio_zstd |
      +========+================+=======+===========+==========+==============+========+============+=================+====================+==================+
      | lat    |    12103680.00 |  1.00 |      1.63 |     1.00 |         1.63 |   1.18 |       1.73 |            1.63 |               1.63 |             1.47 |
      +--------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      | lon    |    15129600.00 |  1.00 |      1.93 |     1.00 |         1.90 |   1.27 |       2.06 |            1.93 |               1.90 |             1.62 |
      +--------+----------------+-------+-----------+----------+--------------+--------+------------+-----------------+--------------------+------------------+
      

      Comments

      This dataset shows that a BYTE_STREAM_SPLIT encoding before compression achieves a very significant additional efficiency compared to compression alone.

      Integer data from two OpenStreetMap data files

      I also tried to evaluate the efficiency of BYTE_STREAM_SPLIT on integer columns (INT32 or INT64). Here, however, another efficient encoding is already available (DELTA_BINARY_PACKED). So the evaluation focussed on comparing BYTE_STREAM_SPLIT + compression against DELTA_BINARY_PACKED alone.

      The comparison was done on the two same OpenStreetMap files as above, using only the first stripe. Here are the measurement results in table format:

      *Caution*: the DELTA_BINARY_PACKED length measurement did not use a real encoder implementation, but a length estimation function written in pure Python. The estimation function should be accurate according to quick tests.

      Comments

      The results are very heterogeneous, depending on the kind of data those integer columns represent.

      Some columns achieve very good compression ratios, far above 10x, with all methods; for these columns, it does not make sense to compare the compression ratios, since the column sizes will be very small in all cases; performance and interoperability should be the only concerns.

      On other columns, the compression ratios are more moderate and BYTE_STREAM_SPLIT + compression seems to be preferable to DELTA_BINARY_PACKED.

      Integer data from a PyPI archive file

      I downloaded one of the "index" Parquet files from https://github.com/pypi-data/data/releases and read the first row group.
      The measurement results are as follows:

      Comments

      On this data, BYTE_STREAM_SPLIT + compression is clearly better than DELTA_BINARY_PACKED. The timestamp column ("uploaded_on") in particular shows very strong benefits.

      Integer data from a NYC "yellow" taxi file

      I downloaded one of the "yellow" taxi trip records from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and read the first row group. The measurement results are as follows:

      Comments

      These results are a bit of a mixed bag. Only BYTE_STREAM_SPLIT + zstd is consistenly superior to DELTA_BINARY_PACKED. However, if one focusses on the timestamp columns, then all three general-purpose compressors provide a benefit.

      Discussion

      When reading these results, it is important to remind that the exact compression ratios do not necessarily matter, as long as the efficiency is high enough. A compressor that achieves 100x compression on a column is not necessarily worse than one that achieves 300x compression on the same column: both are "good enough" on this particular data. On the contrary, when compression ratios are moderate (lower than 10x), they should certainly be compared.

      Efficiency

      Efficiency on FIXED_LEN_BYTE_ARRAY data

      These examples show that extending the BYTE_STREAM_SPLIT encoding to FIXED_LEN_BYTE_ARRAY columns (even regardless of their logical types) can yield very significant compression efficiency improvements on two specific types of FIXED_LEN_BYTE_ARRAY data: FLOAT16 data and DECIMAL data.

      Efficiency on INT32 / INT64 data

      Extending the BYTE_STREAM_SPLIT encoding to INT32 and INT64 columns can bring significant benefits over DELTA_BINARY_PACKED. However, whether and by how much depends on the kind of data that is encoded as integers. Timestamps seem to always benefit from BYTE_STREAM_SPLIT encoding. Pairing BYTE_STREAM_SPLIT with zstd also generally achieves higher efficiency than DELTA_BINARY_PACKED.

      Whether to choose BYTE_STREAM_SPLIT + compression over DELTA_BINARY_PACKED will in practice have to be informed by several factors, such as performance expectations and interoperability. Sophisticated writers might also implement some form of sampling to find out the best encoding + compression combination for a given column.

      *Note*: all tested data above is actually INT64. However, given the mechanics of BYTE_STREAM_SPLIT and DELTA_BINARY_PACKED, we can assume that similar results would have been obtained for INT32 data.

      Performance

      Since BYTE_STREAM_SPLIT only brings benefits in combination with compression, the overall encoding + compression cost should be considered.

      Performance on FIXED_LEN_BYTE_ARRAY data

      The choice is between BYTE_STREAM_SPLIT + compression vs. compression alone. Even a non-SIMD optimized version of BYTE_STREAM_SPLIT, such as in Parquet C++, can achieve multiple GB/s; there is little reason to pay the cost of compression but refuse to pay the much smaller cost of the BYTE_STREAM_SPLIT encoding step.

      Performance on INT32 / INT64 data

      The choice is between BYTE_STREAM_SPLIT + compression vs. DELTA_BINARY_PACKED alone. DELTA_BINARY_PACKED has a significant performance edge. The current Parquet C++ implementation of DELTA_BINARY_PACKED encodes between 600 MB/s and 2 GB/s, and decodes between 3 and 6 GB/s. This is faster than any of the general-purpose compression schemes available in Parquet, even lz4.

      Implementation complexity

      BYTE_STREAM_SPLIT, even byte width-agnostic, is almost trivial to implement. A simple implementation can yield good performance with a minimum of work.

      For example, the non-SIMD-optimized BYTE_STREAM_SPLIT encoding and decoding routines in Parquet C++ amount to a mere total of ~200 lines of code, despite explicitly-unrolled loops:
      https://github.com/apache/arrow/blob/4e58f7ca0016c2b2d8a859a0c5965df3b15523e0/cpp/src/arrow/util/byte_stream_split_internal.h#L593-L702

      Attachments

        1. bss_osm_belgium.png
          15 kB
          Antoine Pitrou
        2. bss_osm_changesets.png
          22 kB
          Antoine Pitrou
        3. bss_fp16.png
          51 kB
          Antoine Pitrou
        4. bss_ints_osm_changesets.png
          23 kB
          Antoine Pitrou
        5. bss_ints_nyctaxi.png
          33 kB
          Antoine Pitrou
        6. bss_ints_pypi.png
          19 kB
          Antoine Pitrou
        7. bss_ints_osm_belgium.png
          26 kB
          Antoine Pitrou

        Issue Links

          Activity

            People

              apitrou Antoine Pitrou
              apitrou Antoine Pitrou
              Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: