  1. Spark
  2. SPARK-34651 Improve ZSTD support
  3. SPARK-33978

Support ZSTD compression in ORC data source

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels:
      None

      Description

      What changes were proposed in this pull request?

      This PR adds support for ZSTD compression in the ORC data source.

      Why are the changes needed?

      Apache ORC 1.6 supports ZSTD compression, which produces more compact files and reduces storage cost.

      BEFORE

      scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
      java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.

      AFTER

      scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")

      $ orc-tools meta /tmp/zstd
      Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
      Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
      File Version: 0.12 with ORC_14
      Rows: 1
      Compression: ZSTD
      Compression size: 262144
      Calendar: Julian/Gregorian
      Type: struct<id:bigint>

      Stripe Statistics:
        Stripe 1:
          Column 0: count: 1 hasNull: false
          Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

      File Statistics:
        Column 0: count: 1 hasNull: false
        Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

      Stripes:
        Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
          Stream: column 0 section ROW_INDEX start: 3 length 11
          Stream: column 1 section ROW_INDEX start: 14 length 24
          Stream: column 1 section DATA start: 38 length 6
          Encoding column 0: DIRECT
          Encoding column 1: DIRECT_V2

      File length: 230 bytes
      Padding length: 0 bytes
      Padding ratio: 0%

      User Metadata:
        org.apache.spark.version=3.2.0
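
      The spark-shell session above can also be written as a small standalone program. The following is a minimal sketch rather than part of the patch: it assumes Spark 3.2.0 or later on the classpath and a local master, and the object name and output paths are hypothetical. It additionally sets the session-level `spark.sql.orc.compression.codec` setting, which is not mentioned in this issue but is, to the best of my knowledge, the standard Spark SQL configuration for the default ORC codec.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the name and paths are illustrative only.
object ZstdOrcExample {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumes Spark 3.2.0+).
    val spark = SparkSession.builder()
      .appName("zstd-orc-example")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical output locations.
    val perWritePath = "/tmp/zstd_per_write"
    val sessionPath  = "/tmp/zstd_session_default"

    // 1. Per-write option, as in the AFTER transcript.
    spark.range(10)
      .write
      .mode("overwrite")
      .option("compression", "zstd")
      .orc(perWritePath)

    // 2. Session-wide default codec for ORC writes (assumed config key).
    spark.conf.set("spark.sql.orc.compression.codec", "zstd")
    spark.range(10)
      .write
      .mode("overwrite")
      .orc(sessionPath)

    // Read both outputs back; decompression is transparent to the reader.
    val perWriteRows = spark.read.orc(perWritePath).count()
    val sessionRows  = spark.read.orc(sessionPath).count()
    println(s"per-write rows: $perWriteRows, session-default rows: $sessionRows")

    spark.stop()
  }
}
```

      Running `orc-tools meta` on either output directory, as in the transcript above, is one way to confirm that the files report `Compression: ZSTD`.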

       

            People

            • Assignee: Dongjoon Hyun (dongjoon)
            • Reporter: Dongjoon Hyun (dongjoon)