  1. ORC
  2. ORC-817

Replace aircompressor ZStandard compression with zstd-jni

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.0
    • Fix Version/s: None
    • Component/s: Java
    • Labels: None

    Description

      This issue tracks the replacement of the aircompressor dependency for ZStandard compression with zstd-jni.

      ORC's Java ZStandard compression codec currently uses the aircompressor dependency. That implementation is pure Java, which avoids a native (JNI) dependency, but over time it has become less ideal:

      • Several other projects in the big data processing ecosystem, such as Spark, Parquet, and Avro, rely on zstd-jni, a Java Native Interface wrapper over the reference zstd C library. Relying on the same dependency as these projects will let us pick up the same improvements and stay on the ZStandard implementation the community has converged on.
      • ORC C++ uses the zstd library directly, while ORC Java relies on aircompressor. Since these implementations do not have feature parity, it is theoretically possible to modify ORC C++ to produce a file that ORC Java cannot read. Maintaining compatibility between C++ and Java ORC means restricting the available features to those supported by both, which is limiting when relying on aircompressor. It is also conceivable that unintended incompatibilities between the implementations could arise silently.
      • aircompressor implements a very limited set of ZStandard compression modes. https://github.com/airlift/aircompressor/blob/495bae80ac7487d2efa1bba437d04e8a2a42bb7b/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L143 shows that only the DoubleFastBlockCompressor strategy of ZStandard (out of the eight possible strategies) is actually implemented. This is a fast, lower-compression-ratio strategy, so it suits things like shuffle data, but the slower, higher-compression-ratio levels, which could be used to store "write-once-read-many" or backup data in ORC, aren't possible with aircompressor.
      • aircompressor currently has a bug, originally discovered in the Presto community, that prevents ORC from upgrading to the most recent aircompressor version without introducing the same bug into ORC: https://github.com/airlift/aircompressor/issues/122 Moving ORC to zstd-jni could let Presto move to zstd-jni as well.
      • Besides bug and performance fixes, zstd-jni supports newer functionality, such as --long mode, that aircompressor doesn't. This mode uses longer distance windows to achieve materially higher compression ratios at the same speeds as earlier ZStandard versions, and has been available for more than two years: https://github.com/facebook/zstd/releases/tag/v1.3.2
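
      To make the window-size and compatibility points above concrete, here is a minimal sketch in plain Java. It calls neither zstd-jni nor aircompressor; it only does the arithmetic. The class and method names (LongModeWindow, windowBytes, readerAccepts) are hypothetical and exist only for illustration. The value 27 is the window log the zstd command-line tool uses when --long is given without an argument; the reader limit of 23 in main is an arbitrary assumption, not a measured property of any ORC reader.

```java
// Sketch only: plain arithmetic, no zstd-jni or aircompressor calls.
final class LongModeWindow {
    // Window log the zstd CLI uses when --long is passed without a value.
    static final int CLI_LONG_DEFAULT_WINDOW_LOG = 27;

    // A zstd window with log size n spans 2^n bytes.
    static long windowBytes(int windowLog) {
        if (windowLog < 0 || windowLog > 62) {
            throw new IllegalArgumentException("windowLog out of range: " + windowLog);
        }
        return 1L << windowLog;
    }

    // Hypothetical helper: a reader can decode a frame only if it is willing
    // to buffer a window at least as large as the one the writer used.
    static boolean readerAccepts(int frameWindowLog, int readerMaxWindowLog) {
        return frameWindowLog <= readerMaxWindowLog;
    }

    public static void main(String[] args) {
        long bytes = windowBytes(CLI_LONG_DEFAULT_WINDOW_LOG);
        System.out.println("--long window: " + (bytes >> 20) + " MiB");
        // The reader limit of 23 is an illustrative assumption.
        System.out.println("reader capped at 2^23 accepts it: "
            + readerAccepts(CLI_LONG_DEFAULT_WINDOW_LOG, 23));
    }
}
```

      The second check is the kind of silent mismatch the C++/Java bullet worries about: a writer that enables a larger window than the reader supports produces data the reader has to reject.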

          People

            dchristle David Christle
            jcamacho Jesús Camacho Rodríguez