Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-17096

Fix ZStandardCompressor input buffer offset



    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.2.1
    • 3.2.2, 3.3.1, 3.4.0
    • io


      A bug in index handling causes ZStandardCompressor.c to pass a malformed ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" that gets thrown. The crux of the issue is two variables, uncompressedDirectBufLen and uncompressedDirectBufOff. The hadoop code counts uncompressedDirectBufOff from the start of uncompressedDirectBuf, then uncompressedDirectBufLen is counted from uncompressedDirectBufOff. However, libzstd considers pos and size to both be counted from the start of the buffer. As a result, this line https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228 causes a malformed buffer to be passed to libzstd, where pos>size. Here's a longer description of the bug in case this abstract explanation is unclear:

      Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with five bytes of input. This results in uncompressedDirectBufOff=0 and uncompressedDirectBufLen=5 (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).

      Then we call compress(), which initializes a ZSTD_inBuffer (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196). The definition of those libzstd structs is here https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 - note that we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd returns from the compression function, it updates the ZSTD_inBuffer struct to indicate how many bytes were consumed (https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920). Note that pos is advanced, but size is unchanged.

      Now, libzstd does not guarantee that the entire input will be compressed in a single call of the compression function. (Some of the compression libraries used by hadoop, such as snappy, do provide this guarantee, but libzstd is not one of them.) So the hadoop native code updates uncompressedDirectBufOff and uncompressedDirectBufLen using the updated ZSTD_inBuffer: https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228

      Now, returning to our example, we started with 5 bytes of uncompressed input. Suppose libzstd compressed 4 of those bytes, leaving one unread. This would result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four bytes were consumed). The hadoop native code would then set uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1 (five minus four equals one).

      Since some of the input was not consumed, we will eventually call compress() again. Then we instantiate another ZSTD_inBuffer struct, this time with size=1 and pos=4. This is a bug - libzstd expects size and pos to both be counted from the start of the buffer, therefore pos>size is unsound. So it returns an error https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932 which gets escalated as a java.lang.InternalError.

      I will be submitting a pull request on github with a fix for this bug. The key is that the hadoop code should handle offsets the same way libzstd does, ie uncompressedDirectBufLen should be counted from the start of uncompressedDirectBuf, not from uncompressedDirectBufOff.


        1. fuzztest.patch
          2 kB
          Stephen Jung (Stripe)

        Issue Links



              sjung-stripe Stephen Jung (Stripe)
              sjung-stripe Stephen Jung (Stripe)
              0 Vote for this issue
              4 Start watching this issue



                Time Tracking

                  Original Estimate - Not Specified
                  Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 10m