Parquet / PARQUET-2160

Close decompression stream to free off-heap memory in time

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.12.3
    • Fix Version/s: 1.13.0
    • Component/s: parquet-format
    • Labels: None
    • Environment: Spark 3.1.2 + Iceberg 0.12 + Parquet 1.12.3 + zstd-jni 1.4.9.1 + glibc

    Description

      The decompression stream created in HeapBytesDecompressor$decompress is never closed explicitly; it currently relies on the JVM GC to close it. When reading zstd-compressed Parquet files, I sometimes ran into OOM caused by high off-heap memory usage. I think the reason is that GC does not run often enough to release the native buffers promptly, which leads to off-heap memory fragmentation. As a workaround I had to set a lower MALLOC_TRIM_THRESHOLD_ so that glibc returns memory to the system sooner. There is a [thread|https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1650928750269869?thread_ts=1650927062.590789&cid=C025PH0G1D4] about this zstd Parquet issue in the Iceberg community Slack, where other people reported the same problem.

      I think we can solve this by copying the decompressed bytes into a ByteArrayBytesInput and closing the decompression stream right away:

      InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
      decompressed = BytesInput.from(is, uncompressedSize); 

      ->

      InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
      decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
      is.close(); 

      After making this change to decompress, I found that off-heap memory usage was significantly reduced (same query on the same data).
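
      The same pattern can be sketched outside Parquet. The example below is only an illustration, not the actual patch: it assumes java.util.zip's InflaterInputStream as a stand-in for the codec stream (zstd-jni and Parquet's BytesInput are not used), and the class and method names are hypothetical. It copies the expected number of bytes onto the heap and closes the stream immediately, so the native resources behind it are released without waiting for GC.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical stand-in for the pattern proposed in this issue; not Parquet code.
final class EagerDecompressSketch {

  // Reads exactly uncompressedSize bytes into a heap array, then closes the
  // stream so its native (off-heap) state is freed right away instead of at GC time.
  static byte[] decompressEagerly(byte[] compressed, int uncompressedSize) {
    // Assumption: InflaterInputStream plays the role of codec.createInputStream(...).
    // try-with-resources closes the stream even if reading fails.
    try (InputStream is = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      byte[] out = new byte[uncompressedSize];
      int off = 0;
      while (off < uncompressedSize) {
        int n = is.read(out, off, uncompressedSize - off);
        if (n < 0) {
          throw new IOException(
              "stream ended after " + off + " of " + uncompressedSize + " bytes");
        }
        off += n;
      }
      return out;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Helper for the demo: produces zlib-compressed input.
  static byte[] deflate(byte[] raw) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(buf)) {
      dos.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) {
    byte[] raw = "parquet page bytes, parquet page bytes".getBytes(StandardCharsets.UTF_8);
    byte[] roundTrip = decompressEagerly(deflate(raw), raw.length);
    System.out.println(Arrays.equals(raw, roundTrip));
  }
}
```

      Using try-with-resources rather than a trailing is.close() means the stream is also closed when the read throws, which a plain close() call after the copy would not guarantee.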

          People

            Assignee: Unassigned
            Reporter: Yujiang Zhong (zhongyuj)