Parquet / PARQUET-2429

Direct buffer churn in NonBlockedDecompressor

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: None
    • Labels: None

    Description

      Input buffers for NonBlockedDecompressor (and NonBlockedCompressor) are grown one chunk at a time as the class receives successive setInput calls. When decompressing a 64 MB block with a 4 KB chunk size, this leads to thousands of allocations and deallocations totaling gigabytes of memory. This can be avoided by doubling the buffer capacity each time it must grow, rather than adding only the minimum amount of new space.
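
The difference between the two growth strategies can be illustrated with a small sketch. This is not the code from the linked PR: the class name BufferGrowthSketch and the helpers ensureRemaining and simulate are hypothetical, and the only real API used is java.nio.ByteBuffer. It assumes that every growth step allocates a fresh direct buffer and copies the old contents into it.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; names are not taken from parquet-mr.
final class BufferGrowthSketch {

    /**
     * Returns a buffer with room for `extra` more bytes. If the current
     * buffer is too small, allocates a new direct buffer with at least
     * double the capacity and copies the bytes written so far.
     */
    static ByteBuffer ensureRemaining(ByteBuffer buf, int extra) {
        if (buf.remaining() >= extra) {
            return buf; // enough room already, no allocation
        }
        int needed = buf.position() + extra;
        int newCapacity = Math.max(needed, buf.capacity() * 2);
        ByteBuffer grown = ByteBuffer.allocateDirect(newCapacity);
        buf.flip();     // switch the old buffer to read mode
        grown.put(buf); // copy existing contents
        return grown;
    }

    /**
     * Simulates appending `chunks` chunks of `chunkSize` bytes without
     * touching real memory. Returns {allocations, total bytes allocated}.
     */
    static long[] simulate(int chunks, int chunkSize, boolean doubling) {
        long allocations = 0;
        long bytesAllocated = 0;
        long capacity = 0;
        long used = 0;
        for (int i = 0; i < chunks; i++) {
            long needed = used + chunkSize;
            if (needed > capacity) {
                // Exact-fit growth adds just enough space; doubling grows geometrically.
                capacity = doubling ? Math.max(needed, capacity * 2) : needed;
                allocations++;
                bytesAllocated += capacity;
            }
            used = needed;
        }
        return new long[] {allocations, bytesAllocated};
    }

    public static void main(String[] args) {
        int chunkSize = 4 * 1024;                    // 4 KB chunks
        int chunks = (64 * 1024 * 1024) / chunkSize; // 64 MB block
        long[] exact = simulate(chunks, chunkSize, false);
        long[] doubled = simulate(chunks, chunkSize, true);
        System.out.println("exact-fit: " + exact[0] + " allocations, " + exact[1] + " bytes");
        System.out.println("doubling:  " + doubled[0] + " allocations, " + doubled[1] + " bytes");
    }
}
```

Under these assumptions, exact-fit growth reallocates on every chunk, so the cumulative bytes allocated grow quadratically with the number of chunks. Doubling needs only a logarithmic number of reallocations, and the cumulative allocation stays within a small constant factor of the final buffer size.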

      In a practical scenario I ran into, the time taken to read a 140MB Parquet file was reduced from 35s to <2s.

      PR: https://github.com/apache/parquet-mr/pull/1270

            People

              Assignee: Gian Merlino
              Reporter: Gian Merlino