  1. Apache NiFi
  2. NIFI-11636

ParquetReader buffers up to 2 GB of content into heap unnecessarily

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-M1, 1.22.0
    • Component/s: Extensions
    • Labels: None

    Description

      The Parquet Record Reader uses NiFiSeekableInputStream. Because a Parquet file must be read starting with its footer, this class is intended to use mark/reset: it reads the footer and then resets back to the beginning of the stream.
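
      As background on why the footer has to be read first, here is a minimal sketch of how the tail of an unencrypted Parquet file is laid out: the footer metadata is followed by a 4-byte little-endian footer length and the 4-byte magic "PAR1". The class and method names are hypothetical and only for illustration; this is not how NiFi or parquet-mr implement it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class ParquetTailSketch {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns the footer length stored in the last 8 bytes of the file. */
    static int footerLength(byte[] file) {
        // Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
        if (file.length < 12) {
            throw new IllegalArgumentException("Too short to be a Parquet file");
        }
        byte[] tailMagic = Arrays.copyOfRange(file, file.length - 4, file.length);
        if (!Arrays.equals(tailMagic, MAGIC)) {
            throw new IllegalArgumentException("Missing trailing PAR1 magic");
        }
        // The 4 bytes just before the trailing magic hold the footer length, little-endian.
        return ByteBuffer.wrap(file, file.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }
}
```

      Because the footer offset is only known after reading the end of the content, a reader has to jump to the tail and then back to the start, which is why a seekable or resettable stream is needed.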

      To achieve this, it calls InputStream.mark(Integer.MAX_VALUE), which will buffer up to 2 GB onto the heap. However, the underlying InputStream is the ContentClaimInputStream, which has logic built in to allow resetting without buffering content into memory. In particular, if you read past the limit provided to mark and then call reset, it will close the InputStream, open a new InputStream from the beginning of the FlowFile content, and seek to the desired offset.

      Because of this, we don't need to use InputStream.mark(Integer.MAX_VALUE) and can instead use InputStream.mark(8192) or a similarly small value.
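
      The reopen-on-reset behavior described above can be sketched roughly as follows. ReopeningInputStream is a hypothetical stand-in, not NiFi's ContentClaimInputStream: for brevity it always reopens on reset and never buffers, whereas the issue describes reopening only once the read limit has been exceeded. The Supplier that reopens the content is also an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

// Hypothetical illustration; not the actual NiFi class.
class ReopeningInputStream extends InputStream {
    private final Supplier<InputStream> opener; // opens the content again from offset 0
    private InputStream delegate;
    private long position;     // bytes consumed so far
    private long markPosition; // offset remembered by mark()
    int reopenCount;           // exposed only so the demo can observe reopen calls

    ReopeningInputStream(Supplier<InputStream> opener) {
        this.opener = opener;
        this.delegate = opener.get();
    }

    @Override
    public int read() throws IOException {
        int b = delegate.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // Only the offset is remembered. Nothing is buffered, so readLimit is ignored here.
        markPosition = position;
    }

    @Override
    public synchronized void reset() throws IOException {
        // Close the current stream, reopen from the start, and skip to the marked offset.
        delegate.close();
        delegate = opener.get();
        reopenCount++;
        long remaining = markPosition;
        while (remaining > 0) {
            long skipped = delegate.skip(remaining);
            if (skipped <= 0) {
                throw new IOException("Could not seek back to the marked offset");
            }
            remaining -= skipped;
        }
        position = markPosition;
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = new byte[20_000];
        try (ReopeningInputStream in =
                 new ReopeningInputStream(() -> new ByteArrayInputStream(content))) {
            in.mark(8192);                        // small read limit
            int consumed = in.readAllBytes().length; // read far past the limit
            in.reset();                           // still works: the stream is reopened
            System.out.println("consumed=" + consumed + ", reopened=" + in.reopenCount);
        }
    }
}
```

      With a stream like this, the value passed to mark no longer bounds how far back reset can go, so a small limit such as 8192 costs nothing in correctness while avoiding a large heap buffer.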

            People

              Assignee: Mark Payne (markap14)
              Reporter: Mark Payne (markap14)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 50m