IMPALA-7380

Untracked memory for file metadata like AvroHeader accumulates until end of query

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Backend

    Description

      HdfsScanNodeBase maintains a map of per-file metadata objects for use by different scan ranges from the same file, e.g. AvroFileHeader. These are not cleaned up until the end of the query.

      Note that because of IMPALA-6932 this doesn't necessarily increase peak memory significantly (because the headers are all accumulated during the header-parsing phase anyway).

      We should track the number of scanners remaining for each file and delete the headers when we no longer need them.

      How to reproduce

      Create an Avro table with a large number of files (e.g. 10000).

      Run an Avro scan on a single node:

      set num_nodes=1;
      select * from table where foo = 'bar';
      

      Notice on the /memz debug page that untracked memory grows substantially while the query runs, then drops once the query is cancelled or finishes.

      Proposed fix

      Values from HdfsScanNodeBase::per_file_metadata_ should be removed and the metadata object deleted once all scanners for that file/partition combination are finished. We already know the expected number of scan ranges per file from HdfsFileDesc::splits so we can delete the object once all scan ranges for the file are finished.

      I can see two options here, both of which involve evicting members from per_file_metadata_ at different points:

      1. unique ownership: per_file_metadata_ owns the metadata objects via a unique_ptr and maintains a refcount that is decremented by the scanner when it is done (e.g. by BaseSequenceScanner::Close()).
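      2. shared ownership: per_file_metadata_ stores a shared_ptr and maintains a count of the scanners that have yet to take a copy. The count is decremented each time a scanner copies the shared_ptr, and the map entry is evicted when it reaches zero, so the object is freed when the last scanner drops its copy.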
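
      A minimal sketch of option 1, not Impala's actual code. FileMetadata, PerFileMetadataRegistry, the string key and the method names are all hypothetical stand-ins; in the real scan node the key would be the file/partition combination and the initial count would come from the number of splits for the file.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a per-file metadata object such as AvroFileHeader.
struct FileMetadata {
  std::string schema;
};

// Hypothetical registry that uniquely owns each metadata object and frees it
// once the expected number of scan ranges have released it.
class PerFileMetadataRegistry {
 public:
  // Takes ownership of 'metadata' and records how many scan ranges will use it.
  void Add(const std::string& key, std::unique_ptr<FileMetadata> metadata,
           int num_scan_ranges) {
    assert(num_scan_ranges > 0);
    std::lock_guard<std::mutex> l(lock_);
    entries_[key] = Entry{std::move(metadata), num_scan_ranges};
  }

  // Returns a borrowed pointer, or nullptr if there is no entry for 'key'.
  // The pointer is only valid until the caller's own Release() call.
  FileMetadata* Get(const std::string& key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(key);
    return it == entries_.end() ? nullptr : it->second.metadata.get();
  }

  // Called once per scan range when its scanner is closed. Returns true if
  // this was the last user and the metadata object was deleted.
  bool Release(const std::string& key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(key);
    assert(it != entries_.end());
    if (--it->second.remaining > 0) return false;
    entries_.erase(it);  // Destroys the unique_ptr and the object it owns.
    return true;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return entries_.size();
  }

 private:
  struct Entry {
    std::unique_ptr<FileMetadata> metadata;
    int remaining;  // Scan ranges that have not yet called Release().
  };

  std::mutex lock_;
  std::unordered_map<std::string, Entry> entries_;
};
```

      In this sketch each scanner must call Release() exactly once, including on error and cancellation paths, which is the extra coupling with the scanners that this option introduces.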

      I think #1 is better since it's more consistent with our usual memory management. The nice thing about #2 though is that the interaction with the scanners is simpler.
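
      For comparison, a minimal sketch of option 2, again with hypothetical names (SharedFileMetadata, SharedMetadataRegistry, Acquire) and a plain string key standing in for the file/partition combination. The registry only counts scanners that have not yet taken a copy; it never hears from a scanner again after that.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a per-file metadata object such as AvroFileHeader.
struct SharedFileMetadata {
  std::string schema;
};

// Hypothetical registry that shares ownership with the scanners. It drops its
// own reference once every expected scanner has taken a copy.
class SharedMetadataRegistry {
 public:
  // Stores 'metadata' and records how many scanners are expected to acquire it.
  void Add(const std::string& key, std::shared_ptr<SharedFileMetadata> metadata,
           int num_scan_ranges) {
    assert(num_scan_ranges > 0);
    std::lock_guard<std::mutex> l(lock_);
    entries_[key] = Entry{std::move(metadata), num_scan_ranges};
  }

  // Hands the scanner its own shared_ptr copy, or nullptr if there is no entry.
  // When the last expected scanner has acquired, the entry is evicted; the
  // object stays alive until the last scanner destroys its copy.
  std::shared_ptr<SharedFileMetadata> Acquire(const std::string& key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(key);
    if (it == entries_.end()) return nullptr;
    std::shared_ptr<SharedFileMetadata> result = it->second.metadata;
    if (--it->second.pending == 0) entries_.erase(it);
    return result;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return entries_.size();
  }

 private:
  struct Entry {
    std::shared_ptr<SharedFileMetadata> metadata;
    int pending;  // Scanners that have not yet called Acquire().
  };

  std::mutex lock_;
  std::unordered_map<std::string, Entry> entries_;
};
```

      Here a scanner needs no explicit release step, since destroying its shared_ptr is enough. The cost is that the lifetime of the object is no longer visible at a single point in the code.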

          People

            afan Alice Fan
            tarmstrong Tim Armstrong