IMPALA-4899

Parquet table writer leaks dictionaries

      Description

      Mostafa Mokhtar found a memory leak while inserting into Parquet files.

      The memz output showed a large amount of untracked memory: the RequestPool peaks sum to roughly 35 GB, nowhere near the Process peak of 100.24 GB:

      Process: Limit=100.00 GB Total=11.20 GB Peak=100.24 GB
        Free Disk IO Buffers: Total=609.44 MB Peak=1.76 GB
        RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
        RequestPool=root.jenkins: Total=0 Peak=31.08 GB
        RequestPool=root.default: Total=0 Peak=2.05 GB
        RequestPool=root.mmokhtar: Total=1.85 GB Peak=2.30 GB
          Query(9341d70e5e64d792:420d626600000000): Limit=80.00 GB Total=1.85 GB Peak=2.04 GB
            Fragment 9341d70e5e64d792:420d626600000001: Total=1.83 GB Peak=2.04 GB
              SORT_NODE (id=1): Total=1.83 GB Peak=1.85 GB
              HDFS_SCAN_NODE (id=0): Total=0 Peak=594.56 MB
              HdfsTableSink: Total=2.94 MB Peak=3.06 MB
              CodeGen: Total=181.00 B Peak=290.00 KB
            Block Manager: Limit=64.00 GB Total=1.85 GB Peak=1.85 GB
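
      As a rough check of the gap in the dump above, the top-level peaks can be summed. This is only a sketch: the text is copied from the dump, the helpers `to_bytes` and `peaks` are hypothetical and not part of Impala, and 1024-based units are assumed. The peaks may have occurred at different times, so their sum is an upper bound on tracked memory and the computed gap is a lower bound on what went untracked.

```python
import re

# Assumption: memz sizes use 1024-based units.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
GB = UNITS["GB"]

# Top-level lines copied from the memz dump in this report.
MEMZ = """\
Process: Limit=100.00 GB Total=11.20 GB Peak=100.24 GB
  Free Disk IO Buffers: Total=609.44 MB Peak=1.76 GB
  RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
  RequestPool=root.jenkins: Total=0 Peak=31.08 GB
  RequestPool=root.default: Total=0 Peak=2.05 GB
  RequestPool=root.mmokhtar: Total=1.85 GB Peak=2.30 GB
"""

# Captures the tracker name (text before the first colon) and its Peak value.
PEAK = re.compile(r"^(?P<name>[^:]+):.*\bPeak=(?P<peak>[\d.]+(?: [KMG]?B)?)")


def to_bytes(text):
    """Convert a size such as '31.08 GB', '181.00 B' or '0' to bytes."""
    parts = text.split()
    unit = parts[1] if len(parts) > 1 else "B"
    return float(parts[0]) * UNITS[unit]


def peaks(dump):
    """Map each tracker name in the dump to its peak usage in bytes."""
    result = {}
    for line in dump.splitlines():
        match = PEAK.match(line.strip())
        if match:
            result[match.group("name")] = to_bytes(match.group("peak"))
    return result


p = peaks(MEMZ)
process_peak = p["Process"]
pool_peak_sum = sum(v for k, v in p.items() if k.startswith("RequestPool="))
# Subtract the other top-level tracker too, so the gap is not overstated.
gap = process_peak - pool_peak_sum - p["Free Disk IO Buffers"]

print(f"Process peak:         {process_peak / GB:.2f} GB")
print(f"Sum of pool peaks:    {pool_peak_sum / GB:.2f} GB")
print(f"Unaccounted (at least): {gap / GB:.2f} GB")
```

      With these numbers the pool peaks sum to roughly 35.43 GB, leaving at least about 63 GB of the Process peak unaccounted for by any tracker.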
      

      I was able to get a heap growth profile from the live impalad (see https://cwiki.apache.org/confluence/display/IMPALA/Collecting+Impala+CPU+and+Heap+Profiles). I've attached the --pdf output (heap-growth.pdf), which shows that DictEncoders account for much of the heap growth.

      This looks like the same bug as IMPALA-2940 except on the write path.

        Attachments

        1. heap-growth.pdf (12 kB, attached by Tim Armstrong)

            People

            • Assignee: Joe McDonnell (joemcdonnell)
            • Reporter: Tim Armstrong (tarmstrong)