
IMPALA-4899: Parquet table writer leaks dictionaries

    Details

      Description

      Mostafa Mokhtar found a memory leak while inserting into Parquet files.

      The /memz debug page showed a lot of untracked memory (notice how the RequestPool peak memory values don't sum to anywhere near the Process peak memory):

      Process: Limit=100.00 GB Total=11.20 GB Peak=100.24 GB
        Free Disk IO Buffers: Total=609.44 MB Peak=1.76 GB
        RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
        RequestPool=root.jenkins: Total=0 Peak=31.08 GB
        RequestPool=root.default: Total=0 Peak=2.05 GB
        RequestPool=root.mmokhtar: Total=1.85 GB Peak=2.30 GB
          Query(9341d70e5e64d792:420d626600000000): Limit=80.00 GB Total=1.85 GB Peak=2.04 GB
            Fragment 9341d70e5e64d792:420d626600000001: Total=1.83 GB Peak=2.04 GB
              SORT_NODE (id=1): Total=1.83 GB Peak=1.85 GB
              HDFS_SCAN_NODE (id=0): Total=0 Peak=594.56 MB
              HdfsTableSink: Total=2.94 MB Peak=3.06 MB
              CodeGen: Total=181.00 B Peak=290.00 KB
            Block Manager: Limit=64.00 GB Total=1.85 GB Peak=1.85 GB
      

      I was able to get a heap growth profile from the live impalad (see https://cwiki.apache.org/confluence/display/IMPALA/Collecting+Impala+CPU+and+Heap+Profiles). I've attached the output of --pdf, which shows that DictEncoders are responsible for a lot of the heap growth.

      This looks like the same bug as IMPALA-2940 except on the write path.

      Attachments

      heap-growth.pdf (12 kB), attached by Tim Armstrong

        Activity

        Tim Armstrong added a comment -

        If you look at Nong's comment on IMPALA-1440, I think he was encountering the same bug.

        Tim Armstrong added a comment -

        Adding this to the relevant epic so that it's more easily findable.

        Mostafa Mokhtar added a comment -

        Repro

          
        create table orders_p (
          O_CUSTKEY BIGINT,
          O_ORDERSTATUS STRING,
          O_TOTALPRICE DOUBLE,
          O_ORDERPRIORITY STRING,
          O_CLERK STRING,
          O_SHIPPRIORITY BIGINT,
          O_COMMENT STRING,
          O_ORDERDATE STRING)
        partitioned by (O_ORDERKEY BIGINT)
        stored as parquet;
        

        insert overwrite table orders_p partition(O_ORDERKEY) /* +clustered,noshuffle */
        select O_CUSTKEY,
          O_ORDERSTATUS,
          O_TOTALPRICE,
          O_ORDERPRIORITY,
          O_CLERK,
          O_SHIPPRIORITY,
          O_COMMENT,
          O_ORDERDATE,
          O_ORDERKEY
        from orders where o_orderkey < 50000;

        
        

        Change 50000 to however many partitions are needed.

        Joe McDonnell added a comment -

        The dictionary encoder definitely allocates a lot of memory outside of our memory pools. However, it seems to me that this may be legitimate memory usage and not a leak. For every partition, there is a parquet table writer, which has a dictionary encoder. The writers for all partitions are in memory simultaneously. The rough equation of memory consumption for an individual dictionary encoder is:

        memory = (# buckets * 2 bytes) + (# dictionary entries * (2 bytes + size of datatype)) + (# buffered indices * 4 bytes)

        1. buckets = 65535
        2. dictionary entries < 40000, depending on the number of distinct values
        3. size of datatype: 8 bytes for BIGINT and DOUBLE, 16 bytes for STRING

        For a BIGINT column with 10000 distinct values and 10000 buffered indices, this works out to:
        memory = (65535 * 2 bytes) + (10000 * 10 bytes) + (10000 * 4 bytes) = 131070 + 100000 + 40000 = 271070 bytes
        Multiplying this by 50000 partitions gives ~13 GB. Given multiple columns, I think it is plausible for the dictionaries to use 40 GB when processing 50,000 partitions.
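
        To make the arithmetic concrete, here is a back-of-the-envelope sketch in C++ (illustrative only, not Impala code; DictEncoderEstimate is a made-up helper that just evaluates the equation above):

        #include <cstdint>
        #include <iostream>

        // Back-of-the-envelope only: evaluates the rough equation
        // memory = buckets*2 + entries*(2 + type_size) + buffered_indices*4.
        int64_t DictEncoderEstimate(int64_t buckets, int64_t entries,
                                    int64_t type_size, int64_t buffered_indices) {
          return buckets * 2 + entries * (2 + type_size) + buffered_indices * 4;
        }

        int main() {
          // BIGINT example from above: 65535 buckets, 10000 entries of 8 bytes,
          // 10000 buffered indices.
          int64_t per_encoder = DictEncoderEstimate(65535, 10000, 8, 10000);
          std::cout << per_encoder << " bytes per encoder\n";  // 271070
          // One such encoder stays live per open partition:
          std::cout << per_encoder * 50000 / 1e9 << " GB over 50000 partitions\n";  // ~13.55
        }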
        Joe McDonnell added a comment -

        Additional info from Mostafa:
        This is about the sorted codepath where multiple writers should not be open simultaneously. In this case, the memory should be limited to a single dictionary per column.

        Joe McDonnell added a comment -

        The problem is that various objects are added to the RuntimeState object pool. This includes the ColumnWriters for HdfsParquetTableWriter's columns_ and the OutputPartitions (which hold smart pointers to HdfsParquetTableWriters) in HdfsTableSink. These objects are only freed at the end of the query, so memory accumulates across all of the partitions.
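
        A minimal sketch of that accumulation pattern, using made-up stand-ins (QueryPool and PartitionState are hypothetical; the real classes are RuntimeState's object pool, OutputPartition, and the ColumnWriters):

        #include <memory>
        #include <vector>

        // Hypothetical stand-in for per-partition writer state (dictionaries etc.).
        struct PartitionState { std::vector<char> dict = std::vector<char>(271070); };

        // Simplified query-lifetime pool: objects added here are destroyed only
        // when the pool itself is destroyed, i.e. at the end of the query.
        class QueryPool {
         public:
          PartitionState* Add(PartitionState* p) {
            owned_.emplace_back(p);
            return p;
          }
         private:
          std::vector<std::unique_ptr<PartitionState>> owned_;
        };

        int main() {
          QueryPool pool;  // lives for the whole query
          for (int i = 0; i < 1000; ++i) {  // scale to 50000 for the reported case
            PartitionState* part = pool.Add(new PartitionState());
            // ... write all rows for partition i, close its file ...
            (void)part;  // the partition is finished, but the pool keeps it alive
          }
          // Only here do all 1000 partitions' dictionaries get freed at once.
        }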

        Joe McDonnell added a comment -

        commit 642b8f1b5d5493dc9e3aa55a973ef92094d4dbc9
        Author: Joe McDonnell <joemcdonnell@cloudera.com>
        Date: Mon Feb 27 16:13:38 2017 -0800

        IMPALA-4899: Fix parquet table writer dictionary leak

        Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
        object pool to be freed at the end of the query. However, for clustered inserts
        into a partitioned table, the OutputPartitions are only used one at a time.
        They can be immediately freed once done writing to that partition.

        In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
        this object pool. These constitute a significant amount of memory, as they
        contain the dictionaries for Parquet encoding.

        This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
        that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
        uses a unique_ptr on the PartitionPair for the OutputPartition.

        The table writers maintain a pointer to the OutputPartition. This remains a
        raw pointer. This is safe, because OutputPartition has a scoped_ptr to the
        table writer. The table writer will never outlive the OutputPartition.

        Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
        Reviewed-on: http://gerrit.cloudera.org:8080/6181
        Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
        Tested-by: Impala Public Jenkins
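
        The resulting ownership structure, sketched with simplified stand-ins (the shapes follow the commit message; member names are approximate, not copied from the actual headers):

        #include <memory>

        struct OutputPartition;  // forward declaration for the back-pointer

        struct HdfsParquetTableWriter {
          explicit HdfsParquetTableWriter(OutputPartition* parent) : parent_(parent) {}
          // Raw back-pointer is safe: the OutputPartition owns this writer,
          // so the writer can never outlive it.
          OutputPartition* parent_;
          // Column writers (and their dictionaries) are now owned by the
          // writer via unique_ptr rather than by the query-lifetime pool.
        };

        struct OutputPartition {
          std::unique_ptr<HdfsParquetTableWriter> writer;  // scoped_ptr in Impala
        };

        int main() {
          // The sink's PartitionPair now holds the OutputPartition via a
          // unique_ptr, so a clustered insert can free each partition as soon
          // as it is done writing:
          std::unique_ptr<OutputPartition> current(new OutputPartition());
          current->writer.reset(new HdfsParquetTableWriter(current.get()));
          // ... write all rows for this partition ...
          current.reset();  // frees the writer and its dictionaries now,
                            // not at the end of the query
        }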

        Silvius Rus added a comment -

        Joe McDonnell, does this affect versions before 2.9? If so, do you know which ones?

        Joe McDonnell added a comment -

        Silvius Rus, IMPALA-2523 implemented clustered inserts into a partitioned table. That is the functionality that is most impacted by this issue, and it was merged in 2.8.


          People

          Assignee: Joe McDonnell
          Reporter: Tim Armstrong