HIVE-2082

Reduce memory consumption in preparing MapReduce job

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The Hive client side consumes a lot of memory when the number of input partitions is large. One reason is that each partition maintains a list of FieldSchema objects, which are intended to deal with schema evolution. However, they are not currently used, and Hive uses the table-level schema for all partitions. This will be fixed in HIVE-2050; the memory consumption by this part will be reduced by almost half (1.2GB to 700MB for 20k partitions).

      Another large chunk of memory is consumed in the MapReduce job setup phase, when a PartitionDesc is created from each Partition object. A Properties object maintained in PartitionDesc contains a full list of columns and types. For the same reason, these should be the same as in the table-level schema. The deserializer initialization also takes a large amount of memory, which should be avoided. My initial testing of these optimizations cut memory consumption in half (700MB to 300MB for 20k partitions).
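      The sharing idea behind the optimization can be sketched in plain Java. This is illustrative only, under the assumption described above (all partitions share the table-level column schema); the class and property names below are hypothetical, not the actual Hive code:

```java
import java.util.Properties;

// Hypothetical sketch: instead of building a fresh Properties object
// (columns, types) for every partition, all partitions reuse the single
// table-level Properties object, so only one copy lives on the heap.
public class SharedSchemaSketch {
    public static void main(String[] args) {
        // Table-level schema, built once.
        Properties tableProps = new Properties();
        tableProps.setProperty("columns", "id,name,ts");
        tableProps.setProperty("columns.types", "int,string,bigint");

        int numPartitions = 20000;
        Properties[] partitionProps = new Properties[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            // Reuse the same object rather than cloning it per partition.
            partitionProps[i] = tableProps;
        }

        // Every partition sees identical schema metadata.
        System.out.println(partitionProps[0] == partitionProps[19999]); // true
        System.out.println(partitionProps[123].getProperty("columns")); // id,name,ts
    }
}
```

      With 20k partitions, the per-partition cost drops from one Properties object (plus its copied column strings) to a single shared reference, which is the shape of the savings reported above.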

      1. HIVE-2082.patch (286 kB), Ning Zhang
      2. HIVE-2082.patch (286 kB), Ning Zhang
      3. HIVE-2082.patch (286 kB), Ning Zhang

        Activity

        Namit Jain added a comment -

        Committed. Thanks Ning

        Namit Jain added a comment -

        OK

        +1

        Namit Jain added a comment -

        Minor comments on the review board.

        Ning Zhang added a comment -

        @Edward, HIVE-1913 fixed a bug in PartitionDesc where table properties were previously returned even when partition properties were present. This patch doesn't change that.

        What this patch changes is how PartitionDesc.properties is constructed. Previously, the properties were constructed using part.getSchema(), which constructs a new Properties object for each partition. The most memory-consuming parts are the colNames, colTypes, and partStrings (see MetaStoreUtils.getSchema()). Since they are constructed from the table-level StorageDescriptor, all partitions have the same colNames, colTypes, and partStrings, so we can use the same objects for all partitions.

        This patch introduces a new PartitionDesc constructor with an additional TableDesc argument. The properties are constructed using part.getSchemaFromTableSchema(tblDesc.getProperties()), which builds the properties by first cloning the table-level properties into the partition-level properties and then overwriting them with partition-specific attributes. Basically, everything except the colNames, colTypes, and partStrings is overwritten with the partition-level Properties.
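        The clone-then-overwrite construction described above can be sketched as follows. This is a minimal standalone illustration, not the actual getSchemaFromTableSchema() implementation; the helper name, keys, and paths are assumptions for the example:

```java
import java.util.Properties;

// Illustrative sketch of the clone-then-overwrite idea: start from a clone
// of the table-level properties, then overwrite only partition-specific
// entries. The clone copies references to the shared value strings
// (columns, types), so they are never rebuilt per partition.
public class CloneThenOverwrite {
    // Hypothetical helper mirroring the described behavior.
    static Properties schemaFromTableSchema(Properties tableProps,
                                            Properties partitionOverrides) {
        Properties p = (Properties) tableProps.clone();
        // Partition-specific keys (e.g. location) win; inherited entries
        // such as "columns" are left untouched.
        for (String key : partitionOverrides.stringPropertyNames()) {
            p.setProperty(key, partitionOverrides.getProperty(key));
        }
        return p;
    }

    public static void main(String[] args) {
        Properties table = new Properties();
        table.setProperty("columns", "id,name,ts");
        table.setProperty("location", "/warehouse/t");

        Properties part = new Properties();
        part.setProperty("location", "/warehouse/t/ds=2011-04-01");

        Properties merged = schemaFromTableSchema(table, part);
        System.out.println(merged.getProperty("columns"));  // id,name,ts
        System.out.println(merged.getProperty("location")); // /warehouse/t/ds=2011-04-01
    }
}
```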

        Namit Jain added a comment -

        Edward, I haven't reviewed the patch in detail, but the general idea is as follows:

        Partition inherits some properties from the Table (e.g. columns), and
        others can be overwritten (e.g. serde).

        Today, we treat all the properties the same way; this patch should optimize
        for the inherited properties by maintaining just one copy.
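        The inherit-vs-overwrite split above maps naturally onto java.util.Properties' built-in defaults chaining. This is an alternative sketch of the same idea, not what the patch actually does (the patch clones and overwrites); the property values are made up for illustration:

```java
import java.util.Properties;

// Sketch of "one shared copy of inherited properties": each partition
// holds only its own overrides and falls back to the single table-level
// Properties object passed as the defaults.
public class InheritedProps {
    public static void main(String[] args) {
        Properties table = new Properties();
        table.setProperty("columns", "id,name,ts");    // inherited
        table.setProperty("serde", "LazySimpleSerDe"); // overridable

        // One lightweight Properties per partition, backed by the table's.
        Properties partition = new Properties(table);
        partition.setProperty("serde", "ColumnarSerDe"); // partition override

        System.out.println(partition.getProperty("columns")); // id,name,ts
        System.out.println(partition.getProperty("serde"));   // ColumnarSerDe
    }
}
```

        With defaults chaining, the inherited keys genuinely exist in only one place, so updating the table-level copy is visible to every partition; the cloning approach trades that for independent per-partition snapshots.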

        Edward Capriolo added a comment -

        I am curious as to how this is compatible with https://issues.apache.org/jira/browse/HIVE-1913.

        Ning Zhang added a comment -

        Attaching a patch for review. The review board is at https://reviews.apache.org/r/556/

        This patch also passed all unit tests.

        Ning Zhang added a comment -

        Attaching a patch for review. The review board: https://reviews.apache.org/r/556/

        It also passed all unit tests.

        Ning Zhang added a comment -

        Uploading a patch for review. The review board request is here: https://reviews.apache.org/r/556/

        It also passed all unit tests.


          People

          • Assignee: Ning Zhang
          • Reporter: Ning Zhang
          • Votes: 0
          • Watchers: 1