Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-25800

loadDynamicPartitions in Hive.java should not load all partitions of a managed table

    XMLWordPrintableJSON

Details

    Description

      HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to not add partitions one by one in HMS. As part of that improvement, following code was introduced: 

      // fetch all the partitions matching the part spec using the partition iterable
      // this way the maximum batch size configuration parameter is considered
      PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, partSpec,
                conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 300));
      Iterator<Partition> iterator = partitionIterable.iterator();
      
      // Match valid partition path to partitions
      while (iterator.hasNext()) {
        Partition partition = iterator.next();
        partitionDetailsMap.entrySet().stream()
                .filter(entry -> entry.getValue().fullSpec.equals(partition.getSpec()))
                .findAny().ifPresent(entry -> {
                  entry.getValue().partition = partition;
                  entry.getValue().hasOldPartition = true;
                });
      } 

      The above code fetches all the existing partitions for a table from HMS and compare that dynamic partitions list to decide old and new partitions to be added to HMS (in batches). The call to fetch all partitions has introduced a performance regression for tables with large number of partitions (of the order of 100K). 

       

      This is fixed for external tables in https://issues.apache.org/jira/browse/HIVE-25178.  However for ACID tables there is an open Jira(HIVE-25187). Until we have an appropriate fix in HIVE-25187, we can apply the following: 

      Skip fetching all partitions. Instead, in the threadPool which loads each partition individually,  call get_partition() to check if the partition already exists in HMS or not.  

      This will introduce additional getPartition() call for every partition to be loaded dynamically but removes fetching all existing partitions for a table. 

      I believe this is fine since for tables with small number of existing partitions in HMS - getPartitions() won't add too much overhead but for tables with large number of existing partitions, it will certainly avoid getting all partitions from HMS 

      cc - lpinter ngangam 

      Attachments

        Issue Links

          Activity

            People

              sourabh912 Sourabh Goyal
              sourabh912 Sourabh Goyal
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m