Hive / HIVE-24948

Enhancing performance of OrcInputFormat.getSplits with bucket pruning


Details

    Description

      The summarized flow of generating input splits at the Tez AM (via HiveSplitGenerator.initialize()) is as follows; a minimal sketch follows the list.

      1. Perform dynamic partition pruning
      2. Get the list of InputSplits by calling InputFormat.getSplits()
        https://github.com/apache/hive/blob/624f62aadc08577cafaa299cfcf17c71fa6cdb3a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L260-L260
      3. Perform bucket pruning with the list above if possible
        https://github.com/apache/hive/blob/624f62aadc08577cafaa299cfcf17c71fa6cdb3a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L299-L301

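      A minimal Java sketch of the current shape (abbreviated; the real logic is at the two links above, and pruneBuckets() here is only a stand-in for the actual pruning call):

      import java.io.IOException;
      import org.apache.hadoop.mapred.InputFormat;
      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobConf;

      class SplitFlowSketch {
        InputSplit[] generate(InputFormat<?, ?> inputFormat, JobConf jobConf,
                              int availableSlots, boolean pruningEnabled) throws IOException {
          // Step 2: every split is materialized up front (43680 in the example below).
          InputSplit[] splits = inputFormat.getSplits(jobConf, availableSlots);
          // Step 3: with bucket pruning on, most of them are then discarded.
          return pruningEnabled ? pruneBuckets(splits) : splits;
        }

        InputSplit[] pruneBuckets(InputSplit[] splits) {
          return splits; // stand-in for HiveSplitGenerator's bucket pruning
        }
      }
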
      However, I observed that step 2, getting the list of InputSplits, can cause significant overhead when the inputs are ORC files on HDFS.

      For example, there is an ORC table T partitioned by 'log_date', and each partition is bucketed by a column 'q'. There are 240 buckets in each partition, and each bucket (one ORC file) is, say, 100MB in size.

      The SQL is like this.  

      set hive.tez.bucket.pruning=true;
      select count(*) from T
      where log_date between '2020-01-01' and '2020-06-30'
          and q = 'foobar';

      It means there are 240 * 182 (days) = 43680 ORC files in the input paths, but thanks to bucket pruning, only 182 files (one per partition) should be processed.
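
      For context, pruning can narrow each partition down to one file because the bucket id of q = 'foobar' is fixed by the hash used at write time. A minimal illustration, assuming the classic string bucketing hash; the real, version-dependent computation lives in ObjectInspectorUtils:

      // Illustrative only: Hive's actual bucket-id computation depends on the
      // bucketing version and the type-specific hash in ObjectInspectorUtils.
      static int bucketFor(String value, int numBuckets) {
        int hash = value.hashCode();                    // assumed string hash
        return (hash & Integer.MAX_VALUE) % numBuckets;
      }
      // bucketFor("foobar", 240) picks the single bucket file per partition that
      // can contain matching rows; the other 239 files can be skipped.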

      In my company's environment, the whole processing time of the SQL was roughly 5 minutes, and more than 3 minutes of it was spent building the list of OrcSplits for the 43680 ORC files. The logs with tez.am.log.level=DEBUG showed the following;

      2021-03-25 01:21:31,850 [DEBUG] [InputInitializer {Map 1} #0] |orc.OrcInputFormat|: getSplits started
      ...
      2021-03-25 01:24:51,435 [DEBUG] [InputInitializer {Map 1} #0] |orc.OrcInputFormat|: getSplits finished
      2021-03-25 01:24:51,444 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: number of splits 43680
      2021-03-25 01:24:51,444 [DEBUG] [InputInitializer {Map 1} #0] |log.PerfLogger|: </PERFLOG method=getSplits start=1616602891776 end=1616602891776 duration=199668 from=org.apache.hadoop.hive.ql.io.HiveInputFormat>
      ...
      2021-03-25 01:26:03,385 [INFO] [Dispatcher thread {Central}] |app.DAGAppMaster|: DAG completed, dagId=dag_1615862187190_731117_1, dagState=SUCCEEDED 

      43680 - 182 = 43498 InputSplits, which consume about 60% of the entire processing time (the PerfLogger duration above is 199668 ms, about 3.3 of the roughly 5 minutes), are simply discarded by step 3, pruneBuckets().


      With bucket pruning in effect, I think building the whole list of ORC input splits is unnecessary.

      Therefore, I suggest changing the flow like this (see the sketch after the list);

      1. Perform dynamic partition pruning
      2. Get the list of InputSplits by calling InputFormat.getSplits()
        1. OrcInputFormat.getSplits() returns the bucket-pruned list if the BitSet from FixedBucketPruningOptimizer exists
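
      A rough sketch of the idea, assuming OrcInputFormat.getSplits() can read the pruning BitSet from the job configuration (the plumbing from FixedBucketPruningOptimizer is what the attached patch wires up; the helper names below are hypothetical):

      import java.util.ArrayList;
      import java.util.BitSet;
      import java.util.List;

      class EarlyBucketFilterSketch {
        // Hypothetical early filter inside split generation: drop pruned bucket
        // files before any per-file work (footer reads, split math) happens.
        static List<String> filterBucketFiles(List<String> fileNames, BitSet includedBuckets) {
          List<String> kept = new ArrayList<>();
          for (String name : fileNames) {
            int bucketId = bucketIdOf(name);            // e.g. "000007_0" -> 7
            if (includedBuckets == null || bucketId < 0 || includedBuckets.get(bucketId)) {
              kept.add(name);                           // only these become OrcSplits
            }
          }
          return kept;
        }

        static int bucketIdOf(String fileName) {
          int underscore = fileName.indexOf('_');
          try {
            return underscore > 0 ? Integer.parseInt(fileName.substring(0, underscore)) : -1;
          } catch (NumberFormatException e) {
            return -1;                                  // unrecognized name: keep the file
          }
        }
      }

      Applied to the example above, this drops 43498 of the 43680 files before any ORC-specific work is done for them, instead of after.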

      Attachments

        1. HIVE-24948_3.1.2.patch (7 kB) - Eugene Chung


           People

             Assignee: Eugene Chung (euigeun_chung)
             Reporter: Eugene Chung (euigeun_chung)


           Time Tracking

             Original Estimate: Not Specified
             Remaining Estimate: 0h
             Time Spent: 50m
