HIVE-7826: Dynamic partition pruning on Tez

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: Tez
    • Labels:

      Description

      It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.

      It can also be useful to compute the partitions one would like to scan via a subquery (where p in (select ... from ...)).

      The resulting joins in Hive require a full table scan of the large table, though, because partition pruning takes place at compile time, before the corresponding values are known.
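
      For illustration, the two query shapes described above might look like the following in HiveQL. The tables sales and dim_date and all of their columns are hypothetical names, not taken from this issue; the sketch assumes sales is partitioned by sale_date.

      ```sql
      -- Hypothetical star schema: a large fact table partitioned by date,
      -- and a small date dimension. All names here are assumptions.
      CREATE TABLE sales (item_id BIGINT, amount DOUBLE)
      PARTITIONED BY (sale_date STRING);

      CREATE TABLE dim_date (d_date STRING, d_year INT, d_is_holiday BOOLEAN);

      -- Join form: the partition column is the join key, but the matching
      -- dates are only known once dim_date has been filtered at run time.
      SELECT sum(s.amount)
      FROM sales s JOIN dim_date d ON s.sale_date = d.d_date
      WHERE d.d_is_holiday = true;

      -- Subquery form: the partitions to scan are computed by a subquery.
      SELECT sum(s.amount)
      FROM sales s
      WHERE s.sale_date IN (SELECT d.d_date FROM dim_date d
                            WHERE d.d_is_holiday = true);
      ```

      In both forms the set of sale_date values is not a compile-time constant, which is why ordinary partition pruning cannot narrow the scan of sales.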

      On Tez it's relatively straightforward to send the values needed for pruning to the application master (AM), where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.

      The approach is as follows:

      • Insert synthetic conditions for each join representing "x in (keys of other side in join)"
      • These conditions will be pushed as far down as possible
      • If the condition hits a table scan and the column involved is a partition column:
        • Set up an operator to send key events to the AM
      • Else:
        • Remove the synthetic predicate
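
      As a rough illustration of the first step, the synthetic condition can be thought of as an extra IN filter on the partitioned side of the join. The sketch below uses the agg_01 / dim_shops query that appears later in this thread and assumes agg_01 is partitioned by dim_shops_id. It is a conceptual equivalent only: the patch works on the query plan, not on the SQL text, and the alias d2 is introduced here purely for illustration.

      ```sql
      -- Join as a user might write it (tables taken from the thread below).
      SELECT d1.label, count(*), sum(agg.amount)
      FROM agg_01 agg JOIN dim_shops d1 ON agg.dim_shops_id = d1.id
      WHERE d1.label IN ('foo', 'bar')
      GROUP BY d1.label;

      -- Conceptual equivalent with the synthetic condition made explicit:
      -- "agg.dim_shops_id in (keys of the other side of the join)".
      -- If dim_shops_id is a partition column of agg_01, the key set can be
      -- used to skip partitions; otherwise the condition is simply dropped.
      SELECT d1.label, count(*), sum(agg.amount)
      FROM agg_01 agg JOIN dim_shops d1 ON agg.dim_shops_id = d1.id
      WHERE d1.label IN ('foo', 'bar')
        AND agg.dim_shops_id IN (SELECT d2.id FROM dim_shops d2
                                 WHERE d2.label IN ('foo', 'bar'))
      GROUP BY d1.label;
      ```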

      Add these properties:

      Property                                             Default Value
      hive.tez.dynamic.partition.pruning                   true
      hive.tez.dynamic.partition.pruning.max.event.size    1*1024*1024L
      hive.tez.dynamic.parition.pruning.max.data.size      100*1024*1024L
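
      A session could set these properties explicitly as sketched below. The values simply spell out the defaults from the table in bytes. Note one assumption about the third name: it uses the corrected spelling (partition) that, according to a later comment in this thread, HIVE-8018 introduced for release 0.14.0. Builds of the original patch use the misspelled name shown in the table.

      ```sql
      -- Turn dynamic partition pruning on (or off with false).
      SET hive.tez.dynamic.partition.pruning=true;

      -- Maximum size in bytes of the event a task sends to the AM
      -- (1*1024*1024).
      SET hive.tez.dynamic.partition.pruning.max.event.size=1048576;

      -- Maximum total expected output size in bytes at planning time
      -- (100*1024*1024). Spelling assumed per HIVE-8018.
      SET hive.tez.dynamic.partition.pruning.max.data.size=104857600;
      ```
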
      Attachments

      1. HIVE-7826.1.patch
        447 kB
        Gunther Hagleitner
      2. HIVE-7826.2.patch
        322 kB
        Gunther Hagleitner
      3. HIVE-7826.3.patch
        328 kB
        Gunther Hagleitner
      4. HIVE-7826.4.patch
        330 kB
        Gunther Hagleitner
      5. HIVE-7826.5.patch
        407 kB
        Gunther Hagleitner
      6. HIVE-7826.6.patch
        421 kB
        Gunther Hagleitner
      7. HIVE-7826.7.patch
        440 kB
        Gunther Hagleitner

          Activity

          Gunther Hagleitner added a comment -

          Initial patch. Still a bunch of issues. (see test files)

          Gunther Hagleitner added a comment -

          .2 removes unnecessary addition of hive conf template.

          Gunther Hagleitner added a comment -

          .3 has various fixes. Should be good to go now.

          Gunther Hagleitner added a comment -

          Review board link: https://reviews.apache.org/r/25019/

          Damien Carol added a comment -

          Gunther Hagleitner I'm really interested in this feature. We use partitioning heavily at my company.
          I'd like to help you test it. How can I help?

          Gunther Hagleitner added a comment -

          Damien Carol thank you for your interest. This feature is Tez-only right now. But if you are using Tez and have a cluster with Tez 0.5 running, you can give this a spin. You basically need to use the Apache tez branch and apply this patch. The relevant configs are:

          hive.tez.dynamic.partition.pruning=true (turn it on or off)
          hive.tez.dynamic.partition.pruning.max.event.size=size in bytes (maximum size of the event that the task will send to the AM, if it's bigger it will turn itself off)
          hive.tez.dynamic.parition.pruning.max.data.size=size in bytes (maximum total size of expected output in the planning stage, if expected size is bigger, it will turn itself off)

          Any feedback is welcome, on both functionality and performance. If you describe your use case to me, I will make sure it's covered in the unit tests. If you're game, code review is also welcome.

          Gunther Hagleitner added a comment -

          .4 fixes a small issue with stats annotation for event operators.

          Gunther Hagleitner added a comment -

          .5 fixes problems with unions. Also addresses Vikram Dixit K's review comments.

          Damien Carol added a comment -

          Gunther Hagleitner We used the Apache tez branch and deployed Tez 0.5 to test this patch.
          We haven't seen any performance problems; we simply weren't able to activate the pruning (we don't see anything in the logs).
          Maybe our use case doesn't fit well.
          We use Tez for OLAP analysis, with queries like this one:

          SELECT d1.label, count(*), sum(agg.amount)
          FROM agg_01 agg,
               dim_shops d1
          WHERE agg.dim_shops_id = d1.id
            AND d1.label IN ('foo', 'bar')
          GROUP BY d1.label
          ORDER BY d1.label
          

          I was expecting that if agg_01 is partitioned by dim_shops_id, dynamic pruning would be activated.

          Gunther Hagleitner added a comment -

          Damien Carol your case fits very well. If agg_01 is partitioned by dim_shops_id it should trigger the dynamic pruning. An easy way to verify is to check the explain plan. If the optimization kicks in, you should see something like this:

          Dynamic Partitioning Event Operator
            Target Input: agg_01
            Partition key expr: dim_shops_id
            Target column: dim_shops_id
            Target Vertex: Map 1

          I'll try to create a test case in the unit tests for your query later tonight - let me see if I can get this to work on my end.
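
          A minimal way to run that check, assuming a Hive session on Tez with the agg_01 and dim_shops tables from the query above, is to prefix the query with EXPLAIN and look for the event operator in the plan output. This is a sketch of the verification step, not output from an actual run.

          ```sql
          -- Assumes the Tez execution engine and that agg_01 is
          -- partitioned by dim_shops_id.
          SET hive.execution.engine=tez;
          SET hive.tez.dynamic.partition.pruning=true;

          -- EXPLAIN prints the plan without running the query.
          EXPLAIN
          SELECT d1.label, count(*), sum(agg.amount)
          FROM agg_01 agg,
               dim_shops d1
          WHERE agg.dim_shops_id = d1.id
            AND d1.label IN ('foo', 'bar')
          GROUP BY d1.label
          ORDER BY d1.label;
          ```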

          Gunther Hagleitner added a comment -

          Damien Carol - I've included your test in patch .6. Look at dynamic_partition_pruning_2.q

          The dynamic_partition_pruning_2.q.out file shows the extra operator I was talking about. Also, if you want to see the DynamicPartitionPruner in the logs you have to check the AM log (not hive.log) which is where the pruning takes place.

          Seems your use case works fine - hope this helps.

          Gunther Hagleitner added a comment -

          Damien Carol - here's a link to the new testcase: https://reviews.apache.org/r/25019/diff/2-3/

          Vikram Dixit K added a comment -

          +1 LGTM. Minor comment left on the review board which can only be addressed later.

          Gunther Hagleitner added a comment -

          .7 is rebased.

          Gunther Hagleitner added a comment -

          Committed to branch. Thanks Vikram Dixit K. Damien Carol thanks for trying it out. Let me know if you're still having problems with this. I'll address in follow up if need be.

          Damien Carol added a comment -

          I tested again with the latest version of the tez branch.
          I can confirm that it works. Massive performance improvement with this patch.
          Many of our OLAP cubes are partitioned by year.
          We can now filter down to just 1 or 2 years, which reduces query times.
          Thanks a lot Gunther Hagleitner

          Gunther Hagleitner added a comment -

          Thanks Damien Carol. Your last comment definitely made my day

          Lefty Leverenz added a comment -

          Doc note: This adds three configuration parameters to HiveConf.java, so they need to be documented in the wiki with a link to this JIRA ticket: hive.tez.dynamic.partition.pruning, hive.tez.dynamic.partition.pruning.max.event.size, and hive.tez.dynamic.parition.pruning.max.data.size.

          • Configuration Properties – Tez

          What other documentation is needed? Will there be a release note?

          Lefty Leverenz added a comment -

          Typo alert: hive.tez.dynamic.parition.pruning.max.data.size is misspelled (parition) here and in HIVE-7976 (merge Tez branch). It's even misspelled in the description and the doc comment above. So much for eagle eyes, sigh.

          Does this need a new JIRA ticket or can it be fixed in HIVE-6586 (various HiveConf.java fixes)? The string "hive.tez.dynamic.parition.pruning.max.data.size" only occurs once in each patch – this one and the Tez merge.

          Lefty Leverenz added a comment -

          Typo fixed: HIVE-8018. The parameter name is now hive.tez.dynamic.partition.pruning.max.data.size for release 0.14.0.

          Thejas M Nair added a comment -

          This has been fixed in the 0.14 release. Please open a new JIRA if you see any issues.


            People

            • Assignee: Gunther Hagleitner
            • Reporter: Gunther Hagleitner
            • Votes: 0
            • Watchers: 7
