Hive
  1. Hive
  2. HIVE-7826

Dynamic partition pruning on Tez

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: Tez
    • Labels:

      Description

      It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.

      It can also useful to be to compute the partitions one would like to scan via a subquery (where p in select ... from ...).

      The resulting joins in hive require a full table scan of the large table though, because partition pruning takes place before the corresponding values are known.

      On Tez it's relatively straight forward to send the values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.

      The approach is straight forward:

      • Insert synthetic conditions for each join representing "x in (keys of other side in join)"
      • This conditions will be pushed as far down as possible
      • If the condition hits a table scan and the column involved is a partition column:
      • Setup Operator to send key events to AM
      • else:
      • Remove synthetic predicate

      Add these properties :

      Property Default Value
      hive.tez.dynamic.partition.pruning true
      hive.tez.dynamic.partition.pruning.max.event.size 1*1024*1024L
      hive.tez.dynamic.parition.pruning.max.data.size 100*1024*1024L
      1. HIVE-7826.7.patch
        440 kB
        Gunther Hagleitner
      2. HIVE-7826.6.patch
        421 kB
        Gunther Hagleitner
      3. HIVE-7826.5.patch
        407 kB
        Gunther Hagleitner
      4. HIVE-7826.4.patch
        330 kB
        Gunther Hagleitner
      5. HIVE-7826.3.patch
        328 kB
        Gunther Hagleitner
      6. HIVE-7826.2.patch
        322 kB
        Gunther Hagleitner
      7. HIVE-7826.1.patch
        447 kB
        Gunther Hagleitner

        Issue Links

          Activity

          Thejas M Nair made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gunther Hagleitner made changes -
          Issue Type Bug [ 1 ] New Feature [ 2 ]
          Lefty Leverenz made changes -
          Link This issue relates to HIVE-8018 [ HIVE-8018 ]
          Gunther Hagleitner made changes -
          Fix Version/s 0.14.0 [ 12326450 ]
          Gunther Hagleitner made changes -
          Link This issue relates to HIVE-7976 [ HIVE-7976 ]
          Damien Carol made changes -
          Component/s Tez [ 12320810 ]
          Gunther Hagleitner made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Gunther Hagleitner made changes -
          Attachment HIVE-7826.7.patch [ 12666205 ]
          Gunther Hagleitner made changes -
          Link This issue is related to TEZ-1447 [ TEZ-1447 ]
          Gunther Hagleitner made changes -
          Attachment HIVE-7826.6.patch [ 12665833 ]
          Damien Carol made changes -
          Description It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.

          It can also useful to be to compute the partitions one would like to scan via a subquery (where p in select ... from ...).

          The resulting joins in hive require a full table scan of the large table though, because partition pruning takes place before the corresponding values are known.

          On Tez it's relatively straight forward to send the values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.

          The approach is straight forward:

          - Insert synthetic conditions for each join representing "x in (keys of other side in join)"
          - This conditions will be pushed as far down as possible
          - If the condition hits a table scan and the column involved is a partition column:
             - Setup Operator to send key events to AM
          - else:
             - Remove synthetic predicate

          Add these properties :
          ||Property||Default Value||Com||
          |{{hive.tez.dynamic.partition.pruning}}|true||
          |{{hive.tez.dynamic.partition.pruning.max.event.size}}|1*1024*1024L||
          |{{hive.tez.dynamic.partition.pruning.max.event.size}}|1*1024*1024L||
          It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.

          It can also useful to be to compute the partitions one would like to scan via a subquery (where p in select ... from ...).

          The resulting joins in hive require a full table scan of the large table though, because partition pruning takes place before the corresponding values are known.

          On Tez it's relatively straight forward to send the values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.

          The approach is straight forward:

          - Insert synthetic conditions for each join representing "x in (keys of other side in join)"
          - This conditions will be pushed as far down as possible
          - If the condition hits a table scan and the column involved is a partition column:
             - Setup Operator to send key events to AM
          - else:
             - Remove synthetic predicate

          Add these properties :
          ||Property||Default Value||
          |{{hive.tez.dynamic.partition.pruning}}|true|
          |{{hive.tez.dynamic.partition.pruning.max.event.size}}|1*1024*1024L|
          |{{hive.tez.dynamic.parition.pruning.max.data.size}}|100*1024*1024L|
          Damien Carol made changes -
          Description It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.

          It can also useful to be to compute the partitions one would like to scan via a subquery (where p in select ... from ...).

          The resulting joins in hive require a full table scan of the large table though, because partition pruning takes place before the corresponding values are known.

          On Tez it's relatively straight forward to send the values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.

          The approach is straight forward:

          - Insert synthetic conditions for each join representing "x in (keys of other side in join)"
          - This conditions will be pushed as far down as possible
          - If the condition hits a table scan and the column involved is a partition column:
             - Setup Operator to send key events to AM
          - else:
             - Remove synthetic predicate

          It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.

          It can also useful to be to compute the partitions one would like to scan via a subquery (where p in select ... from ...).

          The resulting joins in hive require a full table scan of the large table though, because partition pruning takes place before the corresponding values are known.

          On Tez it's relatively straight forward to send the values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.

          The approach is straight forward:

          - Insert synthetic conditions for each join representing "x in (keys of other side in join)"
          - This conditions will be pushed as far down as possible
          - If the condition hits a table scan and the column involved is a partition column:
             - Setup Operator to send key events to AM
          - else:
             - Remove synthetic predicate

          Add these properties :
          ||Property||Default Value||Com||
          |{{hive.tez.dynamic.partition.pruning}}|true||
          |{{hive.tez.dynamic.partition.pruning.max.event.size}}|1*1024*1024L||
          |{{hive.tez.dynamic.partition.pruning.max.event.size}}|1*1024*1024L||
          Damien Carol made changes -
          Remote Link This issue links to "Review Board #25019 (Web Link)" [ 17910 ]
          Gunther Hagleitner made changes -
          Attachment HIVE-7826.5.patch [ 12665712 ]
          Gunther Hagleitner made changes -
          Attachment HIVE-7826.4.patch [ 12664516 ]
          Damien Carol made changes -
          Labels tez TODOC14 tez
          Gunther Hagleitner made changes -
          Attachment HIVE-7826.3.patch [ 12664115 ]
          Gunther Hagleitner made changes -
          Attachment HIVE-7826.2.patch [ 12663348 ]
          Gunther Hagleitner made changes -
          Link This issue is blocked by HIVE-6988 [ HIVE-6988 ]
          Gunther Hagleitner made changes -
          Labels tez
          Gunther Hagleitner made changes -
          Field Original Value New Value
          Attachment HIVE-7826.1.patch [ 12663346 ]
          Gunther Hagleitner created issue -

            People

            • Assignee:
              Gunther Hagleitner
              Reporter:
              Gunther Hagleitner
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development