Description
It's natural in a star schema to map one or more dimensions to partition columns. Time or location are likely candidates.
It can also useful to be to compute the partitions one would like to scan via a subquery (where p in select ... from ...).
The resulting joins in hive require a full table scan of the large table though, because partition pruning takes place before the corresponding values are known.
On Tez it's relatively straight forward to send the values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.
The approach is straight forward:
- Insert synthetic conditions for each join representing "x in (keys of other side in join)"
- This conditions will be pushed as far down as possible
- If the condition hits a table scan and the column involved is a partition column:
- Setup Operator to send key events to AM
- else:
- Remove synthetic predicate
Add these properties :
Property | Default Value |
---|---|
hive.tez.dynamic.partition.pruning | true |
hive.tez.dynamic.partition.pruning.max.event.size | 1*1024*1024L |
hive.tez.dynamic.parition.pruning.max.data.size | 100*1024*1024L |
Attachments
Attachments
Issue Links
- is blocked by
-
HIVE-6988 Hive changes for tez-0.5.x compatibility
- Closed
- is duplicated by
-
HIVE-5119 MapJoin & Partition Pruning (MapJoin can take advantage of materialized data to prune partitions of big table)
- Resolved
- is related to
-
TEZ-1447 Provide a mechanism for InputInitializers to know about interesting Vertex state changes
- Closed
- relates to
-
HIVE-12228 Hive 0.13.1 Error for nested query with UDF returns Struct type
- Resolved
-
HIVE-7976 Merge tez branch into trunk (tez 0.5.0)
- Closed
-
HIVE-8018 Fix typo in config var name for dynamic partition pruning
- Closed
- links to