Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
Description
TBH I'm not sure how general this issue is, so rather than attempting to express it as abstractly as possible, I'll just start with a specific example (apologies in advance for formatting...horrible JIRA artist here):
# create two tables -- the main point here is that t1 has 2 partitions > create table t1 (i int) partitioned by (part int) ; > create table t2 (i int) partitioned by (part int) ; > insert into t1 (part, i) values (1, 1), (2, 2) ; > insert into t2 (part, i) values (1, 1), (2, 2) ; # query 1: coalesce t1 partition column with literal value — partition pruning kicks in > explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part, 666) as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part = 1; Query: explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part, 666) as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part = 1 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ <SNIP> | 00:SCAN HDFS [default.t1] | | partitions=1/2 files=1 size=2B | +------------------------------------------------------------------------------------+ # ^^^note # partitions scanned^^^ # query 2: coalesce t1 partition column against dynamic value — partition pruning does NOT kick in > explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part, t2.part) as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part = 1; Query: explain with results as ( select coalesce(t1.i, t2.i) as i, coalesce(t1.part, t2.part) as part from t1 full outer join t2 on (t1.i = t2.i) ) select * from results where part = 1 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ <SNIP> | 00:SCAN HDFS [default.t1] | | partitions=2/2 files=2 size=4B | +------------------------------------------------------------------------------------+ # ^^^note # partitions scanned^^^
In both of those queries, we're applying a filter predicate to a column that is defined as follows:
coalesce(t1.part, _)
Since t1.part is a partition column in t1, every row in t1 has a non-null value for t1.part. Furthermore, because t1.part appears as the first parameter to coalesce, every row in the result set that contains any data from t1 will get its "part" value from t1. Thus, all t1 data in the result set will be subject to the filter predicate in t1.part – i.e. all t1 partitions other than (part=1) could have been pruned during query planning.
Question: what is wrong with the above reasoning? am I missing something fundamental/obvious here?
Thanks!
Attachments
Issue Links
- breaks
-
IMPALA-5725 coalesce() not being fully applied with outer joins on kudu tables
- Resolved
- causes
-
IMPALA-7419 NullPointerException in SimplifyConditionalsRule
- Resolved