The problem was with the predicates that were created according to HiveJoinAddNotNullRule. This rule is creating predicates from fields that take part in join filter, no matter if this fields are used as parameters of functions or not.
Create predicate based on functions that take part in filters as well as fields. The point is to check if left part and right part of the filter is not null, not just fields that are part of the join filter. I.e we have to tables test1(a1 int, a2 int) and test2(b1). When we execute following query select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2); we get to predicates for filter operator:
b1 is not null — right part
a1 is not null and a2 is not null – left part
Applying predicate for left part of join will result in data loss as we exclude rows with null fields. COALESCE is a good example for this case as the main purpose of COALESCE function is to get not null values from tables. To fix the data loss we need to check that coalesce won't bring us null values as we can't join nulls. My fix will check that left part and right part will look like:
b1 is not null – right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null values)
In next patch I'm going to change related failed tests with the fixed stage plans.