[SPARK-33152] SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0, 4.0.0
Fix Version/s: None
Component/s: SQL
Labels:
- SPIP
- pull-request-available

Description

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

Proposing new algorithm to create, store and use constraints for removing redundant filters & inferring new filters.
The current algorithm has subpar performance in complex expression scenarios involving aliases( with certain use cases the compilation time can go into hours), potential to cause OOM, may miss removing redundant filters in different scenarios, may miss creating IsNotNull constraints in different scenarios, does not push compound predicates in Join.

This issue if not fixed can cause OutOfMemory issue or unacceptable query compilation times.
Have added a test "plan equivalence with case statements and performance comparison with benefit of more than 10x conservatively" in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. With this PR the compilation time is 247 ms vs 13958 ms without the change
It is more effective in filter pruning as is evident in some of the tests in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite where current code is not able to identify the redundant filter in some cases.
It is able to generate a better optimized plan for join queries as it can push compound predicates.
The current logic can miss a lot of possible cases of removing redundant predicates, as it fails to take into account if same attribute or its aliases are repeated multiple times in a complex expression.
There are cases where some of the optimizer rules involving removal of redundant predicates fail to remove on the basis of constraint data. In some cases the rule works, just by the virtue of previous rules helping it out to cover the inaccuracy. That the ConstraintPropagation rule & its function of removal of redundant filters & addition of new inferred filters is dependent on the working of some of the other unrelated previous optimizer rules is behaving, is indicative of issues.
It does away with all the EqualNullSafe constraints as this logic does not need those constraints to be created.
There is at least one test in existing ConstraintPropagationSuite which is missing a IsNotNull constraints because the code incorrectly generated a EqualsNullSafeConstraint instead of EqualTo constraint, when using the existing Constraints code. With these changes, the test correctly creates an EqualTo constraint, resulting in an inferred IsNotNull constraint
It does away with the current combinatorial logic of evaluation all the constraints can cause compilation to run into hours or cause OOM. The number of constraints stored is exactly the same as the number of filters encountered

Q2. What problem is this proposal NOT designed to solve?

It mainly focuses on compile time performance, but in some cases can benefit run time characteristics too, like inferring IsNotNull filter or pushing down compound predicates on the join, which currently may get missed/ does not happen , respectively, by the present code.

Q3. How is it done today, and what are the limits of current practice?

Current ConstraintsPropagation code, pessimistically tries to generates all the possible combinations of constraints , based on the aliases ( even then it may miss a lot of combinations if the expression is a complex expression involving same attribute repeated multiple times within the expression and there are many aliases to that column). There are query plans in our production env, which can result in intermediate number of constraints going into hundreds of thousands, causing OOM or taking time running into hours. Also there are cases where it incorrectly generates an EqualNullSafe constraint instead of EqualTo constraint , thus missing a possible IsNull constraint on column.
Also it only pushes single column predicate on the other side of the join.
The constraints generated , in some cases, are missing the required ones, and the plan apparently is behaving correctly only due to the preceding unrelated optimizer rule. Have Test which show that with the bare mnimum rules containing RemoveRedundantPredicate, it misses the removal of redundant predicate.

Q4. What is new in your approach and why do you think it will be successful?

It solves all the above mentioned issues.

The number of constraints created are same as the number of filters. No combinatorial creation of constraints. No need for EqualsNullSafe constraint on aliases.
Can remove redundant predicates on any expression involving aliases irrespective of the number of repeat occurences in all possible combination.
Brings down query compilation time to few minutes from hours.
Can push compound predicates on Joins & infer right number of IsNotNull constraints which can impact query runtime also positively.
The proposed algorithm has been running successfully in our env. (WorkDay) for months & has solved all the above issues.

Q5. Who cares? If you are successful, what difference will it make?

For My company WorkDay, it has solved the previously failing plans due to OOM & compilation time running into 10 hrs or so. I suppose there have been previous attempts too, to fix this issue, but did not make progress due to complexity of change.
The PR for the same is

https://github.com/apache/spark/pull/49117

Q6. What are the risks?

Well the changes are little extensive, but thoroughly tested ( old & many new tests added). Have added a lot of tests for Union node, as found that current constraints tests were not sufficient for Union case.
So in that sense , given that all existing tests as well as new tests are clean, this is a safe PR.

Q7. How long will it take?

The PR is already there. Implementation already done. whatever time needed is for review and discussion.

https://github.com/apache/spark/pull/49117

Q8. What are the mid-term and final “exams” to check for success?

All tests should pass.
The perf benefit should justify the changes.

Attachments

Issue Links

is duplicated by

SPARK-35379 Improve InferFiltersFromConstraints rule performance when parsing spark sql

Resolved

is related to

SPARK-29606 Improve EliminateOuterJoin performance

In Progress

links to

[Github] Pull Request #30894 (tanelk)

GitHub Pull Request #49117

SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours