[SPARK-27105] Prevent exponential complexity in ORC `createFilter` - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels:
- performance

Description

`OrcFilters.createFilters` currently has complexity that's exponential in the height of the filter tree. There are multiple places in Spark that try to prevent the generation of skewed trees so as to not trigger this behaviour, for example:

`org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` combines a number of binary logical expressions into a balanced tree.
https://github.com/apache/spark/pull/22313 introduced a change to `OrcFilters` to create a balanced tree instead of a skewed tree.

However, the underlying exponential behaviour can still be triggered by code paths that don't go through any of the tree balancing methods. For example, if one generates a tree of `Column`s directly in user code, there's nothing in Spark that automatically balances that tree and, hence, skewed trees hit the exponential behaviour. We have hit this in production with jobs mysteriously taking hours on the Spark driver with no worker activity, with as few as ~30 OR filters.

I have a fix locally that makes the underlying logic have linear complexity instead of exponential complexity. With this fix, the code can handle thousands of filters in milliseconds. I'll send a PR with the fix soon.

Attachments

Issue Links

links to

GitHub Pull Request #24068

GitHub Pull Request #24783

Activity

People

Assignee:: Ivan Vergiliev

Reporter:: Ivan Vergiliev

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 08/Mar/19 10:24

Updated:: 19/Jun/19 16:00

Resolved:: 19/Jun/19 02:48