[SPARK-24865] Remove AnalysisBarrier - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0, 2.3.1
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None
Target Version/s: 2.4.0

Description

AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed by not re-analyzing nodes that have already been analyzed.

Before AnalysisBarrier, we already had some infrastructure in place, with analysis-specific functions (resolveOperators and resolveExpressions). These functions do not recursively traverse down subplans that are already analyzed (tracked with a mutable boolean flag, _analyzed). The issue with the old system was that developers started using transformDown, which does a top-down traversal of the plan tree and ignores that flag, because there was no top-down resolution function. As a result, analyzer performance became pretty bad.
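To make the difference concrete, here is a minimal sketch on a toy tree. It is not Spark's LogicalPlan; the `Plan` class, `countVisits`, and `demo` are hypothetical names invented for illustration, and only the idea of skipping subtrees whose flag is set is taken from the description above.

```scala
object AnalyzedFlagSketch {
  // Toy plan node, NOT Spark's LogicalPlan.
  final class Plan(val name: String, val children: Seq[Plan]) {
    var analyzed: Boolean = false // mutable flag, in the spirit of _analyzed
    def markAnalyzed(): Unit = {
      children.foreach(_.markAnalyzed())
      analyzed = true
    }
  }

  type Rule = PartialFunction[Plan, Plan]

  // Generic top-down transform: applies the rule, then recurses into every child.
  def transformDown(plan: Plan)(rule: Rule): Plan = {
    val afterRule = rule.applyOrElse(plan, identity[Plan])
    new Plan(afterRule.name, afterRule.children.map(transformDown(_)(rule)))
  }

  // Analysis-specific bottom-up variant: analyzed subtrees are returned untouched.
  def resolveOperators(plan: Plan)(rule: Rule): Plan =
    if (plan.analyzed) plan
    else {
      val withNewChildren =
        new Plan(plan.name, plan.children.map(resolveOperators(_)(rule)))
      rule.applyOrElse(withNewChildren, identity[Plan])
    }

  // Counts how many nodes a traversal hands to the rule.
  def countVisits(traverse: Rule => Plan): Int = {
    var visits = 0
    val countingRule: Rule = { case p => visits += 1; p }
    traverse(countingRule)
    visits
  }

  // Filter over an already-analyzed Project(Relation) subtree.
  def demo(): (Int, Int) = {
    val analyzedSubtree = new Plan("Project", Seq(new Plan("Relation", Nil)))
    analyzedSubtree.markAnalyzed()
    val plan = new Plan("Filter", Seq(analyzedSubtree))
    val down = countVisits(rule => transformDown(plan)(rule))
    val resolve = countVisits(rule => resolveOperators(plan)(rule))
    (down, resolve)
  }
}
```

In this toy setup `demo()` hands all three nodes to the rule under `transformDown`, but only the unanalyzed `Filter` under `resolveOperators`. With many rules and a deep analyzed subtree, that gap is the performance problem described above.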

In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a special node; transform/transformUp/transformDown don't traverse below it. However, the introduction of this special node caused more trouble than it solved. This implicit node breaks assumptions and code in a few places, and it's hard to know when an analysis barrier will exist and when it won't. A simple search for AnalysisBarrier in PR discussions shows that it is a source of bugs and additional complexity.
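One way such a wrapper node can break assumptions is sketched below on a toy tree. This is an illustration only, not Spark code: `Barrier`, the other node classes, and the rule `pushFilterBelowProject` are hypothetical stand-ins. The rule expects a `Filter` directly above a `Project` and silently stops matching once a barrier sits between them.

```scala
object BarrierSketch {
  // Toy plan nodes, NOT Spark's classes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan
  // Wrapper around an already-analyzed subtree; transforms do not descend into it.
  case class Barrier(child: Plan) extends Plan

  type Rule = PartialFunction[Plan, Plan]

  // Top-down transform that stops at Barrier nodes.
  def transformDown(plan: Plan)(rule: Rule): Plan =
    rule.applyOrElse(plan, identity[Plan]) match {
      case b: Barrier           => b
      case Project(cols, child) => Project(cols, transformDown(child)(rule))
      case Filter(cond, child)  => Filter(cond, transformDown(child)(rule))
      case r: Relation          => r
    }

  // Hypothetical rule that assumes a Filter sits directly above a Project.
  val pushFilterBelowProject: Rule = {
    case Filter(cond, Project(cols, child)) => Project(cols, Filter(cond, child))
  }

  val analyzed: Plan = Project(Seq("a"), Relation("t"))

  // Without a barrier the rule fires and the Filter moves below the Project.
  val withoutBarrier: Plan =
    transformDown(Filter("a > 1", analyzed))(pushFilterBelowProject)

  // With a barrier the pattern no longer matches and the plan comes back unchanged.
  val withBarrier: Plan =
    transformDown(Filter("a > 1", Barrier(analyzed)))(pushFilterBelowProject)
}
```

The second call fails quietly: nothing throws, the rule simply never applies. Whether a given rule sees the barrier depends on how the plan was built, which is the "hard to know when it exists" problem.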

Instead, I think a much simpler fix to the original issue is to introduce resolveOperatorsDown, and change all places that call transformDown in the analyzer to use that. We can also ban accidental uses of the various transform* methods by using a linter (which can be restricted to specific packages), or, in test mode, by inspecting the stack trace and failing explicitly if transform* is called in the analyzer.
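A rough sketch of both ideas on a toy tree follows. It is not the actual patch: `Plan`, `ToyAnalyzer`, `failIfCalledFromAnalyzer`, and the class-name check are hypothetical, and the analyzed flag is modeled as an immutable field for brevity. `resolveOperatorsDown` is the top-down counterpart that skips analyzed subtrees, and the guard inspects the stack trace to reject `transformDown` calls made from analyzer code.

```scala
object ResolveDownSketch {
  // Toy plan node, NOT Spark's LogicalPlan.
  final case class Plan(name: String, children: Seq[Plan], analyzed: Boolean = false)

  type Rule = PartialFunction[Plan, Plan]

  // Test-mode guard (assumption: analyzer code lives in a class named *ToyAnalyzer*).
  private def failIfCalledFromAnalyzer(method: String): Unit = {
    val calledFromAnalyzer =
      Thread.currentThread().getStackTrace.exists(_.getClassName.contains("ToyAnalyzer"))
    if (calledFromAnalyzer) {
      throw new IllegalStateException(
        s"$method called inside the analyzer; use resolveOperatorsDown instead")
    }
  }

  // Generic top-down transform, banned inside the analyzer by the guard.
  def transformDown(plan: Plan)(rule: Rule): Plan = {
    failIfCalledFromAnalyzer("transformDown")
    val afterRule = rule.applyOrElse(plan, identity[Plan])
    afterRule.copy(children = afterRule.children.map(transformDown(_)(rule)))
  }

  // Top-down resolution that does not descend into analyzed subtrees.
  def resolveOperatorsDown(plan: Plan)(rule: Rule): Plan =
    if (plan.analyzed) plan
    else {
      val afterRule = rule.applyOrElse(plan, identity[Plan])
      afterRule.copy(children = afterRule.children.map(resolveOperatorsDown(_)(rule)))
    }

  object ToyAnalyzer {
    // Hypothetical rule: a leading '?' marks an unresolved operator.
    val resolveNames: Rule = {
      case p if p.name.startsWith("?") => p.copy(name = p.name.drop(1))
    }
    def good(plan: Plan): Plan = resolveOperatorsDown(plan)(resolveNames)
    def bad(plan: Plan): Plan = transformDown(plan)(resolveNames)
  }
}
```

Here `ToyAnalyzer.good` resolves the plan, while `ToyAnalyzer.bad` fails fast with an `IllegalStateException`. Walking the stack on every call is slow, so a check like this would only make sense in test mode, as proposed above.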

Issue Links

is part of

SPARK-20392 Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

Resolved

is related to

SPARK-25051 where clause on dataset gives AnalysisException

Resolved

links to

[Github] Pull Request #21822 (rxin)

[Github] Pull Request #21896 (rxin)

[Github] Pull Request #21962 (gatorsmile)

People

Assignee: Reynold Xin

Reporter: Reynold Xin

Votes: 0

Watchers: 7

Dates

Created: 19/Jul/18 21:48

Updated: 14/Aug/18 09:39

Resolved: 27/Jul/18 06:29