[SPARK-32474] NullAwareAntiJoin multi-column support - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

This is a follow up improvement of Issue ~~SPARK-32290~~.

In ~~SPARK-32290~~, we already optimize NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improve total calculation from O(M*N) to O(M), but it's only targeting on Single Column Case, because it's much more complicate in multi column support.

See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6

FYI, code logical for single and multi column is defined at

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

For supporting multi column, I throw the following idea and see if is it worth to do multi-column support with some trade off. I would need to do some data expansion in HashedRelation, and i would call this new type of HashedRelation as NullAwareHashedRelation.

In NullAwareHashedRelation, key with null column is allowed, which is opposite in LongHashedRelation and UnsafeHashedRelation; And single key might be expanded into 2^N - 1 records, (N refer to columnNum of the key). for example, if there is a record

(1 ,2, 3) is about to insert into NullAwareHashedRelation, we take C(1,3), C(2,3) as a combination to copy origin key row, and setNull at target position, and then insert into NullAwareHashedRelation. including the origin key row, there will be 7 key row inserted as follow.

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)

with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns.

buildSide is empty input => return all rows
allNullColumnKey Exists In buildSide input => reject all rows
if streamedSideRow.allNull is true => drop the row
if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row
if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row

this solution will sure make buildSide data expand to 2^N-1 times, but since it is normally up to 2~3 column in NAAJ in normal production query, i suppose that it's acceptable to expand buildSide data to around 7X. I would also have a limitation of max column support for NAAJ, basically should not more than 3.

Attachments

Issue Links

links to

[Github] Pull Request #29301 (leanken)

Activity

People

Assignee:: Unassigned

Reporter:: Leanken.Lin

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 29/Jul/20 03:59

Updated:: 30/Jul/20 09:54

Resolved:: 30/Jul/20 09:54