[SPARK-8682] Range Join for Spark SQL - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered Cartesian Join) when it has to execute the following range query:

SELECT A.*,
       B.*
FROM   tableA A
       JOIN tableB B
        ON A.start <= B.end
         AND A.end > B.start

This is very inefficient, because every row of one table is compared against every row of the other. When one of the tables is small enough to be broadcast, the performance of this query can be greatly improved by creating a range index. A range index is basically a sorted map containing the rows of the smaller table, indexed by both the high and low keys. Using this structure, the complexity of the query would go from O(N * M) to O(N * 2 * log(M)), i.e. roughly two O(log(M)) index lookups per row of the larger table, where N is the number of records in the larger table and M is the number of records in the smaller (indexed) table.
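To illustrate the idea outside of Spark, here is a minimal sketch in plain Python. It is not the implementation from the pull request. All names (`build_range_index`, `probe`, `range_join`) and the `(start, end, payload)` row layout are assumptions made for this example. For simplicity it binary-searches only on the start key and applies the end-key condition as a filter, so a probe costs O(log(M)) plus the number of candidate rows rather than a strict 2 * log(M).

```python
from bisect import bisect_left


def build_range_index(rows):
    """Index the smaller table: sort (start, end, payload) rows by start key."""
    ordered = sorted(rows, key=lambda r: r[0])
    starts = [r[0] for r in ordered]
    return starts, ordered


def probe(index, a_start, a_end):
    """Return indexed rows b with a_start <= b.end AND a_end > b.start."""
    starts, ordered = index
    # Binary search: rows left of position hi are exactly those with b.start < a_end.
    hi = bisect_left(starts, a_end)
    # The end-key condition is checked per candidate (hypothetical simplification).
    return [b for b in ordered[:hi] if a_start <= b[1]]


def range_join(table_a, table_b):
    """Join the larger table_a against the indexed smaller table_b."""
    index = build_range_index(table_b)
    return [(a, b) for a in table_a for b in probe(index, a[0], a[1])]


def nested_loop_join(table_a, table_b):
    """Reference O(N * M) join using the same predicate as the SQL query."""
    return [
        (a, b)
        for a in table_a
        for b in table_b
        if a[0] <= b[1] and a[1] > b[0]
    ]


if __name__ == "__main__":
    table_a = [(1, 5, "a1"), (10, 12, "a2")]
    table_b = [(0, 1, "b1"), (4, 8, "b2"), (5, 9, "b3"), (12, 20, "b4")]
    for a, b in range_join(table_a, table_b):
        print(a[2], "overlaps", b[2])
```

In a broadcast setting, the index would be built once from the smaller table and shipped to every task, and each row of the larger table would call something like `probe` instead of scanning all M rows.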

I have created a pull request for this. According to the paper "Spark SQL: Relational Data Processing in Spark" (page 11, section 7.2), similar work has already been done by the ADAM project, although I cannot locate the code.

Any comments and/or feedback are greatly appreciated.

Attachments

perf_testing.scala
16/Jul/15 22:23
2 kB
Herman van Hövell

Issue Links

links to

[Github] Pull Request #7379 (hvanhovell)

People

Assignee: Unassigned

Reporter: Herman van Hövell

Shepherd: Michael Armbrust

Votes: 11

Watchers: 26

Dates

Created: 27/Jun/15 21:53

Updated: 25/May/21 01:49

Resolved: 25/May/21 01:42