[ARROW-16389] [C++] Support hash-join on larger than memory datasets - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/31769

Description

The hash-join node currently holds the hash table, the entire build-side input, and the entire probe-side input (i.e. the entire dataset) in memory. As a result, it will run out of memory and crash if the input dataset is larger than the memory available on the system.

By spilling to disk when memory starts to fill up, we can allow the hash-join node to process datasets larger than the memory available on the machine.
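
As a rough illustration of the idea, here is a minimal sketch of a partitioned (Grace-style) hash join that spills both inputs to disk and then joins one partition at a time. It is not the design of the Arrow hash-join node or of the linked pull request. The names `Row`, `JoinedRow`, `SpillToPartitions`, `ForEachSpilledRow` and `SpillingHashJoin`, the spill-file naming scheme, and the fixed partition count are all assumptions made for the example. The sketch also spills unconditionally, whereas the ticket proposes spilling only once memory starts to fill up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row layout for illustration: a join key plus one payload column.
struct Row {
  int64_t key;
  int64_t payload;
};

// One joined output row: the key and the payload from each side.
struct JoinedRow {
  int64_t key;
  int64_t build_payload;
  int64_t probe_payload;
};

// Hypothetical naming scheme for spill files: <prefix>_<side>_<partition>.bin
std::string PartitionPath(const std::string& prefix, const std::string& side, size_t p) {
  return prefix + "_" + side + "_" + std::to_string(p) + ".bin";
}

// Write every row to the spill file of the partition its key hashes to.
// The in-memory vector stands in for batches that would arrive as a stream.
void SpillToPartitions(const std::vector<Row>& rows, const std::string& prefix,
                       const std::string& side, size_t num_partitions) {
  std::vector<std::ofstream> files;
  for (size_t p = 0; p < num_partitions; ++p) {
    files.emplace_back(PartitionPath(prefix, side, p), std::ios::binary);
  }
  for (const Row& row : rows) {
    size_t p = std::hash<int64_t>{}(row.key) % num_partitions;
    files[p].write(reinterpret_cast<const char*>(&row), sizeof(Row));
  }
}  // The streams are flushed and closed when `files` goes out of scope.

// Stream the rows of one spill file through `fn` without loading the whole file.
template <typename Fn>
void ForEachSpilledRow(const std::string& path, Fn&& fn) {
  std::ifstream in(path, std::ios::binary);
  Row row;
  while (in.read(reinterpret_cast<char*>(&row), sizeof(Row))) fn(row);
}

// Inner join on `key`. Only one partition's hash table is resident at a time.
std::vector<JoinedRow> SpillingHashJoin(const std::vector<Row>& build,
                                        const std::vector<Row>& probe,
                                        const std::string& spill_prefix,
                                        size_t num_partitions) {
  assert(num_partitions > 0);
  SpillToPartitions(build, spill_prefix, "build", num_partitions);
  SpillToPartitions(probe, spill_prefix, "probe", num_partitions);

  std::vector<JoinedRow> out;
  for (size_t p = 0; p < num_partitions; ++p) {
    const std::string build_path = PartitionPath(spill_prefix, "build", p);
    const std::string probe_path = PartitionPath(spill_prefix, "probe", p);

    // Build a hash table from this partition's build rows only.
    std::unordered_multimap<int64_t, int64_t> table;
    ForEachSpilledRow(build_path, [&](const Row& r) { table.emplace(r.key, r.payload); });

    // Matching probe rows hashed to the same partition, so probing here is enough.
    ForEachSpilledRow(probe_path, [&](const Row& r) {
      auto range = table.equal_range(r.key);
      for (auto it = range.first; it != range.second; ++it) {
        out.push_back({r.key, it->second, r.payload});
      }
    });

    // Delete the spill files once the partition has been joined.
    std::remove(build_path.c_str());
    std::remove(probe_path.c_str());
  }
  return out;
}
```

Calling `SpillingHashJoin(build, probe, "spill", 8)` writes files such as `spill_build_0.bin` to the working directory and deletes them as each partition finishes. Peak memory is bounded by the largest build-side partition rather than by the whole build side, which is the property the ticket is after. A real implementation would also need to handle skewed keys (a partition that is itself too large) and error checking on the file I/O, both of which this sketch omits.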

Issue Links

supersedes

ARROW-14163 [C++] Naive spillover implementation for join (Closed)

links to

GitHub Pull Request #13669

People

Assignee: Unassigned

Reporter: Weston Pace

Votes: 0

Watchers: 6

Dates

Created: 28/Apr/22 01:57

Updated: 11/Jan/23 11:43

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10h 40m