Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-16389

[C++] Support hash-join on larger than memory datasets

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • None
    • None
    • C++

    Description

      The current implementation of the hash-join node current queues in memory the hashtable, the entire build side input, and the entire probe side input (e.g. the entire dataset). This means the current implementation will run out of memory and crash if the input dataset is larger than the memory on the system.

      By spilling to disk when memory starts to fill up we can allow the hash-join node to process datasets larger than the available memory on the machine.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              westonpace Weston Pace
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 10h 40m
                  10h 40m