  1. Apache Arrow
  2. ARROW-9423 [Rust][DataFusion] Add join
  3. ARROW-9555

[Rust] [DataFusion] Add inner (hash) equijoin physical plan

Details

    Description

      Here is an overview of how I think we should implement support for equijoins, at least for the initial implementation.

      • Read all batches from the left side of the join into a single Vec<RecordBatch>
      • Create a hash map, something like HashMap<Vec<ScalarValue>, Vec<(usize, usize)>>, that maps join-key values to (batch index, row index) pairs
      • Iterate over this Vec<RecordBatch> and, for each row, add an entry to the hash map under that row's join keys, recording the index of the batch and the index of the row within that batch
      • For each input partition on the right side of the join, return an output partition that is an iterator/stream that:
        • For each input row, evaluate the join keys
        • Look up those join keys in the hash map
        • If a match is found:
          • For each matching (batch, row) index, emit an output row that combines the values from the left row and the right row
        • If no match is found:
          • Do not emit a row
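
The steps above can be sketched in plain Rust as follows. This is only an illustration under simplifying assumptions, not the DataFusion implementation: `Scalar` and `Batch` are hypothetical row-oriented stand-ins for `ScalarValue` and `RecordBatch`, the function names `build_hash_table` and `probe_batch` are made up for this sketch, and the probe side returns a materialized `Batch` rather than an iterator/stream.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for ScalarValue, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Scalar {
    Int(i64),
    Str(String),
}

// Hypothetical row-oriented stand-in for RecordBatch: each inner Vec is one row.
type Batch = Vec<Vec<Scalar>>;

// Maps join-key values to (batch index, row index) pairs on the left side.
type JoinTable = HashMap<Vec<Scalar>, Vec<(usize, usize)>>;

/// Build phase: index every left-side row by the values of its join-key columns.
fn build_hash_table(left: &[Batch], left_keys: &[usize]) -> JoinTable {
    let mut table = JoinTable::new();
    for (batch_idx, batch) in left.iter().enumerate() {
        for (row_idx, row) in batch.iter().enumerate() {
            let key: Vec<Scalar> = left_keys.iter().map(|&c| row[c].clone()).collect();
            table.entry(key).or_default().push((batch_idx, row_idx));
        }
    }
    table
}

/// Probe phase for one right-side batch: emit left columns followed by right
/// columns for every match; rows without a match emit nothing (inner join).
fn probe_batch(
    left: &[Batch],
    table: &JoinTable,
    right: &Batch,
    right_keys: &[usize],
) -> Batch {
    let mut out = Vec::new();
    for right_row in right {
        let key: Vec<Scalar> = right_keys.iter().map(|&c| right_row[c].clone()).collect();
        if let Some(matches) = table.get(&key) {
            for &(batch_idx, row_idx) in matches {
                let mut joined = left[batch_idx][row_idx].clone();
                joined.extend(right_row.iter().cloned());
                out.push(joined);
            }
        }
    }
    out
}
```

In this sketch the hash table is built once from the left side and then shared, read-only, by the probe of each right-side partition. A real implementation would operate on columnar arrays rather than cloning rows.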

          People

            jorgecarleitao Jorge Leitão

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 10h