Apache Arrow / ARROW-11030

[Rust] [DataFusion] HashJoinExec slow with many batches

Details

    Description

      Performance of joins slows down dramatically with smaller batches.

      The issue is related to the slow performance of MutableArrayData::new() when it is passed a large number of arrays. The join passes in all of the batches from the build side, and it does so once per build-side join key for each probe-side batch.

      Construction time grows sharply as the number of arrays increases, even though the total number of rows stays the same.
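The cost pattern described above can be sketched with a toy model. This is not the real arrow code: `ToyBuilder`, `make_batches`, and `total_setup_steps` are hypothetical names, and the assumption that the constructor does a fixed amount of setup per source array is made only for illustration. The sketch shows why splitting the same rows into smaller batches multiplies the setup work when a fresh builder is created on every call.

```rust
/// Hypothetical stand-in for a builder whose constructor does a fixed amount
/// of setup per source array. This is NOT the real arrow `MutableArrayData`.
struct ToyBuilder<'a> {
    // One boxed accessor closure per source array, created eagerly in `new`.
    extenders: Vec<Box<dyn Fn(usize) -> i64 + 'a>>,
}

impl<'a> ToyBuilder<'a> {
    fn new(arrays: &'a [Vec<i64>]) -> Self {
        let extenders = arrays
            .iter()
            .map(|arr| Box::new(move |i: usize| arr[i]) as Box<dyn Fn(usize) -> i64 + 'a>)
            .collect();
        ToyBuilder { extenders }
    }

    /// Number of per-array setup steps performed by `new`.
    fn setup_steps(&self) -> usize {
        self.extenders.len()
    }
}

/// Split `rows` values into batches of at most `batch_size` rows.
fn make_batches(rows: usize, batch_size: usize) -> Vec<Vec<i64>> {
    (0..rows as i64)
        .collect::<Vec<_>>()
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}

/// Total setup steps when a fresh builder is constructed `calls` times.
fn total_setup_steps(arrays: &[Vec<i64>], calls: usize) -> usize {
    (0..calls).map(|_| ToyBuilder::new(arrays).setup_steps()).sum()
}

fn main() {
    let rows = 4096; // the row count is the same for every batch size
    for batch_size in [1024, 128, 32] {
        let arrays = make_batches(rows, batch_size);
        let steps = total_setup_steps(&arrays, 10);
        println!(
            "batch size {:>4}: {:>3} arrays, {:>4} setup steps over 10 calls",
            batch_size,
            arrays.len(),
            steps
        );
    }
}
```

In this model the total setup work is the number of constructor calls multiplied by the number of arrays, so it depends on the batch count rather than the row count.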

      I modified hash_join.rs to have this debug code:

      // Temporary timing instrumentation around the MutableArrayData construction.
      let start = Instant::now();
      // Collect the row and array counts before `arrays` is passed to `new()`.
      let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
      let num_arrays = arrays.len();
      let mut mutable = MutableArrayData::new(arrays, true, capacity);
      if num_arrays > 0 {
          debug!(
              "MutableArrayData::new() with {} arrays containing {} rows took {} ms",
              num_arrays,
              row_count,
              start.elapsed().as_millis()
          );
      }

      Batch size 131072:

      MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
      MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
      MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 

      Batch size 16384:

      MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
      MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
      MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 

      Batch size 4096:

      MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
      MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
      MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
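As a rough check on the numbers above, the reported timings can be converted into an average construction cost per array. The helper name `micros_per_array` is made up for this sketch, and the input values are copied from the log lines. Because the logs report whole milliseconds, the 1 ms figures are coarse and the first result is only indicative.

```rust
/// Average construction time per source array, in microseconds.
fn micros_per_array(timings_ms: &[f64], num_arrays: usize) -> f64 {
    let mean_ms = timings_ms.iter().sum::<f64>() / timings_ms.len() as f64;
    mean_ms * 1000.0 / num_arrays as f64
}

fn main() {
    // (batch size, number of arrays, reported timings in ms) from the logs.
    let runs: [(usize, usize, [f64; 3]); 3] = [
        (131072, 4584, [1.0, 1.0, 1.0]),
        (16384, 36624, [19.0, 16.0, 17.0]),
        (4096, 146496, [88.0, 89.0, 88.0]),
    ];
    for (batch_size, num_arrays, timings) in runs.iter() {
        println!(
            "batch size {:>6}: {:.2} us per array",
            batch_size,
            micros_per_array(timings, *num_arrays)
        );
    }
}
```

This works out to roughly 0.2, 0.5, and 0.6 microseconds per array for the three batch sizes, so each call costs more as the array count grows even though the row count is fixed.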

            People

              Dandandan Daniël Heres
              andygrove Andy Grove
