Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-15838

[C++] Key column behavior in joins

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 10.0.0
    • C++

    Description

      By default in dplyr (and possibly in pandas too?) coalesces the key column for full joins to be the (non-null) values from both key columns:

      > left <- tibble::tibble(
        key = c(1, 2),
        A = c(0, 1),                  
      )                      
      left_tab <- Table$create(left)
      > right <- tibble::tibble(
        key = c(2, 3),
        B = c(0, 1),
      )                      
      right_tab <- Table$create(right)
      
      > left %>% full_join(right) 
      Joining, by = "key"
      # A tibble: 3 × 3
          key     A     B
        <dbl> <dbl> <dbl>
      1     1     0    NA
      2     2     1     0
      3     3    NA     1
      
      > left_tab %>% full_join(right_tab) %>% collect()
      # A tibble: 3 × 3
          key     A     B
        <dbl> <dbl> <dbl>
      1     2     1     0
      2     1     0    NA
      3    NA    NA     1
      

      And right join, we would expect the key from the right table to be in the result, but we get the key from the left instead:

      > left <- tibble::tibble(
        key = c(1, 2),
        A = c(0, 1),                  
      )                      
      left_tab <- Table$create(left)
      > right <- tibble::tibble(
        key = c(2, 3),
        B = c(0, 1),
      )                      
      right_tab <- Table$create(right)
      > left %>% right_join(right)
      Joining, by = "key"
      # A tibble: 2 × 3
          key     A     B
        <dbl> <dbl> <dbl>
      1     2     1     0
      2     3    NA     1
      > left_tab %>% right_join(right_tab) %>% collect()
      # A tibble: 2 × 3
          key     A     B
        <dbl> <dbl> <dbl>
      1     2     1     0
      2    NA    NA     1
      

      Additionally, we should be able to keep both key columns with an option (cf https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)

      Attachments

        Issue Links

          Activity

            People

              wjones127 Will Jones
              wjones127 Will Jones
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 20m
                  2h 20m