  1. Apache Arrow
  2. ARROW-16897

[R][C++] Full join on Arrow objects is incorrect


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 8.0.0, 9.0.0
    • Fix Version/s: 10.0.0
    • Component/s: C++, R
    • Linux

    Description

      Hello,

      I am trying to do a full join on a dataset. It returns the correct number of rows, but the contents are wrong: the resulting data.frame is just filled up with NA rows.

      My use case: I want to include the 'full' year range for every factor value:

      library(data.table)
      library(arrow)
      library(dplyr)
      
      year_range <- 2000:2019
      group_n <- 100
      N <- 1000 ## the resulting data should have 100 groups * 20 years
      
      dt <- data.table(value = rnorm(N),
                       group = rep(paste0("g", 1:group_n), length.out = N))
      ## there are only observations for some years in every group
      dt[, year := sample(year_range, size = N / group_n), by = .(group)]
      dt[group == "g1", ]
      
      ## this would be the 'full' data.table
      group_years <- data.table(group = rep(unique(dt$group), each = length(year_range)),
                                year = rep(year_range, times = group_n))
      group_years[group == "g1", ]
      
      write_dataset(dt, path = "parquet_db")
      db <- open_dataset(sources = "parquet_db")
      
      ## full_join using data.table -> expected result
      db_full <- merge(dt, group_years,
                       by = c("group", "year"),
                       all = TRUE)
      setorder(db_full, group, year)
      db_full[group == "g1", ]
      
      ## try to do the full_join with arrow -> incorrect result
      db_full_arrow <- db |>
        full_join(group_years, by = c("group", "year")) |>
        collect() |>
        setDT()
      setorder(db_full_arrow, group, year)
      db_full_arrow[group == "g1", ]
      
      ## or: convert data.table to arrow_table beforehand -> incorrect result
      group_years_arrow <- group_years |>
        as_arrow_table()
      db_full_arrow <- db |>
        full_join(group_years_arrow, by = c("group", "year")) |>
        collect() |>
        setDT()
      setorder(db_full_arrow, group, year)
      db_full_arrow[group == "g1", ]

      The documentation says equality joins are supported, which I assume should also hold for `full_join`?
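
For reference, here is a minimal sketch of what a full equality join on two keys is expected to return, using base R only on toy data. The names `left` and `right` are hypothetical stand-ins for `dt` and `group_years`. The checks at the end are properties that `db_full` satisfies; applying the same checks to `db_full_arrow` might help narrow down where the results diverge.

```r
## Toy stand-ins (hypothetical): `left` plays the role of `dt`,
## `right` the role of `group_years`.
left <- data.frame(group = c("g1", "g1", "g2"),
                   year  = c(2000L, 2002L, 2001L),
                   value = c(0.5, -1.2, 0.3))
right <- expand.grid(group = c("g1", "g2"),
                     year  = 2000:2002,
                     stringsAsFactors = FALSE)

## Full (outer) equality join on both key columns, base R only.
full <- merge(left, right, by = c("group", "year"), all = TRUE)
full <- full[order(full$group, full$year), ]

## Properties the joined result should have in this setup:
stopifnot(
  nrow(full) == nrow(right),             # every group/year pair appears once
  !anyNA(full$group),                    # key columns are never NA
  !anyNA(full$year),
  sum(!is.na(full$value)) == nrow(left)  # all observed values survive
)
print(full)
```

Only `value` should contain NA, namely for the group/year pairs that have no observation in `left`.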

      Thanks for your time and work!

       

      Oliver

            People

              Assignee: westonpace Weston Pace
              Reporter: zauster Oliver Reiter