Apache Arrow / ARROW-14908

[R] join on dataset crashes on Windows

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.0.0
    • Fix Version/s: 7.0.2, 8.0.0
    • Component/s: R
    • Environment: R version 4.0.4

    Description

      library(tidyverse)
      library(arrow)
      
      car_info <- rownames_to_column(mtcars, "car_info") 
      
      cars_arrow_table <- arrow_table(car_info)
      
      other_mtcars_data <- select(car_info, 1) %>% 
        mutate(main_color = sample( c("red", "blue", "white", "black"), size = n(), replace = TRUE)) %>% 
        arrow::arrow_table()
      
      temp <- tempdir()
      par_temp <- file.path(temp, "parquet")
      
      car_info %>% arrow::write_dataset(par_temp)
      cars_arrow <- arrow::open_dataset(par_temp) 
      
      # using arrow tables works ------------------------------------------------------
      cars_arrow_table %>% left_join(other_mtcars_data) %>% count(main_color) %>% collect()
      
      # using open dataset crashes R ------------------------------------------------------------------
      other_mtcars_data %>% 
        left_join(cars_arrow) %>% 
        count(main_color) %>% 
        collect()
      
      # other variations also crash
      cars_arrow %>% 
        left_join(other_mtcars_data) %>% 
        count(main_color) %>% 
        collect()
      
      cars_arrow %>% 
        left_join(other_mtcars_data) %>% 
        group_by(main_color) %>% 
        summarise(n = n()) %>% 
        collect()
      
      # compute() also crashes
      cars_arrow %>% 
        left_join(other_mtcars_data) %>% 
        count(main_color) %>% 
        compute()
      
      # workaround with duckdb ------------------------------------------------------
      ## this works
      cars_duck <- to_duckdb(cars_arrow, auto_disconnect = TRUE)
      other_cars_duck <- to_duckdb(other_mtcars_data, auto_disconnect = TRUE)
          
      cars_duck %>% 
        left_join(other_cars_duck) %>%
        count(main_color) %>%
        collect()
      
      ## this doesn't (not sure whether it is expected to work)
      cars_arrow %>% 
        left_join(other_mtcars_data) %>% 
        to_duckdb() 

            People

              Assignee: Will Jones (wjones127)
              Reporter: Will Jones (wjones127)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 40m