ARROW-14583: [R][C++] Crash when summarizing after filtering to no rows on partitioned data

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0.0
    • Fix Version/s: 6.0.1, 7.0.0
    • Component/s: C++, R
    • Environment: Windows 10, R 4.1.0, up-to-date R packages, and the latest RStudio IDE.

    Description

      Original issue report is below; here's an even more minimal example:

      library(arrow)
      library(dplyr)
      td <- tempfile()
      dir.create(td)
      # if the data is not partitioned, this does not segfault:
      # write_dataset(iris, td)  # swap this in for the line below and it won't segfault
      write_dataset(group_by(iris, Species), td)
      open_dataset(td) %>%
        filter(Species == "tulip") %>%
        group_by(Sepal.Length) %>%
        summarise(n = n()) %>%
        collect()
      
      

      I was trying the new features introduced in the latest arrow package (6.0.2), based on examples from the “New Directions for Apache Arrow” talk.

      The RStudio IDE crashed and the R session was aborted.

      On closer inspection, I found that I had downloaded only two years of data (2018 and 2019), so after the first filter (year == 2015) no rows remain for the later steps to process.
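As a possible stopgap until the crash is fixed, one could check that the requested partition exists on disk before running the query. The sketch below uses base R only; `has_partition()` is a hypothetical helper (not part of arrow), and it assumes the non-Hive directory layout implied by the paths in this report (for example `<root>/2018/01/data.parquet`).

```r
# Hypothetical helper (not part of arrow): TRUE if a top-level partition
# directory named after `value` exists under `root`.
# Assumes directory partitioning such as <root>/2018/01/data.parquet.
has_partition <- function(root, value) {
  dir.exists(file.path(root, as.character(value)))
}

# Demonstration on a throwaway directory tree that mimics that layout.
root <- tempfile("nyc-taxi-")
dir.create(file.path(root, "2018", "01"), recursive = TRUE)
dir.create(file.path(root, "2019", "01"), recursive = TRUE)

has_partition(root, 2018)  # a year that was downloaded
has_partition(root, 2015)  # a year that was not downloaded
```

A pipeline could then be run only when `has_partition()` returns TRUE. This covers only filters on the partition column; it does not help when a filter on an ordinary column such as total_amount removes every row.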

      After some debugging, in which I replaced the collect() call, it turned out that summarize() is the function causing the crash.

       

      as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
                                      partitioning = c("year", "month")) %>%
        filter(total_amount > 100 & year == 2015) %>%
        select(tip_amount, total_amount, passenger_count) %>%
        mutate(tip_pct = tip_amount / total_amount * 100) %>%
        group_by(passenger_count) %>%
        summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
        filter(n > 5000) %>%
        arrange(desc(avg_tip_pct)) %>%
        collect()

       

      I would expect an error message (without the IDE crashing) that can be handled in code.

      An alternative result would be an empty data.frame, as is returned when the Parquet file is read in as a data.frame. I simulated this situation by setting a very high total_amount threshold when filtering. Note: when using an Arrow Table, an error message is generated instead.

       

      library(tidyverse)
      #> Warning: package 'tibble' was built under R version 4.1.1
      #> Warning: package 'tidyr' was built under R version 4.1.1
      #> Warning: package 'readr' was built under R version 4.1.1
      library(arrow)
      #> Warning: package 'arrow' was built under R version 4.1.1
      #> 
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #> 
      #>     timestamp
      
      read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
                   as_data_frame = FALSE) %>%
        # filter(total_amount > 100) %>%
        filter(total_amount > 1e10) %>%
        select(tip_amount, total_amount, passenger_count) %>%
        mutate(tip_pct = tip_amount / total_amount * 100) %>%
        group_by(passenger_count) %>%
        summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
        filter(n > 500) %>%
        arrange(desc(avg_tip_pct)) %>%
        collect()
      
      #> Error: Invalid: Must pass at least one array
      
      
      read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
                   as_data_frame = TRUE) %>%
        # filter(total_amount > 100) %>%
        filter(total_amount > 1e10) %>%
        select(tip_amount, total_amount, passenger_count) %>%
        mutate(tip_pct = tip_amount / total_amount * 100) %>%
        group_by(passenger_count) %>%
        summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
        filter(n > 500) %>%
        arrange(desc(avg_tip_pct)) %>%
        collect()
      
      #> # A tibble: 0 x 3
      #> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
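The error in the Arrow Table case above is an ordinary R condition, so it can be handled in code, which is the behavior requested for the dataset case as well. Below is a minimal base R sketch of such handling; `collect_or_empty()` and `failing_query()` are hypothetical names introduced for illustration, and the failing query is simulated with `stop()` rather than run through arrow. A crash of the R session itself cannot be intercepted this way.

```r
# Hypothetical helper (not part of arrow or dplyr): evaluate `run()` and
# return `empty` instead if it signals an R error.
collect_or_empty <- function(run, empty = data.frame()) {
  tryCatch(
    run(),
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      empty
    }
  )
}

# Stand-in for a pipeline ending in collect() that raises the error
# reported for the Arrow Table case.
failing_query <- function() stop("Invalid: Must pass at least one array")

result <- collect_or_empty(failing_query)
nrow(result)
```

In real use, `run` would be a function wrapping the whole pipeline, for example `function() tbl %>% ... %>% collect()`, and `empty` could be a zero-row data.frame with the expected columns.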
      

            People

              Assignee: David Li (lidavidm)
              Reporter: Zsolt Kegyes-Brassai (kbzsl)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m