Apache Arrow / ARROW-11756

[R] passing a partition as a schema leads to segfaults

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.0
    • Component/s: R

    Description

      The function to open a dataset in R accepts both a schema and a partitioning argument. If one accidentally passes a partitioning as the schema, it looks as if the dataset was read successfully, but operating on the dataset afterwards results in segfaults.

      Though this is an input error, we should add validation that checks that the schema argument is, in fact, a Schema object and errors if it is not, so that someone doesn't find themselves confronted with a segfault later; a sketch of such a check is shown after the reproduction below.

      ### begin setup
      # note: this exact code is already run in test-dataset.R (lines 18-87), so when
      # adding the test to that file you don't need to copy this, but can use the code
      # at the bottom of this chunk in that test if you want.
      library(arrow)
      library(dplyr)
      
      make_temp_dir <- function() {
        path <- tempfile()
        dir.create(path)
        normalizePath(path, winslash = "/")
      }
      
      hive_dir <- make_temp_dir()
      
      first_date <- lubridate::ymd_hms("2015-04-29 03:12:39")
      df1 <- tibble(
        int = 1:10,
        dbl = as.numeric(1:10),
        lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
        chr = letters[1:10],
        fct = factor(LETTERS[1:10]),
        ts = first_date + lubridate::days(1:10)
      )
      
      second_date <- lubridate::ymd_hms("2017-03-09 07:01:02")
      df2 <- tibble(
        int = 101:110,
        dbl = c(as.numeric(51:59), NaN),
        lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
        chr = letters[10:1],
        fct = factor(LETTERS[10:1]),
        ts = second_date + lubridate::days(10:1)
      )
      
      dir.create(file.path(hive_dir, "subdir", "group=1", "other=xxx"), recursive = TRUE)
      dir.create(file.path(hive_dir, "subdir", "group=2", "other=yyy"), recursive = TRUE)
      write_parquet(df1, file.path(hive_dir, "subdir", "group=1", "other=xxx", "file1.parquet"))
      write_parquet(df2, file.path(hive_dir, "subdir", "group=2", "other=yyy", "file2.parquet"))
      
      ### end setup
      
      # This (the correct specification) works just fine
      ds <- open_dataset(hive_dir, partitioning = hive_partition(other = utf8(), group = uint8()))
      ds$schema
      
      # But if you aren't explicit with the argument names, it looks like everything works...
      ds <- open_dataset(hive_dir, hive_partition(other = utf8(), group = uint8()))
      
      # but the dataset is malformed and will segfault when you try to interact with it, for example:
      ds$schema
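
      # Below is a minimal sketch of the kind of guard this could become. The
      # helper name validate_schema_arg and the error message are illustrative
      # assumptions, not the actual arrow implementation.
      validate_schema_arg <- function(schema) {
        # Accept NULL (no schema supplied) or a real Schema object; error early
        # otherwise so the failure surfaces here instead of as a segfault later.
        if (!is.null(schema) && !inherits(schema, "Schema")) {
          stop(
            "`schema` must be an arrow Schema, not an object of class ",
            paste(class(schema), collapse = "/"),
            call. = FALSE
          )
        }
        schema
      }

      # A regression test along these lines could sit next to the existing dataset
      # tests in test-dataset.R; the expected error text is a placeholder assumption.
      library(testthat)
      test_that("passing a partitioning as the schema errors instead of segfaulting", {
        expect_error(
          open_dataset(hive_dir, hive_partition(other = utf8(), group = uint8())),
          "Schema"
        )
      })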
      


    People

    Assignee: pachamaltese (Mauricio 'Pachá' Vargas Sepúlveda)
    Reporter: jonkeane (Jonathan Keane)
    Votes: 0
    Watchers: 3


    Time Tracking

    Estimated: Not Specified
    Remaining: 0h
    Logged: 2.5h
