Apache Arrow / ARROW-7740

[C++] Array internals corruption in StructArray::Flatten

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: C++

    Description

      Reading a nested ndjson file with arrow::read_json_arrow and the default `as_data_frame=TRUE` crashes the R session immediately. Switching to `as_data_frame=FALSE` works, and the schema of the resulting Arrow object is correct.

      library(tidyr)
      library(arrow)
      library(jsonlite)
      # Create two test datasets: long_df and a variant that nests long_df into
      # a dataframe with a list-column 'nest_level1' containing a dataframe
      long_df <- tidyr::expand_grid(ABC = LETTERS[1:3], xyz = letters[24:26], num = 1:3)
      long_df[["ftr1"]] <- runif(nrow(long_df))
      long_df[["ftr2"]] <- rpois(nrow(long_df), 100)
      nested_frame_level1 <- tidyr::nest(long_df, nest_level1 = c(num, ftr1, ftr2))
      # Write and validate nested ndjson
      jsonlite::stream_out(nested_frame_level1, con = file("nested_frame_level1.json"))
      readLines("nested_frame_level1.json", n = 2) # check we have valid ndjson here
      # This does not cause a session crash
      nested_arrow <- arrow::read_json_arrow(file = "nested_frame_level1.json", as_data_frame = FALSE)
      nested_arrow$schema # correctly interprets `nest_level1` as `list<item: struct<num: int64, ftr1: double, ftr2: int64>>`
      # This causes a session crash
      nested_df <- arrow::read_json_arrow(file = "nested_frame_level1.json", as_data_frame = TRUE)
       
      

      The R package version of Arrow is the latest CRAN release (arrow * 0.15.1.1, 2019-11-05, CRAN (R 3.5.2)). I'm running this code in a slightly older R version (3.5.1) on macOS 10.14.6, x86_64, darwin15.6.0, via RStudio 1.2.5001.

      [edit: formatting fix]

            People

              Assignee: Francois Saint-Jacques (fsaintjacques)
              Reporter: John Sheffield (jms)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m