Apache Arrow / ARROW-14743

[C++] Error reading in dataset when partitioning variable in schema


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0.0
    • Component/s: C++

    Description

      If partitioned data is read back in with a schema that contains the partitioning variable, an error is raised (see below). The error occurs whether or not the partitioning argument is specified. I think this is happening at the C++ level rather than the R level, though I'm not certain.

      library(arrow)
      library(dplyr)
      
      data(diamonds, package='ggplot2')
      write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')
      
      diamond_schema <- schema(
          carat=float64(),
          cut=string(),
          color=string(),
          clarity=string(),
          depth=float64(),
          table=float64(),
          price=float64(),
          x=float64(),
          y=float64(),
          z=float64(),
      )
      
      open_dataset('diamonds', format='csv', schema=diamond_schema, partitioning = "cut") %>%
        collect()
      
      # Error: Invalid: Could not open CSV input source '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error: Row #1: Expected 10 columns, got 9: "carat","color","clarity","depth","table","price","x","y","z"
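
The "Expected 10 columns, got 9" part of the error can be illustrated without arrow. The base R sketch below only compares the field names from the schema above with the header row quoted in the error message. It makes no claim about how Arrow's C++ dataset layer handles this internally, and the variable names are hypothetical.

```r
# Field names from the schema in this report (includes the partition column).
schema_names <- c("carat", "cut", "color", "clarity", "depth",
                  "table", "price", "x", "y", "z")

# Header row quoted in the error message for cut=Fair/part-0.csv.
file_header <- c("carat", "color", "clarity", "depth",
                 "table", "price", "x", "y", "z")

# Column stored in the directory name (cut=Fair) rather than in the file.
partition_cols <- "cut"

# The schema describes one more column than each CSV file contains.
n_schema <- length(schema_names)
n_file <- length(file_header)

# Dropping the partition column from the schema names gives the file header.
file_level_names <- setdiff(schema_names, partition_cols)
header_matches <- identical(file_level_names, file_header)
```

Here n_schema and n_file correspond to the two counts in the error message, and header_matches checks that the only difference between the schema and the file header is the partitioning variable.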
      
      

            People

              Assignee: Neal Richardson (npr)
              Reporter: Nicola Crane (thisisnic)
              Votes: 0
              Watchers: 5
