[ARROW-14743] [C++] Error reading in dataset when partitioning variable in schema - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.0.0
Component/s: C++
Labels:
- dataset

External issue URL:
https://github.com/apache/arrow/issues/30280

Description

If partitioned data is read back in with a schema that contains the partitioning variable, an error occurs (see below). The error occurs whether or not the partitioning argument is specified. I think this is happening at the C++ level rather than the R level, though I'm a little unsure.

library(arrow)
library(dplyr)

data(diamonds, package='ggplot2')
write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')

diamond_schema <- schema(
    carat=float64(),
    cut=string(),
    color=string(),
    clarity=string(),
    depth=float64(),
    table=float64(),
    price=float64(),
    x=float64(),
    y=float64(),
    z=float64()
)

open_dataset('diamonds', format='csv', schema=diamond_schema, partitioning = "cut") %>%
  collect()

# Error: Invalid: Could not open CSV input source '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error: Row #1: Expected 10 columns, got 9: "carat","color","clarity","depth","table","price","x","y","z"
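
The mismatch in the error message can be checked by hand. The following is a minimal base R sketch (it does not call arrow) that compares the field names declared in diamond_schema with the header row quoted in the error, and then looks at the key=value directory segment of the failing file's path. The helper hive_keys is a hypothetical name introduced only for this illustration, and the assumption that the writer leaves the partition column out of each file is inferred from the quoted header, not confirmed from the Arrow source.

```r
# Column names declared in the schema from the report (10 fields).
schema_cols <- c("carat", "cut", "color", "clarity", "depth",
                 "table", "price", "x", "y", "z")

# Header row quoted in the error message (9 fields).
file_header <- c("carat", "color", "clarity", "depth",
                 "table", "price", "x", "y", "z")

# Path of the file named in the error message.
file_path <- "/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv"

# Hypothetical helper (not part of arrow): return the keys of any
# key=value directory segments found in a file path.
hive_keys <- function(path) {
  segments <- strsplit(dirname(path), "/", fixed = TRUE)[[1]]
  pairs <- segments[grepl("=", segments, fixed = TRUE)]
  sub("=.*$", "", pairs)
}

# Fields the schema expects but the file itself does not contain.
missing_from_file <- setdiff(schema_cols, file_header)

# Keys that are encoded in the directory name instead of the file.
partition_keys <- hive_keys(file_path)

print(missing_from_file)
print(identical(missing_from_file, partition_keys))
```

If the two values agree, the only field missing from the file is the one stored in the directory name, which would account for the reader expecting 10 columns and finding 9.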

Issue Links

is fixed by: ARROW-10485 [R] Accept partitioning in open_dataset when file paths are hive-style (Resolved)

People

Assignee: Neal Richardson
Reporter: Nicola Crane
Votes: 0
Watchers: 5

Dates

Created: 17/Nov/21 22:00
Updated: 11/Jan/23 08:42
Resolved: 13/Jan/22 22:36