[ARROW-16783] [R] write_dataset fails with an uninformative message when duplicated column names - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 8.0.0
Fix Version/s: 9.0.0
Component/s: R
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/32116

Description

write_dataset() fails when the object being written has duplicated column names. This is probably reasonable behaviour, but the error message is misleading:

library(arrow, warn.conflicts = FALSE)

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3, 
  x = 4:6,
  check.names = FALSE
)

write_dataset(df, "df")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"

write_dataset() calls as_adq() inside tryCatch(), so any error raised by as_adq() is swallowed and replaced with an error about the class of the object, even when the class itself is supported.

The real error comes from here:

arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
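
The masking pattern can be reproduced without arrow. The following is a minimal base R sketch, not the actual arrow source: convert_strict() and write_masked() are hypothetical stand-ins for as_adq() and write_dataset(), and they are assumed only to share the shape described above (a conversion step wrapped in a tryCatch() whose handler reports the class).

```r
# Hypothetical stand-in for as_adq(): rejects duplicated column names.
convert_strict <- function(x) {
  if (!is.data.frame(x)) {
    stop("unsupported class")
  }
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0) {
    stop("Duplicated field names: ", paste(dups, collapse = ", "))
  }
  x
}

# Hypothetical stand-in for write_dataset(): the handler discards the
# original condition and always reports the class of the input.
write_masked <- function(x) {
  tryCatch(
    convert_strict(x),
    error = function(e) {
      stop(
        "'dataset' must be a data.frame, not \"", class(x)[[1]], "\"",
        call. = FALSE
      )
    }
  )
}

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3,
  x = 4:6,
  check.names = FALSE
)

# Capture both messages instead of aborting the script.
masked_msg <- tryCatch(write_masked(df), error = conditionMessage)
real_msg <- tryCatch(convert_strict(df), error = conditionMessage)
```

Here real_msg names the duplicated column, while masked_msg only complains about the class, which mirrors the confusing report shown above.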

I'm not sure what your preferred fix is here... two options that come to mind are:

1. Explicitly check for compatible classes before calling as_adq() instead of using tryCatch(), allowing as_adq() to emit its own errors.

2. Check for duplicate column names before the tryCatch() block.

I think option 1 is better, because option 2 would check for duplicates twice (once inside write_dataset() and once again inside as_adq()).
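
A rough sketch of what option 1 could look like, in base R only. This is an illustration under assumptions, not the fix that was merged: convert_strict() and write_checked() are hypothetical names, and the vector of supported classes is simply copied from the error message above rather than taken from the arrow source.

```r
# Hypothetical stand-in for as_adq(): raises its own, specific error.
convert_strict <- function(x) {
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0) {
    stop("Duplicated field names: ", paste(dups, collapse = ", "))
  }
  x
}

# Option 1: validate the class up front, then call the converter
# directly so that its errors reach the caller unchanged.
write_checked <- function(x) {
  # Assumption: class names taken from the error message in this report.
  supported <- c(
    "Dataset", "RecordBatch", "Table", "arrow_dplyr_query", "data.frame"
  )
  if (!inherits(x, supported)) {
    stop(
      "'dataset' must be one of ", paste(supported, collapse = ", "),
      ", not \"", class(x)[[1]], "\"",
      call. = FALSE
    )
  }
  convert_strict(x)
}

df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3,
  x = 4:6,
  check.names = FALSE
)

# A supported class with duplicated names surfaces the converter's error.
dup_msg <- tryCatch(write_checked(df), error = conditionMessage)

# An unsupported class still gets the class error.
class_msg <- tryCatch(write_checked(1:3), error = conditionMessage)
```

With this shape the class error appears only for genuinely unsupported inputs, and the duplicate check lives in one place.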

I'm happy to work on a fix if you like!

Issue Links

links to

GitHub Pull Request #13336

People

Assignee: Unassigned

Reporter: Andy Teucher

Votes: 0

Watchers: 3

Dates

Created: 07/Jun/22 20:27

Updated: 11/Jan/23 11:46

Resolved: 28/Jun/22 19:01
