Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
8.0.0
Description
write_dataset() fails when the object being written has duplicated column names. This is probably reasonable behaviour, but the error message is misleading:
    library(arrow, warn.conflicts = FALSE)

    df <- data.frame(
      id = c("a", "b", "c"),
      x = 1:3,
      x = 4:6,
      check.names = FALSE
    )

    write_dataset(df, "df")
    #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"
write_dataset() calls as_adq() inside a tryCatch() call, so any error raised by as_adq() is swallowed and replaced with an error about the class of the object.
The real error comes from here:
    arrow:::as_adq(df)
    #> Error in `arrow_dplyr_query()`:
    #> ! Duplicated field names
    #> ✖ The following field names were found more than once in the data: "x"
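To make the masking easier to see without arrow installed, here is a minimal base-R sketch of the pattern described above. The names convert_input() and write_masked() are hypothetical stand-ins for as_adq() and write_dataset(); this is not the arrow source, only an illustration of how a catch-all error handler can replace a useful message with a misleading one.

```r
# Hypothetical stand-in for as_adq(): rejects duplicated column names.
convert_input <- function(x) {
  dupes <- unique(names(x)[duplicated(names(x))])
  if (length(dupes) > 0) {
    stop("Duplicated field names: ", paste(dupes, collapse = ", "), call. = FALSE)
  }
  x
}

# Hypothetical stand-in for the pattern described for write_dataset():
# every error from the conversion is replaced by a complaint about the class.
write_masked <- function(dataset) {
  tryCatch(
    convert_input(dataset),
    error = function(e) {
      stop(
        "'dataset' must be a data.frame, not \"", class(dataset)[[1]], "\"",
        call. = FALSE
      )
    }
  )
}

df <- data.frame(id = c("a", "b", "c"), x = 1:3, x = 4:6, check.names = FALSE)

# Capture the messages instead of aborting the script.
masked_msg <- tryCatch(write_masked(df), error = conditionMessage)
real_msg <- tryCatch(convert_input(df), error = conditionMessage)
```

Here masked_msg complains about the class of a perfectly valid data.frame, while real_msg names the duplicated column, which is the information the user actually needs.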
I'm not sure what your preferred fix is here... two options that come to mind are:
1. Explicitly check for compatible classes before calling as_adq() instead of using tryCatch(), allowing `as_adq()` to emit its own errors.
OR
2. Check for duplicate column names before the tryCatch() block.
My thought is that option 1 is better, as option 2 means that checking for duplicates would happen twice (once inside write_dataset() and once again inside as_adq()).
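As a rough illustration of option 1, the sketch below checks the class explicitly and then lets the conversion raise its own errors. It uses base R only; convert_input() and write_checked() are hypothetical names standing in for as_adq() and write_dataset(), and the list of supported classes is copied from the error message above rather than from the arrow source.

```r
# Hypothetical stand-in for as_adq(): rejects duplicated column names.
convert_input <- function(x) {
  dupes <- unique(names(x)[duplicated(names(x))])
  if (length(dupes) > 0) {
    stop("Duplicated field names: ", paste(dupes, collapse = ", "), call. = FALSE)
  }
  x
}

# Option 1 (sketch): validate the class up front, with no tryCatch(),
# so that errors from the conversion step reach the user unchanged.
write_checked <- function(dataset) {
  supported <- c("Dataset", "RecordBatch", "Table", "arrow_dplyr_query", "data.frame")
  if (!inherits(dataset, supported)) {
    stop(
      "'dataset' must be one of ", paste(supported, collapse = ", "),
      ", not \"", class(dataset)[[1]], "\"",
      call. = FALSE
    )
  }
  convert_input(dataset)
}

df <- data.frame(id = c("a", "b", "c"), x = 1:3, x = 4:6, check.names = FALSE)

# Capture the messages instead of aborting the script.
dup_msg <- tryCatch(write_checked(df), error = conditionMessage)
class_msg <- tryCatch(write_checked(1:3), error = conditionMessage)
```

With this shape, the class error only appears for genuinely unsupported inputs, and the duplicate check lives in one place.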
I'm happy to work on a fix if you like!