Apache Arrow / ARROW-16783

[R] write_dataset() fails with an uninformative error message when column names are duplicated

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.0.0
    • Fix Version/s: 9.0.0
    • Component/s: R

    Description

      write_dataset() fails when the object being written has duplicated column names. This is probably reasonable behaviour, but the error message is misleading:

      library(arrow, warn.conflicts = FALSE)
      
      df <- data.frame(
        id = c("a", "b", "c"),
        x = 1:3, 
        x = 4:6,
        check.names = FALSE
      )
      
      write_dataset(df, "df")
      #> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"
      

      write_dataset() calls as_adq() inside a tryCatch() statement, so any error from as_adq() is swallowed and the error emitted is about the class of the object.

      The real error comes from here:

      arrow:::as_adq(df)
      #> Error in `arrow_dplyr_query()`:
      #> ! Duplicated field names
      #> ✖ The following field names were found more than once in the data: "x"
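
      The masking can be reproduced without arrow. The following is a minimal base R sketch, not the package's actual code: convert_input() and write_thing() are hypothetical stand-ins for as_adq() and write_dataset(), and the messages are simplified.

```r
# Hypothetical stand-in for as_adq(): fails on duplicated column names.
convert_input <- function(x) {
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0) {
    stop("Duplicated field names: ", paste(dups, collapse = ", "), call. = FALSE)
  }
  x
}

# Hypothetical stand-in for write_dataset(): any error raised by the
# conversion is replaced with a generic message about the input's class.
write_thing <- function(x) {
  tryCatch(
    convert_input(x),
    error = function(e) {
      stop("'dataset' must be a data.frame, not \"", class(x)[[1]], "\"", call. = FALSE)
    }
  )
  invisible(TRUE)
}

df <- data.frame(id = c("a", "b", "c"), x = 1:3, x = 4:6, check.names = FALSE)

# Capture the messages instead of aborting the script.
masked <- tryCatch(write_thing(df), error = function(e) conditionMessage(e))
real <- tryCatch(convert_input(df), error = function(e) conditionMessage(e))
print(masked)  # the class-related message the user sees
print(real)    # the duplicate-name message that was swallowed
```

      The handler never inspects the condition it caught, so every failure inside the conversion is reported as a class problem.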
      

      I'm not sure what your preferred fix is here... two options that come to mind are:

      1. Explicitly check for compatible classes before calling as_adq() instead of using tryCatch(), allowing as_adq() to emit its own errors.

      OR

      2. Check for duplicate column names before the tryCatch() block.

      My thought is that option 1 is better, as option 2 means that checking for duplicates would happen twice (once inside write_dataset() and once again inside as_adq()).
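
      Option 1 could look roughly like the sketch below. It is an illustration under assumptions rather than a proposed patch: convert_input() and write_thing() are hypothetical stand-ins for as_adq() and write_dataset(), and the list of supported classes is copied from the error message above.

```r
# Hypothetical stand-in for as_adq(): fails on duplicated column names.
convert_input <- function(x) {
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0) {
    stop("Duplicated field names: ", paste(dups, collapse = ", "), call. = FALSE)
  }
  x
}

# Option 1 sketch: validate the class up front, then call the converter
# without tryCatch() so that its own errors reach the user unchanged.
write_thing <- function(x) {
  supported <- c("Dataset", "RecordBatch", "Table", "arrow_dplyr_query", "data.frame")
  if (!inherits(x, supported)) {
    stop(
      "'dataset' must be a ", paste(supported, collapse = ", "),
      ", not \"", class(x)[[1]], "\"",
      call. = FALSE
    )
  }
  convert_input(x)
  invisible(TRUE)
}

df <- data.frame(id = c("a", "b", "c"), x = 1:3, x = 4:6, check.names = FALSE)

# Capture the messages instead of aborting the script.
dup_msg <- tryCatch(write_thing(df), error = function(e) conditionMessage(e))
class_msg <- tryCatch(write_thing(1:3), error = function(e) conditionMessage(e))
print(dup_msg)    # the duplicate-name error now propagates
print(class_msg)  # unsupported inputs still get the class message
```

      With this shape the duplicate check lives in one place (the converter), which is the advantage claimed for option 1 over option 2.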

      I'm happy to work on a fix if you like!

            People

              Assignee: Unassigned
              Reporter: Andy Teucher (ateucher)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h