  1. Apache Arrow
  2. ARROW-12083

[R] schema use in open_dataset

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 5.0.0
    • Component/s: R
    • Windows

    Description

      I have a directory of split .csv files that I'm importing with open_dataset(). One column is inferred as int64 in some files (e.g. -2) and as string in others (e.g. 1986CD), and this throws an error when unify_schemas = T:

      arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', unify_schemas = T)

      Error: Invalid: Unable to merge: Field SEIFACalcMethod has incompatible types: int64 vs string

      If I use the schema parameter and specify only this column, then only this column is imported:

      arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', schema = schema(SEIFACalcMethod = string()))

      FileSystemDataset with 45 csv files
      SEIFACalcMethod: string

      I was expecting that I could set the class of a select few columns, while the rest would be imported as-is, similar to the readr::read_csv(col_types = cols()) approach.

      Not sure if this is expected behaviour, a bug, or a possible avenue for improvement. I've tagged this as the last of these.
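
One possible workaround for the behaviour described above, offered as a sketch rather than the fix adopted for this issue: infer the full schema from a single file, override only the problem column, and pass the completed schema back to open_dataset(). The sample files, the `id` column, and the variable names below are hypothetical stand-ins for the real data. The `skip = 1` argument assumes a recent arrow release, where supplying `schema` means the header row must be skipped explicitly; older releases (including the 3.0.0 reported here) may not accept or need it.

```r
library(arrow)

# Hypothetical sample data standing in for the split .csv files:
# the same column holds integers in one file and strings in the other.
csv_dir <- tempfile("split-csvs-")
dir.create(csv_dir)
writeLines(c("id,SEIFACalcMethod", "1,-2", "2,-2"),
           file.path(csv_dir, "part-1.csv"))
writeLines(c("id,SEIFACalcMethod", "3,1986CD"),
           file.path(csv_dir, "part-2.csv"))

# Let arrow infer every column type from one representative file.
first_file <- sort(list.files(csv_dir, full.names = TRUE))[1]
inferred <- open_dataset(first_file, format = "csv")$schema

# Copy the inferred types into a named list, then override one column.
col_types <- lapply(inferred$fields, function(f) f$type)
names(col_types) <- inferred$names
col_types[["SEIFACalcMethod"]] <- string()
full_schema <- do.call(arrow::schema, col_types)

# Open the whole directory with the completed schema.
# Assumption: skip = 1 is required to drop the header row when a schema
# is supplied (recent arrow releases); adjust for older versions.
ds <- open_dataset(csv_dir, format = "csv", schema = full_schema, skip = 1)

# Materialise the dataset to check that all files were read.
tbl <- Scanner$create(ds)$ToTable()
print(tbl$schema)
```

This keeps the inferred types for every other column. The limitation is that inference is based on one file only, so any other column whose type differs between files would need the same kind of override.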

       

            People

              Assignee: David Li (lidavidm)
              Reporter: Shaun Nielsen (Shaunson26)
              Votes: 0
              Watchers: 7

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m