Apache Arrow / ARROW-14705

[C++] unify_schemas can't handle int64 + double, affects CSV dataset

Details

    Description

      A Twitter question, "how can I make arrow's csv reader not make int64 for integers", turned out to originate from the following scenario: some CSVs in a directory have only integer values in a column while others have decimals in that same column, so the files can't be used together in a dataset.

      library(arrow, warn.conflicts = FALSE)
      library(dplyr, warn.conflicts = FALSE)
      
      ds_dir <- tempfile()
      dir.create(ds_dir)
      cat("a\n1", file = file.path(ds_dir, "1.csv"))
      cat("a\n1.1", file = file.path(ds_dir, "2.csv"))
      
      ds <- open_dataset(ds_dir, format = "csv")
      ds
      #> FileSystemDataset with 2 csv files
      #> a: int64
      
      ## It just picked the schema of the first file
      collect(ds)
      #> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
      #> ../src/arrow/csv/converter.cc:492  decoder_.Decode(data, size, quoted, &value)
      #> ../src/arrow/csv/parser.h:123  status
      #> ../src/arrow/csv/converter.cc:496  parser.VisitColumn(col_index, visit)
      #> ../src/arrow/csv/reader.cc:462  internal::UnwrapOrRaise(maybe_decoded_arrays)
      #> ../src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
      #> ../src/arrow/record_batch.cc:318  ReadNext(&batch)
      #> ../src/arrow/record_batch.cc:329  ReadAll(&batches)
      
      ## Let's try again and tell it to unify schemas. Should result in a float64 type
      ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
      #> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
      #> ../src/arrow/type.cc:1621  fields_[i]->MergeWith(field)
      #> ../src/arrow/type.cc:1684  AddField(field)
      #> ../src/arrow/type.cc:1755  builder.AddSchema(schema)
      #> ../src/arrow/dataset/discovery.cc:251  Inspect(options.inspect_options)
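
      To make the expected behaviour concrete, here is a minimal sketch of the kind of promotion rule that would let the merge succeed. It is not Arrow's implementation or API: `ColType`, `MergeTypes` and the `allow_promotion` flag are hypothetical names invented for illustration, and the rule (any mix of int64 and double widens to double) is an assumption about what a fix could do. Note that widening int64 to double loses precision for integers above 2^53, which is one reason such promotion would probably need to be opt-in.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified stand-in for a column type (NOT arrow::DataType).
enum class ColType { Int64, Double, String };

// True for the two numeric types this sketch knows about.
inline bool IsNumeric(ColType t) {
  return t == ColType::Int64 || t == ColType::Double;
}

// Merge the types seen for the same field in two files.
// Returns the unified type, or std::nullopt when the types are incompatible.
// With allow_promotion == false this mirrors the strict behaviour reported
// above ("incompatible types: int64 vs double").
inline std::optional<ColType> MergeTypes(ColType a, ColType b,
                                         bool allow_promotion) {
  if (a == b) return a;                       // identical types always merge
  if (!allow_promotion) return std::nullopt;  // strict mode: refuse to widen
  if (IsNumeric(a) && IsNumeric(b)) {
    return ColType::Double;                   // int64 + double widens to double
  }
  return std::nullopt;                        // e.g. int64 + string stays an error
}
```

      With this rule, merging the schemas of 1.csv (a: int64) and 2.csv (a: double) would give a: double when promotion is allowed, and would still fail in strict mode.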
      

            People

              Assignee: Unassigned
              Reporter: Neal Richardson (npr)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m