Details
Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Description
A question on Twitter, "how can I make arrow's csv reader not make int64 for integers", turned out to originate from a scenario where some CSVs in a directory have only integer values in a column while others have decimals in that same column, so the files cannot be used together in a dataset.
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds_dir <- tempfile()
dir.create(ds_dir)

cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

ds <- open_dataset(ds_dir, format = "csv")
ds
#> FileSystemDataset with 2 csv files
#> a: int64

## It just picked the schema of the first file
collect(ds)
#> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
#> ../src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, quoted, &value)
#> ../src/arrow/csv/parser.h:123 status
#> ../src/arrow/csv/converter.cc:496 parser.VisitColumn(col_index, visit)
#> ../src/arrow/csv/reader.cc:462 internal::UnwrapOrRaise(maybe_decoded_arrays)
#> ../src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next()
#> ../src/arrow/record_batch.cc:318 ReadNext(&batch)
#> ../src/arrow/record_batch.cc:329 ReadAll(&batches)

## Let's try again and tell it to unify schemas. Should result in a float64 type
ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
#> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
#> ../src/arrow/type.cc:1621 fields_[i]->MergeWith(field)
#> ../src/arrow/type.cc:1684 AddField(field)
#> ../src/arrow/type.cc:1755 builder.AddSchema(schema)
#> ../src/arrow/dataset/discovery.cc:251 Inspect(options.inspect_options)
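Until schema unification can coalesce int64 and double, one possible workaround is to bypass per-file type inference by declaring the column type up front. The sketch below is not taken from this issue. It assumes an arrow release whose CSV dataset format accepts the readr-style `skip` argument (the Arrow-style name for the same option is `skip_rows`), and it reuses the two files from the reprex above.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Same two files as in the reprex: integers in one, decimals in the other.
ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

# Declare the column type explicitly instead of relying on inference.
# An explicit schema also supplies the column names, so the header row
# of each file has to be skipped (assumption: `skip` is accepted here;
# some releases may expect `skip_rows` instead).
ds <- open_dataset(
  ds_dir,
  format = "csv",
  schema = schema(a = float64()),
  skip = 1
)

# Both files are now read as double, so collecting should no longer hit
# the int64 conversion error.
result <- collect(ds)
```

The trade-off is that every column has to be listed in the schema by hand, which is why automatic coalescing during `unify_schemas = TRUE` would still be the more convenient fix.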
Issue Links
- duplicates ARROW-14695 [C++] allow unify schema to coalesce int64 and float64 (Closed)
- relates to ARROW-14528 [R] Add option to attempt 32-bit integer type inference in CSV reader (Open)
- links to