  1. Apache Arrow
  2. ARROW-18352 [R] Datasets API interface improvements
  3. ARROW-18200

[R] Misleading error message if opening CSV dataset with invalid file in directory

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      I mistakenly assumed that a dataset directory contained CSV files when they were in fact Parquet files, and the error message I got was very unhelpful:

      library(arrow)
      
      download.file(
        url = "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
        destfile = here::here("data/nyc-taxi-tiny.zip")
      )
      # Unzip the archive into the data directory, leaving the zip file in place
      unzip(here::here("data/nyc-taxi-tiny.zip"), exdir = here::here("data"))
      
      open_dataset("data", format = "csv")
      
      Error in nchar(x) : invalid multibyte string, element 1
      In addition: Warning message:
      In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) :
        input string 1 is invalid in this locale
      

      Note that this only occurs with format = "csv". Omitting this argument (i.e. using the default of format = "parquet") gives a much better error:

      Error in `open_dataset()`:
      ! Invalid: Error creating dataset. Could not read schema from '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
      /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, scan_options). Is this a 'parquet' file?
      /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44  InspectSchemas(std::move(options))
      /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265  Inspect(options.inspect_options)
      ℹ Did you mean to specify a 'format' other than the default (parquet)?
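
Until the CSV error message is improved, one possible workaround is to check which files actually match the intended format before opening the dataset. The sketch below is only an illustration: `list_files_with_ext()` is a hypothetical helper, not part of arrow, and it assumes that file extensions reflect the real file format. It relies on `open_dataset()` accepting a character vector of file paths.

```r
# Hypothetical helper (not part of arrow): list only the files under `dir`
# whose extension matches the format we intend to read.
list_files_with_ext <- function(dir, ext) {
  pattern <- paste0("\\.", ext, "$")
  list.files(dir, pattern = pattern, recursive = TRUE,
             full.names = TRUE, ignore.case = TRUE)
}

csv_files <- list_files_with_ext("data", "csv")

if (length(csv_files) == 0) {
  # Nothing matched, so the directory probably holds another format.
  message("No .csv files found under 'data'; is the format really CSV?")
} else if (requireNamespace("arrow", quietly = TRUE)) {
  # Pass the explicit file list so stray files (e.g. the zip) are skipped.
  ds <- arrow::open_dataset(csv_files, format = "csv")
}
```

With the directory from the example above, which holds Parquet files and the zip archive, this would print the message rather than reach the misleading `nchar(x)` error.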
      

          People

            Assignee: Unassigned
            Reporter: Nicola Crane (thisisnic)