Apache Arrow / ARROW-12791

[R] Better error handling for DatasetFactory$Finish() when no format specified

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0.0
    • Component/s: R

    Description

      When I call the following code:

       

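      library(arrow)

      # Write two CSV files into a temporary directory
      tf <- tempfile()
      dir.create(tf)
      on.exit(unlink(tf, recursive = TRUE))  # recursive = TRUE is needed to remove a directory
      write_csv_arrow(mtcars[1:5, ], file.path(tf, "file1.csv"))
      write_csv_arrow(mtcars[6:11, ], file.path(tf, "file2.csv"))
      # No format is given, so open_dataset() falls back to its default of "parquet"
      ds <- open_dataset(c(file.path(tf, "file1.csv"), file.path(tf, "file2.csv")))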
      

      I get the following error: 

       Error: IOError: Could not open parquet input source '/tmp/RtmpSug6P8/file714931976ac54/file1.csv': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
      

      However, the documentation for open_dataset() does not say that the input source must be a Parquet file or cannot be a CSV.

      I think this is due to calling DatasetFactory$Finish() when schema is NULL and the input files have no inherent schema (i.e. they are CSVs).
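
A possible workaround, sketched here rather than taken from the eventual fix: since the `format` argument of open_dataset() defaults to "parquet", passing `format = "csv"` explicitly should let the same files open as a dataset. The temporary directory and the `files` and `n_rows` names are illustrative only, and the sketch assumes an arrow version that provides write_csv_arrow().

```r
library(arrow)

# Illustrative scratch directory; any writable location would do
tf <- tempfile()
dir.create(tf)

write_csv_arrow(mtcars[1:5, ], file.path(tf, "file1.csv"))
write_csv_arrow(mtcars[6:11, ], file.path(tf, "file2.csv"))

files <- file.path(tf, c("file1.csv", "file2.csv"))

# Stating the format explicitly avoids the "parquet" default,
# so the files are read with the CSV reader instead
ds <- open_dataset(files, format = "csv")

# Count the rows while the files still exist, then clean up
n_rows <- nrow(ds)
unlink(tf, recursive = TRUE)
```

This only sidesteps the problem; the improvement requested here is a clearer error message when the format is left unspecified.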

            People

              Assignee: Nicola Crane (thisisnic)
              Reporter: Nicola Crane (thisisnic)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h