Apache Arrow / ARROW-8987

[C++][Python] Make reading functions raise consistent exceptions

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.17.1
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None

    Description

      Reading functions such as dataset.dataset and the read_table / read_csv functions in the feather, parquet, and csv modules raise different exceptions when given an empty file or a missing file. See the table below.

      Ideally, every reading function would raise FileNotFoundError when the file is missing and ArrowInvalid when the file is empty.

      The most interesting case is dataset.dataset, where the format parameter changes which exception is raised for an empty file.

       

      Function                    Missing file        Empty file
      --------------------------  ------------------  ------------
      feather.read_table          FileNotFoundError   ArrowInvalid
      parquet.read_table          OSError             ArrowInvalid
      csv.read_csv                FileNotFoundError   ArrowInvalid
      dataset.dataset "feather"   FileNotFoundError   ArrowInvalid
      dataset.dataset "parquet"   FileNotFoundError   OSError
      dataset.dataset "csv"       FileNotFoundError   ArrowInvalid
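
      Until the readers agree, a caller could normalize the two cases before handing the path to pyarrow. The following is only a minimal sketch and is not part of pyarrow: `read_checked` and its `empty_error` parameter are hypothetical names, and the check assumes a single local file path. `empty_error` defaults to the built-in `ValueError` so that the helper has no pyarrow dependency of its own.

```python
import errno
import os
import pathlib


def read_checked(reader, path, empty_error=ValueError, **kwargs):
    """Call reader(path, **kwargs) after checking for a missing or empty file.

    Hypothetical helper: raises FileNotFoundError for a missing path and
    empty_error for a zero-byte file, regardless of what the reader itself
    would have raised.
    """
    p = pathlib.Path(path)
    if not p.exists():
        # Same errno/message layout that open() uses for a missing file.
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), str(p))
    if p.is_file() and p.stat().st_size == 0:
        raise empty_error(f"File is empty: {p}")
    # Only reached for an existing, non-empty path.
    return reader(str(p), **kwargs)
```

      With `import pyarrow as pa` and `import pyarrow.parquet as parquet`, a call such as `read_checked(parquet.read_table, path, empty_error=pa.ArrowInvalid)` would give the behaviour requested in this issue for that one reader. Keyword arguments such as `format="parquet"` are passed through to the reader unchanged.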

       

      Code to reproduce issue:

      import pathlib
      import tempfile

      import pyarrow.csv as csv
      import pyarrow.dataset as dataset
      import pyarrow.feather as feather
      import pyarrow.parquet as parquet
      
      tempdir = pathlib.Path(tempfile.mkdtemp())
      
      with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
          pass
      
      with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
          pass
      
      with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
          pass
      
      # Empty file: every call below raises, so run them one at a time
      feather.read_table(str(tempdir / "empty_feather.feather"))
      parquet.read_table(str(tempdir / "empty_parquet.parquet"))
      csv.read_csv(str(tempdir / "empty_csv.csv"))
      dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
      dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
      dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")
      
      # Missing file: every call below raises, so run them one at a time
      feather.read_table(str(tempdir / "non_existent.feather"))
      parquet.read_table(str(tempdir / "non_existent.parquet"))
      csv.read_csv(str(tempdir / "non_existent.csv"))
      dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
      dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
      dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
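
      Because each call above raises, the script stops at the first one when run as a whole. The table could instead be regenerated in one pass with something like the sketch below. `raised_exception` and `collect_exception_names` are hypothetical helper names, not pyarrow APIs, and the pyarrow imports are done lazily inside the second function so the first one can be used on its own.

```python
import pathlib


def raised_exception(fn, *args, **kwargs):
    """Return the class name of the exception fn raises, or None if it succeeds."""
    try:
        fn(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return None


def collect_exception_names(tempdir):
    """Run each reader on a missing and an empty file; return a nested dict of names."""
    # Imported here (an assumption of this sketch) so that raised_exception
    # stays usable without pyarrow installed.
    import pyarrow.csv as csv
    import pyarrow.dataset as dataset
    import pyarrow.feather as feather
    import pyarrow.parquet as parquet

    tempdir = pathlib.Path(tempdir)
    # (label, callable, file extension, extra keyword arguments)
    readers = [
        ("feather.read_table", feather.read_table, "feather", {}),
        ("parquet.read_table", parquet.read_table, "parquet", {}),
        ("csv.read_csv", csv.read_csv, "csv", {}),
        ('dataset.dataset "feather"', dataset.dataset, "feather", {"format": "feather"}),
        ('dataset.dataset "parquet"', dataset.dataset, "parquet", {"format": "parquet"}),
        ('dataset.dataset "csv"', dataset.dataset, "csv", {"format": "csv"}),
    ]

    results = {}
    for label, fn, ext, kwargs in readers:
        empty = tempdir / f"empty.{ext}"
        empty.touch()  # create a zero-byte file
        missing = tempdir / f"non_existent.{ext}"
        results[label] = {
            "missing": raised_exception(fn, str(missing), **kwargs),
            "empty": raised_exception(fn, str(empty), **kwargs),
        }
    return results
```

      Calling `collect_exception_names(tempfile.mkdtemp())` and printing the result gives one row per reader. The names are the concrete exception classes; since FileNotFoundError is a subclass of OSError, an `except OSError` clause would catch both of the "missing file" variants in the table, but the reported class still differs.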
      
      

          People

            Assignee: Unassigned
            Reporter: German I. Ramirez-Espinoza (gire)
            Votes: 0
            Watchers: 4
