Details
Bug
Status: Open
Minor
Resolution: Unresolved
0.17.1
None
None
Description
The reading functions dataset.dataset, feather.read_table, parquet.read_table, and csv.read_csv raise different exceptions when given an empty file or a missing file. See the table below.
Ideally, every reading function would raise FileNotFoundError when the file is missing and ArrowInvalid when the file is empty.
The most interesting case is dataset.dataset, where the format parameter changes which exception is raised for an empty file.
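Until the behaviour is unified, the desired contract could be approximated on the caller's side. The sketch below is only an illustration, not part of pyarrow: check_readable and read_uniform are hypothetical helper names, and the empty-file exception defaults to ValueError so that the helper has no pyarrow dependency (pyarrow.ArrowInvalid, a ValueError subclass, can be passed in instead).

```python
import errno
import os


def check_readable(path, empty_error=ValueError):
    """Hypothetical pre-check: uniform errors for missing and empty files.

    Raises FileNotFoundError if `path` does not exist, and `empty_error`
    if `path` is a regular file of zero bytes. Directories are left alone
    so that directory-based datasets still reach the reader.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), str(path))
    if os.path.isfile(path) and os.path.getsize(path) == 0:
        raise empty_error(f"File is empty: {path}")


def read_uniform(reader, path, *args, empty_error=ValueError, **kwargs):
    """Run the pre-check, then delegate to `reader(path, *args, **kwargs)`."""
    check_readable(path, empty_error)
    return reader(path, *args, **kwargs)


# Assumed usage with pyarrow (not executed here):
#   import pyarrow as pa
#   import pyarrow.parquet as parquet
#   table = read_uniform(parquet.read_table, "data.parquet",
#                        empty_error=pa.ArrowInvalid)
```

The check runs before the reader is called, so the result does not depend on which exception a particular reader or format would have raised on its own.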
Function | Missing file | Empty file |
---|---|---|
feather.read_table | FileNotFoundError | ArrowInvalid |
parquet.read_table | OSError | ArrowInvalid |
csv.read_csv | FileNotFoundError | ArrowInvalid |
dataset.dataset "feather" | FileNotFoundError | ArrowInvalid |
dataset.dataset "parquet" | FileNotFoundError | OSError |
dataset.dataset "csv" | FileNotFoundError | ArrowInvalid |
Code to reproduce issue:
import pathlib
import sys
import tempfile

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass
with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass
with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass

# Empty File
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")

# Missing File
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
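Every call in the reproduction raises, so a plain script stops at the first one. To collect the exception type of each reader in a single run, something like the following sketch could be used. The helper names exception_name, tabulate_exceptions, and pyarrow_readers are hypothetical and introduced only for illustration; pyarrow is imported lazily so the generic helpers work without it. The observed exception names may differ between pyarrow versions.

```python
import functools


def exception_name(func, *args, **kwargs):
    """Call func and return the raised exception's class name, or None."""
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return None


def tabulate_exceptions(readers, path):
    """Map each reader label to the exception name raised for `path`."""
    return {label: exception_name(reader, path) for label, reader in readers.items()}


def pyarrow_readers():
    """Readers from the report; requires pyarrow to be installed."""
    import pyarrow.csv as csv
    import pyarrow.dataset as dataset
    import pyarrow.feather as feather
    import pyarrow.parquet as parquet

    return {
        "feather.read_table": feather.read_table,
        "parquet.read_table": parquet.read_table,
        "csv.read_csv": csv.read_csv,
        'dataset.dataset "feather"': functools.partial(dataset.dataset, format="feather"),
        'dataset.dataset "parquet"': functools.partial(dataset.dataset, format="parquet"),
        'dataset.dataset "csv"': functools.partial(dataset.dataset, format="csv"),
    }


# Assumed usage (not executed here), with any missing or empty path:
#   for label, name in tabulate_exceptions(pyarrow_readers(), "missing.file").items():
#       print(f"{label} | {name}")
```

Calling tabulate_exceptions once with a missing path and once with an empty file gives the two columns of the table for the installed version.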