Apache Arrow / ARROW-8652

[Python] Test error message when discovering dataset with invalid files

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python

    Description

      There is a comment in test_parquet.py saying that the Dataset API needs a better error message for invalid files:

      https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648

      However, this seems to work now:

      import pathlib
      import tempfile

      import pyarrow.dataset as ds

      tempdir = pathlib.Path(tempfile.mkdtemp())

      # Create an empty (0-byte) file, which is not a valid Parquet file
      with open(str(tempdir / "data.parquet"), 'wb') as f:
          pass

      In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
      ...
      OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': Invalid: Parquet file size is 0 bytes
      

      So we need to update the test to actually check the error message instead of skipping.

      The only difference from the Python ParquetDataset implementation is that the Datasets API raises an OSError rather than an ArrowInvalid error.

          People

            Assignee: Unassigned
            Reporter: Joris Van den Bossche