  1. Apache Arrow
  2. ARROW-15910

[Python] pyarrow.parquet.read_table either returns FileNotFound or ArrowInvalid

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0.1, 7.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None
    • Environment: GCP JupyterLab notebooks

    Description

      Running the code below raises ArrowInvalid with the message "GetFileInfo() yielded path 'myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet' which is outside base dir 'gs://myBucket/features/MyParquet.parquet/'":

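      import gcsfs
      import pyarrow.parquet as pq

      # Dataset directory in GCS (note the gs:// scheme and the trailing slash)
      file_path = "gs://myBucket/features/MyParquet.parquet/"
      fs = gcsfs.GCSFileSystem()
      # This call fails with the "outside base dir" error quoted above
      table = pq.read_table(file_path, filesystem=fs)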
      

      Removing the gs:// prefix from file_path results in a FileNotFoundError instead. Any variation of / or // at the beginning of the path gives the same 'outside base dir' error.
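
      The two paths in the error message differ only by the gs:// scheme, which suggests that a listed path without the scheme is being compared against a base dir that still carries it. The following is a minimal sketch of that mismatch, assuming a plain string-prefix comparison; Arrow's actual check may differ, and strip_scheme and is_within_base_dir are hypothetical helpers written for illustration, not pyarrow or gcsfs APIs.

```python
def strip_scheme(path: str) -> str:
    """Drop a leading 'scheme://' (for example 'gs://') if present."""
    scheme, sep, rest = path.partition("://")
    return rest if sep else path


def is_within_base_dir(path: str, base_dir: str) -> bool:
    """Illustrative string-prefix containment check (an assumption,
    not Arrow's implementation)."""
    base = base_dir.rstrip("/") + "/"
    return path.startswith(base)


# Both strings are taken from the error message in this report.
base_dir = "gs://myBucket/features/MyParquet.parquet/"
listed = "myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet"

# The listed path has no scheme, so it is not a prefix match for base_dir.
print(is_within_base_dir(listed, base_dir))                # False
# Once the scheme is stripped from base_dir, the same path matches.
print(is_within_base_dir(listed, strip_scheme(base_dir)))  # True
```

      This only illustrates why the two strings in the message cannot match as written; it does not show where in pyarrow or gcsfs the scheme is dropped.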

      I also ran the code below and got valid results with both file_path patterns, so I know the path itself is found just fine.

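      from pyarrow.fs import FileSelector, FSSpecHandler, PyFileSystem

      # Wrap the gcsfs filesystem (fs, defined in the first snippet) for pyarrow
      filesys = PyFileSystem(FSSpecHandler(fs))
      # List everything under file_path recursively
      selector = FileSelector(file_path, recursive=True)
      filesys.get_file_info(selector)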
      
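
      To record which error each spelling of the path produces, a small harness along these lines may help. It is a sketch only: probe_paths and path_variants are hypothetical helpers, the set of variants is an assumption about the spellings described above, and the reader is passed in as a callable so the harness itself does not depend on pyarrow or GCS access.

```python
from typing import Callable, Dict, Iterable, List


def path_variants(bucket_path: str) -> List[str]:
    """Spell one bucket path with several prefixes (assumed set of variants)."""
    bare = bucket_path.lstrip("/")
    return [f"gs://{bare}", bare, f"/{bare}", f"//{bare}"]


def probe_paths(read: Callable[[str], object],
                paths: Iterable[str]) -> Dict[str, str]:
    """Call read(path) for each path; record 'ok' or the exception type name."""
    outcomes = {}
    for path in paths:
        try:
            read(path)
            outcomes[path] = "ok"
        except Exception as exc:  # broad on purpose: we only classify failures
            outcomes[path] = type(exc).__name__
    return outcomes


# Usage against the setup in this report (needs pyarrow, gcsfs, and GCS access):
#   import gcsfs
#   import pyarrow.parquet as pq
#   fs = gcsfs.GCSFileSystem()
#   results = probe_paths(
#       lambda p: pq.read_table(p, filesystem=fs),
#       path_variants("myBucket/features/MyParquet.parquet/"),
#   )
#   for path, outcome in results.items():
#       print(path, "->", outcome)
```

      Passing the reader as a callable also makes it easy to repeat the same probe with filesys.get_file_info in place of pq.read_table and compare the outcomes side by side.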

          People

            Assignee: Unassigned
            Reporter: Callista Rogers (crogers923)
            Votes: 0
            Watchers: 4
