  Apache Arrow / ARROW-17898

pyarrow.parquet.read_table fs (filesystem) argument does not work with fsspec.implementations.arrow.ArrowFSWrapper objects

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.0.1
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None
    • Environment: Python 3.8.10

    Description

      I am using PyArrow 8.0.0. Passing a PyArrow LocalFileSystem object wrapped in ArrowFSWrapper to pq.read_table raises the following error:

      import pyarrow as pa
      import pyarrow.parquet as pq
      from fsspec.implementations.arrow import ArrowFSWrapper
      
      lfs = pa.fs.LocalFileSystem()
      fs = ArrowFSWrapper(lfs)
      pat = pq.read_table("some/file/location.parquet", filesystem=fs)
      
      OSError                                   Traceback (most recent call last)
      Cell In [12], line 1
      ----> 1 pat = pq.read_table(
            2     "some/file/location.parquet",
            3     filesystem=fs)

      File /usr/local/lib/python3.8/dist-packages/pyarrow/parquet/__init__.py:2737, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
         2730     raise ValueError(
         2731         "The 'metadata' keyword is no longer supported with the new "
         2732         "datasets-based implementation. Specify "
         2733         "'use_legacy_dataset=True' to temporarily recover the old "
         2734         "behaviour."
         2735     )
         2736 try:
      -> 2737     dataset = _ParquetDatasetV2(
         2738         source,
         2739         schema=schema,
         2740         filesystem=filesystem,
         2741         partitioning=partitioning,
         2742         memory_map=memory_map,
         2743         read_dictionary=read_dictionary,
         2744         buffer_size=buffer_size,
         2745         filters=filters,
         2746         ignore_prefixes=ignore_prefixes,
      ...
      File /usr/local/lib/python3.8/dist-packages/pyarrow/io.pxi:193, in pyarrow.lib.NativeFile.get_random_access_file()
      File /usr/local/lib/python3.8/dist-packages/pyarrow/io.pxi:222, in pyarrow.lib.NativeFile._assert_seekable()
      OSError: only valid on seekable files
      

      If I instead use just the LocalFileSystem object without the ArrowFSWrapper, it works as expected.
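Since the unwrapped LocalFileSystem works, one possible workaround is to hand read_table the native filesystem that the wrapper holds. The following is a minimal sketch, assuming ArrowFSWrapper keeps the wrapped filesystem in its `fs` attribute. The helper names `unwrap_arrow_filesystem` and `read_table_unwrapped` are hypothetical and not part of pyarrow or fsspec.

```python
def unwrap_arrow_filesystem(filesystem):
    """Return the native PyArrow filesystem held by an ArrowFSWrapper.

    Assumption: the wrapper stores the wrapped filesystem as ``.fs``.
    Any other object is returned unchanged.
    """
    # Match on the class name so this helper does not need to import fsspec.
    if type(filesystem).__name__ == "ArrowFSWrapper":
        return filesystem.fs
    return filesystem


def read_table_unwrapped(path, filesystem=None, **kwargs):
    """Call pq.read_table with the native filesystem instead of the wrapper."""
    # Imported lazily so the helper above stays usable without pyarrow.
    import pyarrow.parquet as pq

    return pq.read_table(
        path, filesystem=unwrap_arrow_filesystem(filesystem), **kwargs
    )
```

With the objects from the report, this would be called as `read_table_unwrapped("some/file/location.parquet", filesystem=fs)`. It sidesteps the wrapper rather than fixing it, so it only helps when the wrapped object is a PyArrow filesystem to begin with.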

        Activity

          alenka Alenka Frim added a comment -

          I haven't used ArrowFSWrapper or fsspec before, but judging from the docs this should work:

          import pyarrow as pa
          import pyarrow.parquet as pq
          from pyarrow import fs
          from fsspec.implementations.arrow import ArrowFSWrapper
          
          # Wrap a native PyArrow local filesystem in the fsspec adapter.
          local = fs.LocalFileSystem()
          local_fsspec = ArrowFSWrapper(local)
          
          # Write a small table through the wrapper, then read it back.
          table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
                            'n_legs': [2, 2, 4, 4, 5, 100]})
          pq.write_table(table, 'example.parquet', filesystem=local_fsspec)
          
          pq.read_table("example.parquet", filesystem=local_fsspec)
          

          It also fails for me (pyarrow 8.0.0 and 9.0.0):

          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "/Users/alenkafrim/repos/pyarrow-triaging-9/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
              dataset = _ParquetDatasetV2(
            File "/Users/alenkafrim/repos/pyarrow-triaging-9/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2368, in __init__
              [fragment], schema=schema or fragment.physical_schema,
            File "pyarrow/_dataset.pyx", line 898, in pyarrow._dataset.Fragment.physical_schema.__get__
            File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
            File "pyarrow/io.pxi", line 265, in pyarrow.lib.NativeFile.tell
            File "pyarrow/io.pxi", line 197, in pyarrow.lib.NativeFile.get_random_access_file
            File "pyarrow/io.pxi", line 226, in pyarrow.lib.NativeFile._assert_seekable
          OSError: only valid on seekable files
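For context on why a Parquet reader needs a seekable file at all: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so the reader has to jump to the end of the file before it can locate the schema. The standard-library sketch below illustrates that requirement. It is not pyarrow's implementation; `read_footer_length`, `ForwardOnlyStream`, and the synthetic bytes are made up for illustration, and whether the wrapper actually hands pyarrow a forward-only stream is an assumption based on the traceback.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_length(f):
    """Return the footer metadata length stored in the last 8 bytes of f."""
    if not f.seekable():
        # Same message as in the traceback, raised here for illustration only.
        raise OSError("only valid on seekable files")
    f.seek(-8, io.SEEK_END)
    length, magic = struct.unpack("<I4s", f.read(8))
    if magic != PARQUET_MAGIC:
        raise ValueError("missing trailing Parquet magic bytes")
    return length


class ForwardOnlyStream(io.RawIOBase):
    """A readable stream that refuses to seek, like a plain input stream."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        return self._inner.readinto(buffer)


# Synthetic bytes laid out like a Parquet file: magic, 16 placeholder bytes
# standing in for data and footer, the footer length, and the closing magic.
# This is NOT a readable Parquet file.
SYNTHETIC_FILE = PARQUET_MAGIC + b"\x00" * 16 + struct.pack("<I", 16) + PARQUET_MAGIC

print(read_footer_length(io.BytesIO(SYNTHETIC_FILE)))
try:
    read_footer_length(ForwardOnlyStream(SYNTHETIC_FILE))
except OSError as exc:
    print("forward-only stream:", exc)
```

The random-access handle succeeds, while the forward-only stream fails before any bytes are read, which is the same shape as the failure in `NativeFile._assert_seekable` above.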
          
          rokm Rok Mihevc added a comment -

          This issue has been migrated to issue #33110 on GitHub. Please see the migration documentation for further details.


          People

            Assignee: Unassigned
            Reporter: Anshul Kanakia (akanakia)
            Votes: 0
            Watchers: 3
