Apache Arrow / ARROW-10937

[Python] ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      Hello

      It looks like pyarrow-2.0.0 cannot read partitioned datasets from S3 buckets:

      import numpy as np
      import pandas as pd
      import s3fs
      import pyarrow as pa
      import pyarrow.parquet as pq

      filesystem = s3fs.S3FileSystem()
      
      d = pd.date_range('1990-01-01', freq='D', periods=10000)
      vals = np.random.randn(len(d), 4)
      x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
      x['Year'] = x.index.year
      
      table = pa.Table.from_pandas(x, preserve_index=True)
      pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet', partition_cols=['Year'], filesystem=filesystem)
      

       

       Now, reading it via pq.read_table:

      pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem, use_pandas_metadata=True)
      

      raises an exception:

      ArrowInvalid: GetFileInfo() yielded path 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet', which is outside base dir 's3://bucket/test_pyarrow.parquet'
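The two strings in the message hint at the cause: the path yielded by GetFileInfo() carries no 's3://' scheme, while the base dir passed to dataset discovery keeps it, so the prefix check fails. A minimal sketch of that comparison (an illustration only, not Arrow's actual code):

```python
yielded = 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet'
base_dir = 's3://bucket/test_pyarrow.parquet'

# The yielded path does not start with the scheme-qualified base dir...
print(yielded.startswith(base_dir))                 # False
# ...but it does once the 's3://' scheme is stripped from the base dir.
print(yielded.startswith(base_dir[len('s3://'):]))  # True
```

Consistent with this, passing the path without the scheme ('bucket/test_pyarrow.parquet') has been a workaround for similar mismatches, though that is an assumption here, not something verified against this report.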
      

       

      A direct read in pandas:

      pd.read_parquet('s3://bucket/test_pyarrow.parquet')

      returns an empty DataFrame.

       

      The issue does not exist in pyarrow-1.0.1.

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: Vladimir Filimonov
    Votes: 0
    Watchers: 4
