Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
Description
Hello,
It looks like pyarrow 2.0.0 cannot read partitioned datasets from S3 buckets:
import s3fs
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

filesystem = s3fs.S3FileSystem()
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
table = pa.Table.from_pandas(x, preserve_index=True)
pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet',
                    partition_cols=['Year'], filesystem=filesystem)
Now, reading it via pq.read_table:
pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem, use_pandas_metadata=True)
raises an exception:
ArrowInvalid: GetFileInfo() yielded path 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet', which is outside base dir 's3://bucket/test_pyarrow.parquet'
A direct read in pandas:
pd.read_parquet('s3://bucket/test_pyarrow.parquet')
returns an empty DataFrame.
The issue does not exist in pyarrow 1.0.1.
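A possible workaround, sketched here under the assumption that the root cause is the URI-vs-path confusion tracked in the related ARROW-10998: when an explicit filesystem is supplied, pass a bucket-relative path instead of a full s3:// URI, so the paths yielded by GetFileInfo() stay inside the base dir. The bucket name 'bucket' is a placeholder.

import s3fs
import pyarrow.parquet as pq

filesystem = s3fs.S3FileSystem()
# Assumed workaround: drop the 's3://' scheme prefix when the filesystem
# object is passed explicitly, since the dataset API expects a file path here.
table = pq.read_table('bucket/test_pyarrow.parquet',
                      filesystem=filesystem, use_pandas_metadata=True)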
Issue Links
- relates to ARROW-10998 [C++] Filesystems: detect if URI is passed where a file path is required and raise informative error (Resolved)