[ARROW-6058] [Python][Parquet] Failure when reading Parquet file from S3 with s3fs - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.14.1
Fix Version/s: 0.15.0
Component/s: C++
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/22460

Description

I am reading Parquet data from S3 and get an ArrowIOError.

Size of the data: 32 part files of about 90 MB each (approx. 3 GB in total)

Number of records: approx. 100M

Code Snippet:

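from s3fs import S3FileSystem
import pyarrow.parquet as pq

# Connect to S3 through s3fs
s3 = S3FileSystem()

# Open the directory of part files as one dataset
dataset = pq.ParquetDataset("s3://location", filesystem=s3)

# Read every part file and convert the result to a pandas DataFrame
df = dataset.read_pandas().to_pandas()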

Stack Trace:

    df = dataset.read_pandas().to_pandas()
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
    table = reader.read(**options)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
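
The message above reports that a read returned 197092 bytes where 263929 were expected. One possible explanation, stated here as an assumption and not as the confirmed root cause, is a short read from the underlying file object. The sketch below shows a common defensive pattern for that situation: keep calling read() until the requested number of bytes has arrived, and fail loudly if the stream ends early. The names read_exactly and ShortReadStream are hypothetical and are not part of pyarrow or s3fs.

```python
import io


def read_exactly(stream, nbytes):
    """Read exactly nbytes from stream, retrying after short reads.

    Raises EOFError if the stream ends before nbytes are available.
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            # The stream is exhausted, so the data really is truncated.
            raise EOFError(
                "stream ended after %d of %d bytes"
                % (nbytes - remaining, nbytes)
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReadStream:
    """Hypothetical test double that returns at most max_chunk bytes per read."""

    def __init__(self, data, max_chunk):
        self._buffer = io.BytesIO(data)
        self._max_chunk = max_chunk

    def read(self, size=-1):
        # Cap every request, imitating a file object that returns short reads.
        if size < 0 or size > self._max_chunk:
            size = self._max_chunk
        return self._buffer.read(size)


# Sizes taken from the error message in the stack trace.
expected = 263929
stream = ShortReadStream(b"\x00" * expected, max_chunk=197092)
page = read_exactly(stream, expected)
```

With a single read() call the simulated stream would hand back only 197092 bytes; the loop issues a second call for the remaining bytes. If the data is genuinely truncated, read_exactly raises EOFError instead of returning a partial page.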

Note: The same code works on a smaller dataset (fewer than approx. 50M records).

Issue Links

links to

GitHub Pull Request #5137

Upstream s3fs issue

People

Assignee: Wes McKinney

Reporter: Siddharth

Votes: 3

Watchers: 7

Dates

Created: 29/Jul/19 09:31

Updated: 11/Jan/23 07:44

Resolved: 21/Aug/19 13:40
