Details
-
Type: Bug
-
Status: Closed
-
Priority: Major
-
Resolution: Cannot Reproduce
-
Affects Version/s: 0.15.1
-
Fix Version/s: None
-
Component/s: None
Environment: running in an AWS Lambda, compiled on an EC2 instance running Linux
Description
We are piloting pyarrow to read Parquet files from AWS S3.
We got it working in combination with s3fs as the filesystem. However, we are seeing very inconsistent results when reading Parquet objects with:
s3=s3fs.S3FileSystem()
ParquetDataset(url, filesystem=s3)
The read inconsistently throws this error:
[ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
Traceback (most recent call last):
File "/var/task/file_check.py", line 35, in lambda_handler
main(event, context)
File "/var/task/file_check.py", line 260, in main
validate_resp['object_type'])
File "/opt/python/utils.py", line 80, in schema_check
stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
open_file_func=partial(_open_dataset_file, self._metadata)
File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
.format(path))
As you can see, the path is valid and sometimes works, while other times it does not (the file is not modified between the successful and failing runs). Does ParquetDataset actually open and validate the file, so that the error refers to the data itself?
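One variable we can at least isolate locally is URL formatting: whether ParquetDataset is handed the full s3:// URL or a bare bucket/key path. Below is a minimal sketch of a normalizing helper we could test with; the function name and the example bucket/key are illustrative only and not part of pyarrow or s3fs:

```python
from urllib.parse import urlparse

def to_bucket_key(url):
    """Strip a leading s3:// scheme, returning 'bucket/key'.

    Illustrative helper only; leaves scheme-less paths untouched.
    """
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        return parsed.netloc + parsed.path
    return url

# Hypothetical example path:
print(to_bucket_key("s3://bucket/prefix/file.parquet"))  # bucket/prefix/file.parquet
```

Calling e.g. ParquetDataset(to_bucket_key(url), filesystem=s3) alongside the original call would at least tell us whether the scheme prefix plays any role in the intermittent failures.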
We are willing to do any troubleshooting to get this solved.