
ARROW-10862: [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Environment: AWS SageMaker / S3

    Description

      Following this SO post: https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema

      I am attempting to find a way to override the schema of a Parquet file stored in S3. One date column has some bad dates, which cause the load of the entire Parquet file to fail.
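
      For reference, here is a minimal sketch that reproduces this kind of failure locally (the file and column names are made up, and it assumes the bad dates are stored as timestamps):

      from datetime import datetime

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Hypothetical file with one good date and one '0001-01-01' sentinel.
      table = pa.table({
          "event_date": pa.array(
              [datetime(2020, 1, 1), datetime(1, 1, 1)],
              type=pa.timestamp("us"),
          ),
      })
      pq.write_table(table, "bad_dates.parquet")

      # Converting to pandas casts timestamp[us] to datetime64[ns];
      # 0001-01-01 is outside the representable range, so this raises
      # pyarrow.lib.ArrowInvalid.
      pq.read_table("bad_dates.parquet").to_pandas()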

      I have tried defining a schema, but I get AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema', the same error as in the SO post above.
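
      As far as I can tell from the answers there, the schema argument of ParquetDataset expects a ParquetSchema (the object returned by pq.ParquetFile(path).schema), which has a to_arrow_schema() method, rather than a plain pyarrow Schema, hence the AttributeError:

      import pyarrow as pa
      import pyarrow.parquet as pq

      plain_schema = pa.schema([("event_date", pa.date32())])  # pyarrow.lib.Schema
      # plain_schema.to_arrow_schema()  # -> AttributeError, as above

      file_schema = pq.ParquetFile("bad_dates.parquet").schema  # ParquetSchema
      file_schema.to_arrow_schema()  # works, returns a pyarrow.lib.Schema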

      I have attempted the workaround Wes McKinney suggested in that post: create a dummy DataFrame, save it to Parquet, read the schema back from it, and use it in place of the schema embedded in my Parquet file:

      pq.ParquetDataset(my_filepath, filesystem=s3, schema=dummy_schema).read_pandas().to_pandas()
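
      Spelled out in full, my attempt looks roughly like this (the bucket path and column name are placeholders):

      import pandas as pd
      import pyarrow.parquet as pq
      import s3fs

      s3 = s3fs.S3FileSystem()
      my_filepath = "my-bucket/path/to/file.parquet"

      # Dummy frame with the column typed the way I want it read (string,
      # so that '0001-01-01' can be inspected), written out and read back
      # so the schema is a ParquetSchema rather than a plain Arrow schema.
      dummy_df = pd.DataFrame({"event_date": ["0001-01-01"]})
      dummy_df.to_parquet("dummy.parquet")
      dummy_schema = pq.ParquetFile("dummy.parquet").schema

      # This is where it fails: the schema argument is used to validate the
      # files' embedded schemas, not to override them, hence the
      # "schema is different" error below.
      pq.ParquetDataset(my_filepath, filesystem=s3,
                        schema=dummy_schema).read_pandas().to_pandas()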

      I get an error message telling me that my schema is different! (It was supposed to be!)


      Can you either allow schemas to be overridden or, even better, suggest a way to load a Parquet file where some of the values in a date column are '0001-01-01'?
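
      For example, something along these lines would cover my case, if it worked end to end (untested sketch; it assumes a recent pyarrow where Table.to_pandas() accepts the timestamp_as_object flag, and that the bad column is a timestamp):

      import pyarrow.parquet as pq

      table = pq.read_table("bad_dates.parquet")

      # Keep timestamps as Python datetime objects instead of casting to
      # datetime64[ns], so out-of-range values like 0001-01-01 survive and
      # can be inspected or filtered in pandas.
      df = table.to_pandas(timestamp_as_object=True)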


      Thanks,

      James Kelly



          People

            Assignee: Unassigned
            Reporter: jimmy_ds (Mr James Kelly)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: