
ARROW-10862: [Python] Overriding Parquet schema when loading to SageMaker to inspect bad data


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Environment: AWS SageMaker / S3

    Description

      Following this SO post: https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema

      I am attempting to find a way to override the schema of a Parquet file stored in S3. One date column has some bad dates, which cause the load of the entire Parquet file to fail.
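
      For reference, here is a minimal sketch that reproduces this kind of failure locally (the file and column names are made up, and it assumes the bad dates are stored as timestamps):

      from datetime import datetime

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Hypothetical file with one good date and one '0001-01-01' sentinel.
      table = pa.table({
          "event_date": pa.array(
              [datetime(2020, 1, 1), datetime(1, 1, 1)],
              type=pa.timestamp("us"),
          ),
      })
      pq.write_table(table, "bad_dates.parquet")

      # Converting to pandas casts timestamp[us] to datetime64[ns];
      # 0001-01-01 is outside the representable range, so this raises
      # pyarrow.lib.ArrowInvalid.
      pq.read_table("bad_dates.parquet").to_pandas()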

      I have tried defining a schema, but I get AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema', the same error as in the SO post above.
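
      As far as I can tell from the answers there, the schema argument of ParquetDataset expects a ParquetSchema (the object returned by pq.ParquetFile(path).schema), which has a to_arrow_schema() method, rather than a plain pyarrow Schema, hence the AttributeError:

      import pyarrow as pa
      import pyarrow.parquet as pq

      plain_schema = pa.schema([("event_date", pa.date32())])  # pyarrow.lib.Schema
      # plain_schema.to_arrow_schema()  # -> AttributeError, as above

      file_schema = pq.ParquetFile("bad_dates.parquet").schema  # ParquetSchema
      file_schema.to_arrow_schema()  # works, returns a pyarrow.lib.Schema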

      I have attempted the workaround Wes McKinney suggested in that post: create a dummy DataFrame, save it to Parquet, read the schema back from it, and use it in place of the schema embedded in my Parquet file:

      pq.ParquetDataset(my_filepath, filesystem=s3, schema=dummy_schema).read_pandas().to_pandas()
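
      Spelled out in full, my attempt looks roughly like this (the bucket path and column name are placeholders):

      import pandas as pd
      import pyarrow.parquet as pq
      import s3fs

      s3 = s3fs.S3FileSystem()
      my_filepath = "my-bucket/path/to/file.parquet"

      # Dummy frame with the column typed the way I want it read (string,
      # so that '0001-01-01' can be inspected), written out and read back
      # so the schema is a ParquetSchema rather than a plain Arrow schema.
      dummy_df = pd.DataFrame({"event_date": ["0001-01-01"]})
      dummy_df.to_parquet("dummy.parquet")
      dummy_schema = pq.ParquetFile("dummy.parquet").schema

      # This is where it fails: the schema argument is used to validate the
      # files' embedded schemas, not to override them, hence the
      # "schema is different" error below.
      pq.ParquetDataset(my_filepath, filesystem=s3,
                        schema=dummy_schema).read_pandas().to_pandas()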

      I get an error message telling me that my schema is different! (It was supposed to be!)


      Can you either allow schemas to be overridden or, even better, suggest a way to load a Parquet file where some of the values in a date column are '0001-01-01'?
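
      For example, something along these lines would cover my case, if it worked end to end (untested sketch; it assumes a recent pyarrow where Table.to_pandas() accepts the timestamp_as_object flag, and that the bad column is a timestamp):

      import pyarrow.parquet as pq

      table = pq.read_table("bad_dates.parquet")

      # Keep timestamps as Python datetime objects instead of casting to
      # datetime64[ns], so out-of-range values like 0001-01-01 survive and
      # can be inspected or filtered in pandas.
      df = table.to_pandas(timestamp_as_object=True)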


      Thanks,

      James Kelly



          People

            Assignee: Unassigned
            Reporter: jimmy_ds (Mr James Kelly)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: