Apache Arrow / ARROW-2659

[Python] More graceful reading of empty String columns in ParquetDataset

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: C++, Python
    • Labels:

      Description

      Currently, when saving a ParquetDataset from Pandas, we don't get consistent schemas across partitions, even if the source was a single DataFrame. This happens because object columns such as strings can end up empty in some partitions, in which case the resulting Arrow schema differs. The central metadata stores such a column as pa.string, whereas the partition file with the empty column stores it as pa.null.

      The two schemas are still a valid match in terms of schema evolution, and we should respect that in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754. Instead of calling pa.Schema.equals in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778, we should introduce a new method, pa.Schema.can_evolve_to, that is more graceful and returns True if a dataset piece has a null column where the main metadata declares a nullable column of any type.

              People

              • Assignee:
                Unassigned
                Reporter:
                xhochy Uwe L. Korn
              • Votes:
                2
                Watchers:
                5