[ARROW-2659] [Python] More graceful reading of empty String columns in ParquetDataset - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.9.0
Fix Version/s: None
Component/s: C++, Python
Labels:

External issue URL:
https://github.com/apache/arrow/issues/19053

Description

When saving a ParquetDataset from Pandas, we currently don't get consistent schemas, even if the source was a single DataFrame. This happens because object columns such as strings can be entirely empty in some partitions, and the resulting Arrow schema then differs. In the central metadata, we store such a column as pa.string, whereas in the partition file with the empty column, it is stored as pa.null.

The two schemas are still a valid match in terms of schema evolution, and we should respect that in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754. Instead of doing a pa.Schema.equals in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778, we should introduce a new method pa.Schema.can_evolve_to that is more graceful and returns True if a dataset piece has a null column where the main metadata states a nullable column of any type.

Attachments

- read_parquet_dataset.error.read_table.txt (0.1 kB, 01/Jun/18 15:54, Aldrin Montana)
- read_parquet_dataset.error.read_table.novalidation.txt (18 kB, 01/Jun/18 15:54, Aldrin Montana)

Issue Links

depends upon

- ARROW-8039 [Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim (Resolved)
- ARROW-9147 [C++][Dataset] Support null -> other type promotion in Dataset scanning (Resolved)

is related to

- ARROW-2860 [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read (Open)
- ARROW-2366 [Python][C++][Parquet] Support reading Parquet files having a permutation of column order (Resolved)

People

Assignee: Unassigned
Reporter: Uwe Korn
Votes: 4
Watchers: 13

Dates

Created: 01/Jun/18 08:15
Updated: 11/Jan/23 07:22