[ARROW-9147] [C++][Dataset] Support null -> other type promotion in Dataset scanning - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/25255

Description

For schema evolution / normalization, we support inserting nulls for a missing column, changing nullability, and normalizing column order, but we do not yet seem to support promoting a column of null type to any other type.

A small Python example:

In [11]: import numpy as np
    ...: import pandas as pd
    ...: import pyarrow as pa
    ...: import pyarrow.dataset as ds
    ...:
    ...: df = pd.DataFrame({"col": np.array([None, None, None, None], dtype='object')})
    ...: df.to_parquet("test_filter_schema.parquet", engine="pyarrow")
    ...:
    ...: dataset = ds.dataset("test_filter_schema.parquet", format="parquet", schema=pa.schema([("col", pa.int64())]))
    ...: dataset.to_table()
...
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowTypeError: fields had matching names but differing types. From: col: null To: col: int64
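
To make the requested behaviour concrete, here is a minimal plain-Python sketch of the per-column decision described above. It is not Arrow's implementation: the function names (`project_field`, `plan_projection`), the action labels, and the use of plain strings for types are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: types are plain strings, not Arrow DataType objects.
NULL = "null"


def project_field(name, fragment_schema, dataset_schema):
    """Decide how one dataset column is produced from one fragment (file)."""
    target = dataset_schema[name]
    if name not in fragment_schema:
        # Column is missing from the file: fill it with nulls of the target type.
        return ("insert_nulls", target)
    source = fragment_schema[name]
    if source == target:
        # Types already agree: nothing to do.
        return ("pass_through", target)
    if source == NULL:
        # The case this issue asks for: an all-null column can take any type.
        return ("promote_null", target)
    # Any other mismatch stays an error, mirroring the message in the traceback.
    raise TypeError(
        "fields had matching names but differing types. "
        f"From: {name}: {source} To: {name}: {target}"
    )


def plan_projection(fragment_schema, dataset_schema):
    """Build the per-column plan for every field in the dataset schema."""
    return {
        name: project_field(name, fragment_schema, dataset_schema)
        for name in dataset_schema
    }


# A file whose only column is all-null, read with an int64 dataset schema.
plan = plan_projection({"col": "null"}, {"col": "int64", "extra": "string"})
print(plan)
```

In this sketch only the null-typed source is promoted; a mismatch such as string to int64 still raises, which keeps the rule narrow.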

Issue Links

is depended upon by

ARROW-2659 [Python] More graceful reading of empty String columns in ParquetDataset (Open)

ARROW-2860 [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read (Open)

links to

GitHub Pull Request #8343

People

Assignee: Ben Kietzman

Reporter: Joris Van den Bossche

Dates

Created: 16/Jun/20 14:00

Updated: 11/Jan/23 08:04

Resolved: 07/Oct/20 09:50

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 40m