Details
Improvement
Status: Resolved
Major
Resolution: Fixed
9.0.0
Description
Different Parquet implementations use different names for the internal fields of ListType and MapType, which can cause spurious schema conflicts. For example, we use item as the field name for list values, but Spark uses element. Fortunately, we can automatically cast between List and Map types whose internal field names differ. Unfortunately, this only works at the top level. We should make it work at arbitrary levels of nesting.
This was discovered in delta-rs: https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
Here's a reproduction in Python:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

def roundtrip_scanner(in_arr, out_type):
    table = pa.table({"arr": in_arr})
    pq.write_table(table, "test.parquet")
    schema = pa.schema({"arr": out_type})
    ds.dataset("test.parquet", schema=schema).to_table()

# MapType
ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
ty = pa.map_(pa.int32(), pa.int32())
arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# ListType
ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
ty = pa.list_(pa.int32())
arr_named = pa.array([[1, 2, 4]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# Combination MapType and ListType
ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False))
ty = pa.map_(pa.string(), pa.list_(pa.int32()))
arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
roundtrip_scanner(arr_named, ty)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 5, in roundtrip_scanner
#   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
#   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
# pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, list<x: int32> ('arr')>
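To make the requested behavior concrete, here is a rough sketch of the comparison the cast would need: two types count as compatible when they differ only in internal field names, checked recursively. This is a toy model in plain Python, not Arrow's implementation. The classes Primitive, Field, ListOf, MapOf and the function same_layout are hypothetical names made up for illustration, and the sketch ignores nullability.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins for Arrow data types; not part of pyarrow.
@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class Field:
    name: str
    type: "DataType"


@dataclass(frozen=True)
class ListOf:
    value: Field


@dataclass(frozen=True)
class MapOf:
    key: Field
    item: Field


DataType = Union[Primitive, ListOf, MapOf]


def same_layout(a: DataType, b: DataType) -> bool:
    """Return True if a and b differ only in internal field names, at any depth."""
    if isinstance(a, Primitive) and isinstance(b, Primitive):
        return a.name == b.name
    if isinstance(a, ListOf) and isinstance(b, ListOf):
        # Field names are deliberately ignored; only the child types are compared.
        return same_layout(a.value.type, b.value.type)
    if isinstance(a, MapOf) and isinstance(b, MapOf):
        return (same_layout(a.key.type, b.key.type)
                and same_layout(a.item.type, b.item.type))
    return False


# Mirrors the failing case above: a map whose value is a list with a custom field name.
named = MapOf(Field("key", Primitive("string")),
              Field("x", ListOf(Field("x", Primitive("int32")))))
default = MapOf(Field("key", Primitive("string")),
                Field("value", ListOf(Field("item", Primitive("int32")))))
print(same_layout(named, default))
```

In this model the nested case is accepted because the recursion never looks at Field.name. A top-level-only check would stop after the outer MapOf and reject the differing inner list field name, which is the behavior reported here.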