Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 9.0.0
- None
Description
When the dataset has nested structs, such as a column of type "list<struct>", we cannot use `pyarrow.field(..)` to get a reference to a sub-field of the struct.
For example:
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pandas as pd

    schema = pa.schema(
        [
            pa.field(
                "objects",
                pa.list_(
                    pa.struct(
                        [
                            pa.field("name", pa.utf8()),
                            pa.field("attr1", pa.float32()),
                            pa.field("attr2", pa.int32()),
                        ]
                    )
                ),
            )
        ]
    )
    table = pa.Table.from_pandas(
        pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}])
    )
    print(table)
    dataset = ds.dataset(table)
    print(dataset)
    dataset.scanner(columns=["objects.attr2"]).to_table()
which throws an exception:
    Traceback (most recent call last):
      File "foo.py", line 31, in <module>
        dataset.scanner(columns=["objects.attr2"]).to_table()
      File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner
      File "pyarrow/_dataset.pyx", line 2356, in pyarrow._dataset.Scanner.from_dataset
      File "pyarrow/_dataset.pyx", line 2202, in pyarrow._dataset._populate_builder
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in objects: list<item: struct<attr1: double, attr2: int64, name: string>>
    __fragment_index: int32
    __batch_index: int32
    __last_in_fragment: bool
    __filename: string
Issue Links
- relates to
- ARROW-14596 [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False (In Progress)