Details
- New Feature
- Status: Resolved
- Major
- Resolution: Fixed
- 0.8.0
Description
We have many nested Parquet files generated by Apache Spark for ranking problems, and we would like to load them in Python for other programs to consume.
The schema looks like this:
root
 |-- profile_id: long (nullable = true)
 |-- country_iso_code: string (nullable = true)
 |-- items: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- show_title_id: integer (nullable = true)
 |    |    |-- duration: double (nullable = true)
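To make the nesting concrete, the shape of a single row can be sketched in plain Python. This is only an illustration of the schema above: the class names `Item` and `ProfileRow`, the `flatten` helper, and the sample values are made up and are not part of Spark or pyarrow.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Item:
    # Mirrors the struct inside the `items` array; both fields are nullable.
    show_title_id: Optional[int]
    duration: Optional[float]


@dataclass
class ProfileRow:
    # Mirrors one top-level row; `items` itself is non-nullable in the schema.
    profile_id: Optional[int]
    country_iso_code: Optional[str]
    items: List[Item]


FlatRow = Tuple[Optional[int], Optional[str], Optional[int], Optional[float]]


def flatten(row: ProfileRow) -> List[FlatRow]:
    """Return one (profile_id, country_iso_code, show_title_id, duration) tuple per item."""
    return [
        (row.profile_id, row.country_iso_code, item.show_title_id, item.duration)
        for item in row.items
    ]


# Made-up sample values, for illustration only.
example = ProfileRow(
    profile_id=1,
    country_iso_code="US",
    items=[
        Item(show_title_id=101, duration=12.5),
        Item(show_title_id=102, duration=None),
    ],
)
flat_rows = flatten(example)
```

Each row carries a list of structs, which is exactly the "list of structs" combination that the reader has to handle.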
When I tried to load one of these files with the pyarrow nightly build from Oct 4, 2017, I got the following error:
Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table2 = pq.read_table('part-00000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
    nthreads=nthreads)
  File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
  File "error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
I was under the impression that once https://issues.apache.org/jira/browse/PARQUET-911 was merged, we would be able to load nested Parquet files in pyarrow.
Any insight about this?
Thanks.
Issue Links
- duplicates
  - ARROW-6737 Nested column branch had multiple children (Closed)
  - ARROW-7845 [C++] Reading list from parquet files (Closed)
- is duplicated by
  - PARQUET-1352 [CPP] Trying to write an arrow table with structs to a parquet file (Resolved)
- is related to
  - ARROW-2587 [Python] Unable to write StructArrays with multiple children to parquet (Resolved)
  - ARROW-5799 [Python] Fail to write nested data to Parquet via BigQuery API (Closed)