[ARROW-1644] [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Fix Version/s: 2.0.0
Component/s: C++, Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/17654

Description

We have many nested Parquet files generated by Apache Spark for ranking problems, and we would like to load them in Python for other programs to consume.

The schema looks like this:

root
 |-- profile_id: long (nullable = true)
 |-- country_iso_code: string (nullable = true)
 |-- items: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- show_title_id: integer (nullable = true)
 |    |    |-- duration: double (nullable = true)

When I tried to load one of these files with the pyarrow nightly build from Oct 4, 2017, I got the following error:

Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table2 = pq.read_table('part-00000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
    nthreads=nthreads)
  File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
  File "error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
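
For context on what the reader has to do here: Parquet stores a nested column such as `items` by flattening each leaf (for example `items.element.show_title_id`) into a stream of values plus repetition and definition levels, and a reader must rebuild the list and struct nesting from those levels. The following is a minimal pure-Python sketch of that idea, not pyarrow's or parquet-cpp's implementation. It assumes the common three-level list layout (required `items` group, repeated inner group, required `element` struct, optional `show_title_id`), which gives a maximum repetition level of 1 and a maximum definition level of 2. The function names `shred_show_title_id` and `assemble_show_title_id` are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Tuple

# One row's `items`, reduced to each element's show_title_id (None = null field).
Row = List[Optional[int]]


def shred_show_title_id(rows: List[Row]) -> Tuple[List[int], List[int], List[int]]:
    """Flatten rows into (values, repetition levels, definition levels).

    Assumed layout: `items` required, inner group repeated, `element` required,
    `show_title_id` optional. So max repetition level = 1, max definition level = 2.
    """
    values: List[int] = []
    rep_levels: List[int] = []
    def_levels: List[int] = []
    for items in rows:
        if not items:
            # Empty list: one placeholder entry, nothing below `items` is defined.
            rep_levels.append(0)
            def_levels.append(0)
            continue
        for i, show_title_id in enumerate(items):
            # 0 starts a new row, 1 continues the current list.
            rep_levels.append(0 if i == 0 else 1)
            if show_title_id is None:
                def_levels.append(1)  # element exists, field is null
            else:
                def_levels.append(2)  # field is present
                values.append(show_title_id)
    return values, rep_levels, def_levels


def assemble_show_title_id(
    values: List[int], rep_levels: List[int], def_levels: List[int]
) -> List[Row]:
    """Rebuild the per-row lists from the flattened column."""
    rows: List[Row] = []
    value_iter = iter(values)
    for rep, dfn in zip(rep_levels, def_levels):
        if rep == 0:
            rows.append([])  # a repetition level of 0 marks a new row
        if dfn == 0:
            continue  # empty list, no element to append
        rows[-1].append(next(value_iter) if dfn == 2 else None)
    return rows
```

Under these assumptions, shredding the rows `[[10, None, 30], [], [40]]` gives values `[10, 30, 40]`, repetition levels `[0, 1, 1, 0, 0]`, and definition levels `[2, 1, 2, 0, 2]`, and assembling them returns the original rows. A real reader has to do this for every leaf under `element` and then zip the leaves back into structs, which is the part the error message reports as unsupported.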

My understanding was that once https://issues.apache.org/jira/browse/PARQUET-911 is merged, pyarrow should be able to load nested Parquet files like this one.

Any insight about this?

Thanks.

Issue Links

duplicates
- ARROW-6737 Nested column branch had multiple children (Closed)
- ARROW-7845 [C++] Reading list from parquet files (Closed)

is duplicated by
- PARQUET-1352 [CPP] Trying to write an arrow table with structs to a parquet file (Resolved)

is related to
- ARROW-2587 [Python] Unable to write StructArrays with multiple children to parquet (Resolved)
- ARROW-5799 [Python] Fail to write nested data to Parquet via BigQuery API (Closed)

links to
- GitHub Pull Request #462

Sub-Tasks

1. [C++] Create performance benchmark for parquet reading
   Status: Closed; Assignee: Micah Kornfield

2. [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo
   Status: Closed; Assignee: Unassigned

3. [C++][Parquet] Add a basic disabled unit test to exercise nesting functionality
   Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 0.5h

4. [C++][Parquet] Incorporate new level generation logic in parquet write path with a flag to revert back to old logic
   Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 6h

5. [C++] Add schema conversion support for map type
   Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 3h 20m

6. [C++][Parquet] Add a new level builder capable of handling nested data
   Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 11h 10m

7. [C++] Refactor DefLevelsToBitmap
   Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 14h

8. [C++] Cleanup Parquet Arrow Schema code
   Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 1h

9. [C++] Expose a ReadValuesSpaced method that accepts a validity bitmap.
   Status: Closed; Assignee: Unassigned

10. [C++] Create unified schema resolution code for Array reconstruction.
    Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 2h 50m

11. [C++] Add hand-crafted Parquet to Arrow reconstruction test for nested reading
    Status: Resolved; Assignee: Antoine Pitrou; Progress: 100%; Original Estimate: Not Specified; Time Spent: 1h 50m

12. [C++][Parquet] Generalize existing null bitmap generation
    Status: Open; Assignee: Unassigned; Progress: 100%; Original Estimate: Not Specified; Time Spent: 2h 10m

13. [C++] Implement basic array-by-array reassembly logic
    Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 9h 50m

14. [C++][Parquet] Create randomized nested data generation round trip read/write unit tests
    Status: Open; Assignee: Unassigned

15. [C++][Parquet] Add support for schema translation from parquet nodes back to arrow for missing types
    Status: Resolved; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 3h 20m

16. [C++][Parquet] Implement non-vectorized array reconstruction logic.
    Status: Open; Assignee: Unassigned

17. [C++][Parquet] Add EngineVersion to properties to allow for toggling new vs old logic
    Status: Closed; Assignee: Micah Kornfield; Progress: 100%; Original Estimate: Not Specified; Time Spent: 1h 50m

18. [Python][Parquet] Expose EngineVersion in python arrow reader properties
    Status: Closed; Assignee: Unassigned

19. [C++][Parquet] Create nested reading benchmarks
    Status: Resolved; Assignee: Antoine Pitrou; Progress: 100%; Original Estimate: Not Specified; Time Spent: 5.5h

20. [C++] Add Parquet-Arrow roundtrip tests for nested data
    Status: Resolved; Assignee: Antoine Pitrou; Progress: 100%; Original Estimate: Not Specified; Time Spent: 1h 10m

21. [C++] Investigate performance of LevelsToBitmap without BMI2
    Status: Resolved; Assignee: Antoine Pitrou; Progress: 100%; Original Estimate: Not Specified; Time Spent: 2h

22. [C++][Parquet] Create reading benchmarks for 2-level nested data
    Status: Resolved; Assignee: Antoine Pitrou; Progress: 100%; Original Estimate: Not Specified; Time Spent: 1h 20m

People

Assignee: Micah Kornfield
Reporter: DB Tsai
Votes: 42
Watchers: 46

Dates

Created: 05/Oct/17 00:43
Updated: 12/May/24 14:29
Resolved: 22/Oct/20 03:13

Time Tracking

Estimated: Not Specified
Remaining: 0h
Logged: 67h 50m (including sub-tasks)