Apache Arrow / ARROW-1644

[C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

      Description

      We have many nested Parquet files generated by Apache Spark for ranking problems, and we would like to load them in Python for other programs to consume.

      The schema looks like this:

      root
       |-- profile_id: long (nullable = true)
       |-- country_iso_code: string (nullable = true)
       |-- items: array (nullable = false)
       |    |-- element: struct (containsNull = false)
       |    |    |-- show_title_id: integer (nullable = true)
       |    |    |-- duration: double (nullable = true)
      

      When I tried to load one of these files with a nightly build of pyarrow on Oct 4, 2017, I got the following error:

      Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
      [GCC 7.2.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import numpy as np
      >>> import pandas as pd
      >>> import pyarrow as pa
      >>> import pyarrow.parquet as pq
      >>> table2 = pq.read_table('part-00000')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
          use_pandas_metadata=use_pandas_metadata)
        File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
          nthreads=nthreads)
        File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
        File "error.pxi", line 85, in pyarrow.lib.check_status
      pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
      

      My impression was that once https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able to load nested Parquet data in pyarrow.

      Any insight about this?

      Thanks.

          Sub-Tasks

          1. [C++] Create performance benchmark for parquet reading (Sub-task, Resolved, Micah Kornfield)
          2. [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo (Sub-task, Resolved, Unassigned)
          3. [C++][Parquet] Add a basic disabled unit test to exercise nesting functionality (Sub-task, Resolved, Micah Kornfield; time spent 0.5h)
          4. [C++][Parquet] Incorporate new level generation logic in parquet write path with a flag to revert back to old logic (Sub-task, Resolved, Micah Kornfield; time spent 6h)
          5. [C++] Add schema conversion support for map type (Sub-task, Resolved, Micah Kornfield; time spent 3h 20m)
          6. [C++][Parquet] Add a new level builder capable of handling nested data (Sub-task, Resolved, Micah Kornfield; time spent 11h 10m)
          7. [C++] Refactor DefLevelsToBitmap (Sub-task, Resolved, Micah Kornfield; time spent 14h)
          8. [C++] Cleanup Parquet Arrow Schema code (Sub-task, Resolved, Micah Kornfield; time spent 1h)
          9. [C++] Expose a ReadValuesSpaced method that accepts a validity bitmap. (Sub-task, Resolved, Unassigned)
          10. [C++] Create unified schema resolution code for Array reconstruction. (Sub-task, Resolved, Micah Kornfield; time spent 2h 50m)
          11. [C++] Add hand-crafted Parquet to Arrow reconstruction test for nested reading (Sub-task, Resolved, Antoine Pitrou; time spent 1h 50m)
          12. [C++][Parquet] Generalize existing null bitmap generation (Sub-task, Open, Micah Kornfield; time spent 2h 10m)
          13. [C++] Implement basic array-by-array reassembly logic (Sub-task, Resolved, Micah Kornfield; time spent 9h 50m)
          14. [C++][Parquet] Create randomized nested data generation round trip read/write unit tests (Sub-task, Open, Unassigned)
          15. [C++][Parquet] Add support for schema translation from parquet nodes back to arrow for missing types (Sub-task, Resolved, Micah Kornfield; time spent 3h 20m)
          16. [C++][Parquet] Implement non-vectorized array reconstruction logic. (Sub-task, Open, Unassigned)
          17. [C++][Parquet] Add EngineVersion to properties to allow for toggling new vs old logic (Sub-task, Resolved, Micah Kornfield; time spent 1h 50m)
          18. [Python][Parquet] Expose EngineVersion in python arrow reader properties (Sub-task, Resolved, Unassigned)
          19. [C++][Parquet] Create nested reading benchmarks (Sub-task, Resolved, Antoine Pitrou; time spent 5.5h)
          20. [C++] Add Parquet-Arrow roundtrip tests for nested data (Sub-task, Resolved, Antoine Pitrou; time spent 1h 10m)
          21. [C++] Investigate performance of LevelsToBitmap without BMI2 (Sub-task, Resolved, Antoine Pitrou; time spent 2h)
          22. [C++][Parquet] Create reading benchmarks for 2-level nested data (Sub-task, Resolved, Antoine Pitrou; time spent 1h 20m)

          People

            Assignee: Micah Kornfield (emkornfield@gmail.com)
            Reporter: DB Tsai (dbtsai)
            Votes: 42
            Watchers: 62


          Time Tracking

            Estimated: Not Specified
            Remaining: 0h
            Logged: 67h 50m