Apache Arrow / ARROW-1644

[C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

Details

    Description

      We have many nested Parquet files generated by Apache Spark for ranking problems, and we would like to load them in Python for other programs to consume.

      The schema looks like this:

      root
       |-- profile_id: long (nullable = true)
       |-- country_iso_code: string (nullable = true)
       |-- items: array (nullable = false)
       |    |-- element: struct (containsNull = false)
       |    |    |-- show_title_id: integer (nullable = true)
       |    |    |-- duration: double (nullable = true)
      

      When I tried to load one of these files with a nightly build of pyarrow on Oct 4, 2017, I got the following error:

      Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
      [GCC 7.2.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import numpy as np
      >>> import pandas as pd
      >>> import pyarrow as pa
      >>> import pyarrow.parquet as pq
      >>> table2 = pq.read_table('part-00000')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
          use_pandas_metadata=use_pandas_metadata)
        File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
          nthreads=nthreads)
        File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
        File "error.pxi", line 85, in pyarrow.lib.check_status
      pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
      

      I was under the impression that once https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able to load nested Parquet data in pyarrow.

      Any insight into this?

      Thanks.

            People

              emkornfield@gmail.com Micah Kornfield
              dbtsai DB Tsai
              Votes: 42
              Watchers: 46

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 67h 50m