Apache Arrow / ARROW-6844

[C++][Parquet][Python] List<scalar type> columns read broken with 0.15.0

Details

    Description

      Columns of type `array<primitive type>` (such as `array<int32>` or `array<int64>`) can no longer be read with pyarrow == 0.15.0 when the Parquet file was originally written by parquet-mr 1.9.1. The same files were readable with pyarrow == 0.14.1.

      import pyarrow.parquet as pq
      
      pf = pq.ParquetFile('sample.gz.parquet')
      
      print(pf.read(columns=['profile_ids']))
      

      Output with 0.14.1:

      pyarrow.Table
      profile_ids: list<element: int64>
       child 0, element: int64
      
      ...
      

      Output with 0.15.0:

      Traceback (most recent call last):
       File "<string>", line 1, in <module>
       File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
       use_threads=use_threads)
       File "pyarrow/_parquet.pyx", line 1131, in pyarrow._parquet.ParquetReader.read_all
       File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int64> is inconsistent with schema list<element: int64>
      

      I tested Parquet files from multiple tables (with various schemas) created with `parquet-mr`, and could not read any `array<primitive type>` column in any of them.

       

      I think the bug was introduced with [this commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5].

      I think the root cause is that `parquet-mr` names the inner list field `"element"` by default (see here), whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example this test). Round-trip tests that both write and read with pyarrow will not catch this, because both sides use the same name.
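
To help tell this failure apart from a genuine type mismatch, here is a minimal sketch that parses the `ArrowInvalid` message quoted in the traceback above and checks whether the two list types differ only in the child field name. The helper `is_list_child_name_mismatch` and its regular expression are hypothetical, not part of pyarrow, and they assume the message format shown in this report with a non-nested value type.

```python
import re

# Hypothetical triage helper, NOT a pyarrow API. It only inspects the text of
# the error message quoted in this report and assumes a flat (non-nested)
# list value type such as int32 or int64.
_MSG = re.compile(
    r"with type list<(?P<got_name>\w+): (?P<got_type>[^>]+)> "
    r"is inconsistent with schema list<(?P<want_name>\w+): (?P<want_type>[^>]+)>"
)


def is_list_child_name_mismatch(message):
    """Return True if the two list types differ only in the child field name."""
    match = _MSG.search(message)
    if match is None:
        # Message does not have the expected shape.
        return False
    return (
        match["got_type"] == match["want_type"]
        and match["got_name"] != match["want_name"]
    )


error_text = (
    "Column data for field 0 with type list<item: int64> "
    "is inconsistent with schema list<element: int64>"
)
print(is_list_child_name_mismatch(error_text))
```

For the message in this report the value types are both `int64` and only the names `item` and `element` differ, which is consistent with the naming hypothesis above.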

       

       

      Attachments

        1. dbg_sample2.gz.parquet
          0.8 kB
          Benoit Rostykus
        2. dbg_sample.gz.parquet
          0.8 kB
          Benoit Rostykus

            People

              wesm Wes McKinney
              brostykus Benoit Rostykus

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m