  1. Apache Arrow
  2. ARROW-6849

[Python] can not read a parquet store containing a list of integers


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.15.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      A field of type list-of-ints cannot be read with the pyarrow.parquet.read_table function. The same failure was also observed with other field types (strings, for example).

      This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is not observed.

      pyspark version: 2.4.4

      Minimal snippet to reproduce the issue:

       

      import pyarrow.parquet as pq
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, Row
      
      output_url = '/tmp/test_bad_parquet'
      spark = SparkSession.builder.getOrCreate()
      
      schema = StructType([StructField('int_fixed_size_list', ArrayType(IntegerType(), False), False)])
      rows = [Row(int_fixed_size_list=[1, 2, 3])]
      spark.createDataFrame(rows, schema).write.mode('overwrite').parquet(output_url)
      
      pq.read_table(output_url)
      
      

      I get an error:

      Traceback (most recent call last):
        File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in <module>
          pq.read_table(output_url)
        File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1281, in read_table
          use_pandas_metadata=use_pandas_metadata)
        File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1137, in read
          use_pandas_metadata=use_pandas_metadata)
        File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 605, in read
          table = reader.read(**options)
        File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
          use_threads=use_threads)
        File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
        File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int32 not null> is inconsistent with schema list<element: int32 not null>
      Process finished with exit code 1
      
      

       

      Column data for field 0 with type list<item: int32 not null> is inconsistent with schema list<element: int32 not null>
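
Reading the message closely, the two type strings appear to differ only in the name of the list's child field (item in the column data, element in the schema), while the child type (int32 not null) is the same on both sides. The sketch below checks this using only the text of the error message and Python's standard library. The helper compare_list_types and its regular expression are hypothetical, written for this illustration; they are not part of pyarrow and say nothing about why 0.15.0 produces the mismatch.

```python
import re

# The message is copied verbatim from the traceback in this report.
MESSAGE = (
    "Column data for field 0 with type list<item: int32 not null> "
    "is inconsistent with schema list<element: int32 not null>"
)

# Hypothetical pattern (an assumption for this sketch): capture the child
# field name and child type of each list type mentioned in the message.
PATTERN = re.compile(
    r"with type list<(?P<data_name>\w+): (?P<data_type>[^>]+)> "
    r"is inconsistent with schema "
    r"list<(?P<schema_name>\w+): (?P<schema_type>[^>]+)>"
)


def compare_list_types(message):
    """Split both list types into (child name, child type) and compare them."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not have the expected shape")
    parts = match.groupdict()
    return {
        "child_name_differs": parts["data_name"] != parts["schema_name"],
        "child_type_differs": parts["data_type"] != parts["schema_type"],
        "data_child": (parts["data_name"], parts["data_type"]),
        "schema_child": (parts["schema_name"], parts["schema_type"]),
    }


result = compare_list_types(MESSAGE)
print(result)
```

For the message above, the helper reports that the child names differ and the child types do not, which suggests the check that fails is comparing the child field name rather than the stored values.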

       

      A Parquet store, as generated by the snippet, is attached.

      Attachments

        1. test_bad_parquet.tgz
          0.7 kB
          Yevgeni Litvin

            People

              Assignee: Unassigned
              Reporter: Yevgeni Litvin (selitvin)