Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.15.0
- None
- None
Description
A field of type list-of-ints cannot be read with the pyarrow.parquet.read_table function. The read also fails with other field types (strings, for example).
This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is not observed.
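Because the failure is reported for one release only, a caller could check the installed version before reading Spark-written Parquet. The following is a minimal sketch that relies solely on the two data points given in this report (0.15.0 fails, 0.14.1 works). The names `parse_version`, `REPORTED_AFFECTED`, and `is_reported_affected` are hypothetical, and in practice the version string would come from `pyarrow.__version__`.

```python
def parse_version(version):
    """Turn a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Only 0.15.0 is reported as failing in this issue; this set makes no claim
# about any other release.
REPORTED_AFFECTED = {(0, 15, 0)}


def is_reported_affected(version):
    """Return True if the given version is the one reported as failing here."""
    return parse_version(version) in REPORTED_AFFECTED


print(is_reported_affected("0.15.0"))  # True
print(is_reported_affected("0.14.1"))  # False
```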
pyspark version: 2.4.4
Attachment: test_bad_parquet.tgz
Minimal snippet to reproduce the issue:
import pyarrow.parquet as pq
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, Row

output_url = '/tmp/test_bad_parquet'

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField('int_fixed_size_list', ArrayType(IntegerType(), False), False)])
rows = [Row(int_fixed_size_list=[1, 2, 3])]
dataframe = spark.createDataFrame(rows, schema).write.mode('overwrite').parquet(output_url)

pq.read_table(output_url)
I get an error:
Traceback (most recent call last):
  File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in <module>
    pq.read_table(output_url)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1281, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1137, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 605, in read
    table = reader.read(**options)
  File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int32 not null> is inconsistent with schema list<element: int32 not null>

Process finished with exit code 1
A Parquet store generated by the snippet is attached.
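To see exactly where the two types in the ArrowInvalid message disagree, the message can be split and the list child field names compared. This is a rough sketch using only the standard library. The regular expressions are written for this one message shape and are an assumption, not a pyarrow API, and the helper names `list_child_name` and `compare_types` are hypothetical.

```python
import re

# Error text copied from the traceback in this report.
MESSAGE = (
    "Column data for field 0 with type list<item: int32 not null> "
    "is inconsistent with schema list<element: int32 not null>"
)

# Assumed message shape: "... with type <A> is inconsistent with schema <B>".
PATTERN = re.compile(
    r"with type (?P<data>.+?) is inconsistent with schema (?P<schema>.+)$"
)


def list_child_name(type_str):
    """Return the child field name of a 'list<name: type>' string, or None."""
    match = re.match(r"list<(\w+):", type_str)
    return match.group(1) if match else None


def compare_types(message):
    """Extract both type strings and their list child field names."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("unrecognized message format")
    data_type = match.group("data")
    schema_type = match.group("schema")
    return {
        "data_type": data_type,
        "schema_type": schema_type,
        "data_child": list_child_name(data_type),
        "schema_child": list_child_name(schema_type),
    }


result = compare_types(MESSAGE)
print(result["data_child"], result["schema_child"])  # item element
```

In this message the two types agree on the element type and nullability (`int32 not null`); the only difference is the name of the list child field, `item` in the column data versus `element` in the schema.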
Attachments
Issue Links
- is duplicated by
- ARROW-6844 [C++][Parquet][Python] List<scalar type> columns read broken with 0.15.0 (Resolved)