[HIVE-25494] Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.2
Fix Version/s: None
Component/s: Parquet
Labels:
- schema-evolution

Description

When a field of a struct-type column is missing from the Parquet file schema but present in the table schema, and columns are accessed by name, the requestedSchema that Hive sends to the Parquet storage layer still contains a type for the missing field, because Hive always adds a primitive type for a field that is missing from the file schema (Ref: code). On the Parquet side, the missing field gets pruned, and because it belongs to a struct type, the result is a GroupColumnIO without any children. The query then fails with an IndexOutOfBoundsException; the stack trace is given below.

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
 at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
 ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657)
 at java.util.ArrayList.get(ArrayList.java:433)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
 at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
 at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
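
The failure mode described above can be illustrated with a toy model. This is only a sketch in Python, not the Parquet or Hive implementation: Leaf, Group and build are hypothetical names, the schemas are written as nested dicts, and the file schema (parent.child plus toplevel) is an assumption based on the table definition in this report. It shows how pruning a missing struct field while keeping the struct itself leaves a group with no children, so that a "first child" lookup fails on an empty list, much like the two GroupColumnIO.getFirst frames in the trace.

```python
# Toy model only: Leaf, Group and build are hypothetical names, not Parquet classes.

class Leaf:
    """Stand-in for a primitive column node."""

    def __init__(self, name):
        self.name = name

    def get_first(self):
        return self


class Group:
    """Stand-in for a group (struct or message) node."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def get_first(self):
        # Descends into the first child; raises IndexError if there is none.
        return self.children[0].get_first()


def build(name, requested, in_file):
    """Build the node tree for `requested`, dropping fields absent from `in_file`.

    Schemas are nested dicts; a value of None marks a primitive field.
    A group is kept even if every one of its fields was dropped.
    """
    if requested is None:
        return Leaf(name)
    kept = [
        build(field, sub, in_file[field] or {})
        for field, sub in requested.items()
        if field in in_file
    ]
    return Group(name, kept)


# Assumed file schema: parent has only `child`; the table schema adds `extracol`.
FILE_SCHEMA = {"parent": {"child": None}, "toplevel": None}
# Fields requested by: select parent.extracol, toplevel
REQUESTED = {"parent": {"extracol": None}, "toplevel": None}

if __name__ == "__main__":
    root = build("message", REQUESTED, FILE_SCHEMA)
    try:
        root.get_first()
    except IndexError as exc:
        print("empty group reached:", exc)
```

In this model, requesting parent.child instead of parent.extracol produces a non-empty parent group, and get_first() returns the child leaf without error.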

Steps to reproduce:

CREATE TABLE parquet_struct_test(
`parent` struct<child:string,extracol:string> COMMENT '',
`toplevel` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
 
-- Use the attached test-struct.parquet data file to load data to this table

LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;

hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet

Expected Result: NULL toplevel

The same query works in the following scenarios:

1) Accessing parquet file columns by index instead of names

hive> set parquet.column.index.access=true;
hive> select parent.extracol, toplevel from parquet_struct_test;
OK
NULL toplevel

2) When VectorizedParquetRecordReader is used

hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
----------------------------------------------------------------------------------------------
OK
NULL toplevel

3) Create a copy of the same table and run the same query on the newly created table.

hive> create table parquet_struct_test_copy like parquet_struct_test;
OK
hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
----------------------------------------------------------------------------------------------
Loading data to table default.parquet_struct_test_copy
OK
hive> select parent.extracol, toplevel from parquet_struct_test_copy;
OK
NULL toplevel

Also, this issue does not occur when only the missing struct field is selected, or when all fields in the table are selected. It occurs only when the missing struct field is selected together with another existing column.
