  1. Hive
  2. HIVE-25494

Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2
    • Fix Version/s: None
    • Component/s: Parquet

    Description

      When a field of a struct type column is present in the table schema but missing from the Parquet file schema, and columns are accessed by name, the requestedSchema that Hive sends to the Parquet storage layer still contains a type for the missing field, because Hive always adds a primitive type for any field that is missing from the file schema (Ref: code). On the Parquet side, the missing field is pruned, and because it belongs to a struct, the result is a GroupColumnIO without any children. The query then fails with an IndexOutOfBoundsException; the stack trace is given below.

       

      Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
       at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
       at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
       at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
       at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
       at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
       at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
       at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
       ... 15 more
      Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
       at java.util.ArrayList.rangeCheck(ArrayList.java:657)
       at java.util.ArrayList.get(ArrayList.java:433)
       at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
       at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
       at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
       at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
       at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
       at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
       at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
       at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
       at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
       at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214) 

       

      Steps to reproduce:

       

      CREATE TABLE parquet_struct_test(
      `parent` struct<child:string,extracol:string> COMMENT '',
      `toplevel` string COMMENT '')
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
       
      -- Use the attached test-struct.parquet data file to load data to this table
      
      LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
      
      hive> select parent.extracol, toplevel from parquet_struct_test;
      OK
      Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
      

       

      Expected Result:  NULL toplevel

       

      The same query works in the following scenarios:

      1) Accessing parquet file columns by index instead of names

      hive> set parquet.column.index.access=true;
      hive>  select parent.extracol, toplevel from parquet_struct_test;
      OK
      NULL toplevel

       

      2) When VectorizedParquetRecordReader is used

      hive> set hive.fetch.task.conversion=none;
      hive> select parent.extracol, toplevel from parquet_struct_test;
      Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
      Total jobs = 1
      Launching Job 1 out of 1
      Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
      ----------------------------------------------------------------------------------------------
              VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
      ----------------------------------------------------------------------------------------------
      Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
      ----------------------------------------------------------------------------------------------
      VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
      ----------------------------------------------------------------------------------------------
      OK
      NULL toplevel

       

      3) Create a copy of the same table and run the same query on the newly created table. 

      hive> create table parquet_struct_test_copy like parquet_struct_test;
      OK
      hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
      Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
      Total jobs = 1
      Launching Job 1 out of 1
      Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
      ----------------------------------------------------------------------------------------------
              VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
      ----------------------------------------------------------------------------------------------
      Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
      ----------------------------------------------------------------------------------------------
      VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
      ----------------------------------------------------------------------------------------------
      Loading data to table default.parquet_struct_test_copy
      OK
      hive> select parent.extracol, toplevel from parquet_struct_test_copy;
      OK
      NULL toplevel

       

      Also, the issue does not occur when only the missing struct field is selected, or when all the fields in the table are selected. It occurs only when the missing struct field is selected together with another, existing column.
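
      The following is a toy model of the failure described above, not the actual Hive or parquet-mr code. It assumes that the file contains parent.child and toplevel, that requested fields absent from the file are dropped while their enclosing group is kept, and that every remaining leaf asks the root for its first leaf when the reader is set up. The names Primitive, Group, build, open_reader and fails are hypothetical and only loosely mirror PrimitiveColumnIO, GroupColumnIO and RecordReaderImplementation. Under these assumptions the model fails for the mixed projection only, which is consistent with the observations in this report.

```python
class Primitive:
    """Leaf column, loosely modelled on PrimitiveColumnIO."""
    def __init__(self, name):
        self.name = name
        self.parent = None

    def get_first(self):
        return self

    def is_first(self):
        # Simplification: always ask the root for its first leaf.
        root = self
        while root.parent is not None:
            root = root.parent
        return root.get_first() is self


class Group:
    """Struct or message node, loosely modelled on GroupColumnIO."""
    def __init__(self, name, children):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def get_first(self):
        # Raises IndexError when the group has no children.
        return self.children[0].get_first()

    def leaves(self):
        for child in self.children:
            if isinstance(child, Group):
                yield from child.leaves()
            else:
                yield child


# Assumed layout of the data file: the struct has no 'extracol' field.
FILE_SCHEMA = {"parent": {"child": "string"}, "toplevel": "string"}


def build(name, requested, file_schema):
    """Build the column tree, pruning requested fields missing from the file.

    A group is kept even if all of its children were pruned.
    """
    children = []
    for field, sub in requested.items():
        if field not in file_schema:
            continue  # field missing in the file schema: pruned
        if isinstance(sub, dict):
            children.append(build(field, sub, file_schema[field]))
        else:
            children.append(Primitive(field))
    return Group(name, children)


def open_reader(root):
    """Mimic the per-leaf 'is this the first column?' check at reader setup."""
    return [leaf.name for leaf in root.leaves() if leaf.is_first()]


def fails(requested):
    """Return True if setting up the reader raises IndexError."""
    try:
        open_reader(build("message", requested, FILE_SCHEMA))
        return False
    except IndexError:
        return True


mixed = {"parent": {"extracol": "string"}, "toplevel": "string"}
only_missing = {"parent": {"extracol": "string"}}
all_fields = {"parent": {"child": "string", "extracol": "string"},
              "toplevel": "string"}

for label, projection in [("mixed", mixed),
                         ("only missing field", only_missing),
                         ("all fields", all_fields)]:
    print(label, "fails:", fails(projection))
```

      In this model the mixed projection fails because the toplevel leaf reaches the childless parent group while looking for the first leaf. With only the missing field selected there are no leaves to check, and with all fields selected the parent group still has the child leaf.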

       

      Attachments

        1. test-struct.parquet
          0.7 kB
          Ganesha Shreedhara

          People

            Assignee: Unassigned
            Reporter: Ganesha Shreedhara (ganeshas)