Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 2.3.6, 3.1.2
Description
An ArrayIndexOutOfBoundsException is thrown while decoding the dictionary IDs of a row group in a Parquet file when vectorization is enabled.
Exception stack trace:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:122)
    at org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readString(ParquetDataColumnReaderFactory.java:95)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.decodeDictionaryIds(VectorizedPrimitiveColumnReader.java:467)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.readBatch(VectorizedPrimitiveColumnReader.java:68)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:410)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:353)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:92)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
    ... 24 more
This issue seems to be caused by reusing the same dictionary column vector while reading consecutive row groups. It looks like a corner-case bug that occurs for a certain distribution of dictionary-encoded and plain-encoded data when the underlying bit-packed dictionary data is read into a column-vector-based data structure.
A similar issue was reported in Spark (Ref: https://issues.apache.org/jira/browse/SPARK-16334).
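To make the suspected failure mode concrete, here is a minimal toy sketch in Java. It is not Hive's or Parquet's code: ToyDictionary, decodeBatch and the other names are hypothetical stand-ins, and the exact mechanism inside VectorizedPrimitiveColumnReader is an assumption based on the description above. It only shows how dictionary IDs left over from one row group can go out of bounds when decoded against the next row group's dictionary (here an empty one, matching the index 0 in the trace).

```java
// Toy model only. All names are hypothetical; none of this is Hive or Parquet API.
final class StaleDictionaryIdsSketch {

    /** Stand-in for the dictionary that belongs to a single row group. */
    static final class ToyDictionary {
        private final String[] values;

        ToyDictionary(String... values) {
            this.values = values;
        }

        /** Throws ArrayIndexOutOfBoundsException if id is not a valid index. */
        String decode(int id) {
            return values[id];
        }
    }

    /** Decodes the first count ids against the given dictionary. */
    static String[] decodeBatch(ToyDictionary dictionary, int[] ids, int count) {
        String[] out = new String[count];
        for (int i = 0; i < count; i++) {
            out[i] = dictionary.decode(ids[i]);
        }
        return out;
    }

    /** Assumed bug pattern: ids written for row group 1 are decoded against row group 2. */
    static boolean staleIdsFail() {
        int[] sharedIds = {0, 1, 2};                          // filled while reading row group 1
        decodeBatch(new ToyDictionary("a", "b", "c"), sharedIds, 3); // fine for row group 1

        ToyDictionary emptyDictionary = new ToyDictionary();  // row group 2: no dictionary entries
        try {
            decodeBatch(emptyDictionary, sharedIds, 3);       // stale ids, wrong dictionary
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;                                      // same exception type as in the report
        }
    }

    /** Safe pattern: ids are produced for the row group whose dictionary decodes them. */
    static String[] freshIdsDecode() {
        ToyDictionary dictionary = new ToyDictionary("x");
        int[] freshIds = {0, 0};                              // written for this row group only
        return decodeBatch(dictionary, freshIds, 2);
    }

    public static void main(String[] args) {
        System.out.println("stale ids fail: " + staleIdsFail());
        System.out.println("fresh ids decode: " + String.join(",", freshIdsDecode()));
    }
}
```

In this sketch the remedy is simply to never decode IDs against a dictionary other than the one they were written for, for example by resetting or reallocating the ID vector at each row-group boundary. Whether the actual fix takes that form is not stated in this report.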