Description
Given this ORC file:
Rows: 57
Compression: ZLIB
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<_col0:timestamp,_col1:string,_col2:int,_col3:int,_col4:string,_col5:float,_col6:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 27 hasNull: false
    Column 1: count: 27 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
    Column 2: count: 27 hasNull: false min: I max: I sum: 27
    Column 3: count: 27 hasNull: false min: 19752356 max: 20524679 sum: 551077013
    Column 4: count: 27 hasNull: false min: 34 max: 154 sum: 2568
    Column 5: count: 27 hasNull: false min:  max: 692 sum: 29
    Column 6: count: 27 hasNull: false min: -99988.0 max: 0.0 sum: -2299724.0
    Column 7: count: 27 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
  Stripe 2:
    Column 0: count: 30 hasNull: false
    Column 1: count: 30 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
    Column 2: count: 30 hasNull: false min: I max: I sum: 30
    Column 3: count: 30 hasNull: false min: 19752356 max: 20524679 sum: 611106400
    Column 4: count: 30 hasNull: false min: 34 max: 154 sum: 2923
    Column 5: count: 30 hasNull: false min:  max: 692 sum: 21
    Column 6: count: 30 hasNull: false min: -99988.0 max: 0.0 sum: -2699676.0
    Column 7: count: 30 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
  ...
Reading this file produces a first batch of size 27 and then a second batch of size 30. On the second batch we get:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 27
    at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:306)
    at org.apache.orc.impl.TreeReaderFactory$FloatTreeReader.nextVector(TreeReaderFactory.java:690)
    at org.apache.orc.impl.ConvertTreeReaderFactory$DecimalFromDoubleTreeReader.nextVector(ConvertTreeReaderFactory.java:867)
    at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2047)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1219)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:88)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:104)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:265)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:241)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:589)
This is thrown from here (ignore the line numbers in the stack trace above; they belong to another distro):
https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L388
I fixed this problem by adding another ensure call here:
https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L901
doubleColVector.ensureSize(batchSize, false);
In general, ConvertTreeReader instances use multiple vector variables (because of the conversion), but while reading we only ensure the size of one of them:
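The failure mode can be illustrated with a self-contained sketch. This is not actual ORC code: SimpleVector and readBatch below are hypothetical stand-ins for ColumnVector and a converting reader's nextVector. The point is that only the conversion target is resized, so when the second batch is larger than the first, writing into the source vector's backing array overflows; an extra ensureSize on the source vector fixes it.

```java
import java.util.Arrays;

public class EnsureSizeDemo {
    // Stand-in for a ColumnVector with a fixed-capacity backing array.
    static class SimpleVector {
        double[] vector;
        SimpleVector(int size) { vector = new double[size]; }
        // Analogue of ColumnVector.ensureSize(size, preserveData): grow only if needed.
        void ensureSize(int size, boolean preserveData) {
            if (vector.length < size) {
                vector = preserveData ? Arrays.copyOf(vector, size) : new double[size];
            }
        }
    }

    // Simulates a converting reader: reads raw values into 'source',
    // converts them into 'target'. Only 'target' is always resized,
    // matching the one ensureSize call the real reader performs.
    static void readBatch(SimpleVector source, SimpleVector target,
                          int batchSize, boolean ensureSource) {
        target.ensureSize(batchSize, false);      // present in the reader
        if (ensureSource) {
            source.ensureSize(batchSize, false);  // the missing call the fix adds
        }
        for (int i = 0; i < batchSize; i++) {
            source.vector[i] = i;                 // overflows if source is too small
            target.vector[i] = source.vector[i] * 2;
        }
    }

    // Without the fix: batch of 27 succeeds, batch of 30 throws AIOOBE at index 27.
    static boolean secondBatchFails() {
        SimpleVector source = new SimpleVector(27);
        SimpleVector target = new SimpleVector(27);
        readBatch(source, target, 27, false);
        try {
            readBatch(source, target, 30, false);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    // With the fix: both batches succeed; last converted value is 29 * 2 = 58.0.
    static double fixedBatchLastValue() {
        SimpleVector source = new SimpleVector(27);
        SimpleVector target = new SimpleVector(27);
        readBatch(source, target, 27, true);
        readBatch(source, target, 30, true);
        return target.vector[29];
    }

    public static void main(String[] args) {
        System.out.println("second batch throws without fix: " + secondBatchFails());
        System.out.println("last value with fix: " + fixedBatchLastValue());
    }
}
```

The batch sizes 27 and 30 mirror the two stripes in the file above; the same mismatch occurs whenever a later batch is larger than the capacity the source vector was first allocated with.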
https://github.com/apache/orc/blob/b5945001f670a5a44250e76aea1ea704bfd0e29d/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L2046
I've set 1.6.9 as the affected version, as I'm able to reproduce this on hive/master, which currently depends on ORC 1.6.9.
On the main branch I haven't seen the corresponding ensure call; I need to check what changed since branch-1.6.
Attachments
Issue Links
- SPARK-39830: Add a test case to read ORC table that requires type promotion (Resolved)