ORC / ORC-1205

Size of batches in some ConvertTreeReaders should be ensured before using


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.14, 1.7.5
    • Fix Version/s: 1.8.0, 1.7.6
    • Component/s: None
    • Labels: None

    Description

      Given this ORC file:

      Rows: 57
      Compression: ZLIB
      Compression size: 262144
      Calendar: Julian/Gregorian
      Type: struct<_col0:timestamp,_col1:string,_col2:int,_col3:int,_col4:string,_col5:float,_col6:timestamp>
      
      Stripe Statistics:
        Stripe 1:
          Column 0: count: 27 hasNull: false
          Column 1: count: 27 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
          Column 2: count: 27 hasNull: false min: I max: I sum: 27
          Column 3: count: 27 hasNull: false min: 19752356 max: 20524679 sum: 551077013
          Column 4: count: 27 hasNull: false min: 34 max: 154 sum: 2568
          Column 5: count: 27 hasNull: false min:  max: 692 sum: 29
          Column 6: count: 27 hasNull: false min: -99988.0 max: 0.0 sum: -2299724.0
          Column 7: count: 27 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
        Stripe 2:
          Column 0: count: 30 hasNull: false
          Column 1: count: 30 hasNull: false min: 2019-02-22 10:39:52.0 max: 2019-02-22 10:39:52.0
          Column 2: count: 30 hasNull: false min: I max: I sum: 30
          Column 3: count: 30 hasNull: false min: 19752356 max: 20524679 sum: 611106400
          Column 4: count: 30 hasNull: false min: 34 max: 154 sum: 2923
          Column 5: count: 30 hasNull: false min:  max: 692 sum: 21
          Column 6: count: 30 hasNull: false min: -99988.0 max: 0.0 sum: -2699676.0
          Column 7: count: 30 hasNull: false min: 1899-12-30 06:00:00.0 max: 1899-12-30 06:00:00.0
      ...
      

      Reading this file produces one batch of 27 rows and then another of 30 rows, matching the two stripes.

      On the second batch we get:

      Caused by: java.lang.ArrayIndexOutOfBoundsException: 27
      	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:306)
      	at org.apache.orc.impl.TreeReaderFactory$FloatTreeReader.nextVector(TreeReaderFactory.java:690)
      	at org.apache.orc.impl.ConvertTreeReaderFactory$DecimalFromDoubleTreeReader.nextVector(ConvertTreeReaderFactory.java:867)
      	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2047)
      	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1219)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:88)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:104)
      	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:265)
      	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:241)
      	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:589)
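
The failure mode can be illustrated with a minimal, self-contained sketch. It does not use the real ORC classes: `ScratchVector`, `fill`, and `firstFailingIndex` are hypothetical names, and the sketch assumes the intermediate vector is allocated once with the size of the first batch, which would be consistent with the index 27 reported in the exception.

```java
public class UndersizedScratchVectorDemo {

    /** Hypothetical stand-in for a double column vector; not the real ORC class. */
    static final class ScratchVector {
        final double[] vector;

        ScratchVector(int size) {
            vector = new double[size];
        }
    }

    /** Writes batchSize values, the way a file reader fills a vector. */
    static void fill(ScratchVector v, int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            v.vector[i] = i; // throws once i reaches vector.length
        }
    }

    /**
     * Reads the batches in order while reusing one scratch vector that is
     * never resized. Returns the first out-of-bounds index, or -1 if every
     * batch fits.
     */
    static int firstFailingIndex(int... batchSizes) {
        ScratchVector scratch = null;
        for (int batchSize : batchSizes) {
            if (scratch == null) {
                // Assumption: sized once, from the first batch only.
                scratch = new ScratchVector(batchSize);
            }
            try {
                fill(scratch, batchSize);
            } catch (ArrayIndexOutOfBoundsException e) {
                return scratch.vector.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstFailingIndex(27, 30)); // prints 27
    }
}
```

With the batch order reversed (30, then 27) the sketch does not fail, which suggests why the problem only shows up when a later batch is larger than the first one.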
      

      The exception is thrown from here (ignore the line numbers in the stack trace above; they belong to another distribution):
      https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L388

      I fixed this problem by adding another ensureSize call here:
      https://github.com/apache/orc/blob/d41c3a678307f10d3cc8799abb5d55e9922115a8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L901

      doubleColVector.ensureSize(batchSize, false);
      

      In general, ConvertTreeReader instances use multiple vector variables (because of the conversion), but only one of them has its size ensured while reading:
      https://github.com/apache/orc/blob/b5945001f670a5a44250e76aea1ea704bfd0e29d/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L2046
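
A rough sketch of the pattern described above, with the extra ensure call in place. The classes here are hypothetical stand-ins rather than the ORC implementation: `DoubleVector` only mirrors the shape of an `ensureSize(int, boolean)` method, `ConvertingReader` plays the role of a converting reader with its own intermediate vector, and the doubling step is an arbitrary placeholder for the real conversion.

```java
public class EnsureScratchVectorDemo {

    /** Hypothetical vector with an ensureSize(size, preserveData) method. */
    static final class DoubleVector {
        double[] vector;

        DoubleVector(int size) {
            vector = new double[size];
        }

        /** Grows the backing array when it is too small for the batch. */
        void ensureSize(int size, boolean preserveData) {
            if (size > vector.length) {
                double[] old = vector;
                vector = new double[size];
                if (preserveData) {
                    System.arraycopy(old, 0, vector, 0, old.length);
                }
            }
        }
    }

    /** Hypothetical converting reader that owns an intermediate vector. */
    static final class ConvertingReader {
        private DoubleVector scratch;

        void nextVector(DoubleVector result, int batchSize) {
            if (scratch == null) {
                scratch = new DoubleVector(batchSize);
            }
            // The added call: the intermediate vector is resized per batch,
            // not only the result vector that the caller already ensured.
            scratch.ensureSize(batchSize, false);

            for (int i = 0; i < batchSize; i++) {
                scratch.vector[i] = i; // stands in for reading from the file
            }
            for (int i = 0; i < batchSize; i++) {
                result.vector[i] = scratch.vector[i] * 2; // placeholder conversion
            }
        }
    }

    /** Reads the batches in order and returns the final result array. */
    static double[] readBatches(int... batchSizes) {
        ConvertingReader reader = new ConvertingReader();
        DoubleVector result = new DoubleVector(0);
        for (int batchSize : batchSizes) {
            result.ensureSize(batchSize, false); // caller ensures only the result
            reader.nextVector(result, batchSize);
        }
        return result.vector;
    }

    public static void main(String[] args) {
        System.out.println(readBatches(27, 30).length); // prints 30
    }
}
```

Passing `false` for `preserveData` is enough in this sketch because the intermediate vector is fully overwritten on every batch.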

      I've set 1.6.9 as the affected version because I can reproduce the issue on hive/master, which currently depends on ORC 1.6.9.

      On the main branch I haven't seen the corresponding ensure call; I still need to check what has changed since branch-1.6.

      Attachments

        1. 000000_0.orc
          2 kB
          László Bodor

            People

              Assignee: abstractdog László Bodor
              Reporter: abstractdog László Bodor
              Votes: 0
              Watchers: 4
