HIVE-5922

In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: File Formats
    • Labels: None

      Description

      Two stack traces ...

      java.io.IOException: IO error in map input file hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/000004_0
      	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
      	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
      	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
      	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
      	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
      	at org.apache.hadoop.mapred.Child.main(Child.java:249)
      Caused by: java.io.IOException: java.io.IOException: Seek outside of data in compressed stream Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
      	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
      	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
      	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
      	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
      	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
      	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
      	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
      	... 9 more
      Caused by: java.io.IOException: Seek outside of data in compressed stream Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
      	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
      	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
      	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
      	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
      	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
      	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
      	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
      	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
      	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
      	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
      	... 13 more
      
      java.io.IOException: IO error in map input file hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/000095_0
      	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
      	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
      	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
      	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
      	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
      	at org.apache.hadoop.mapred.Child.main(Child.java:249)
      Caused by: java.io.IOException: java.lang.IllegalStateException: Can't read header at compressed stream Stream for column 9 kind DATA position: 20447466 length: 20958101 range: 6 offset: 1835029 limit: 1835029 range 0 = 0 to 524294;  range 1 = 1835029 to 2097176;  range 2 = 5242940 to 1835029;  range 3 = 8650851 to 1835029;  range 4 = 11796615 to 2097176;  range 5 = 15204526 to 2097176;  range 6 = 18612437 to 1835029 uncompressed: 262144 to 262144
      	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
      	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
      	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
      	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
      	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
      	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
      	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
      	... 9 more
      Caused by: java.lang.IllegalStateException: Can't read header at compressed stream Stream for column 9 kind DATA position: 20447466 length: 20958101 range: 6 offset: 1835029 limit: 1835029 range 0 = 0 to 524294;  range 1 = 1835029 to 2097176;  range 2 = 5242940 to 1835029;  range 3 = 8650851 to 1835029;  range 4 = 11796615 to 2097176;  range 5 = 15204526 to 2097176;  range 6 = 18612437 to 1835029 uncompressed: 262144 to 262144
      	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:195)
      	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
      	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
      	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
      	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
      	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
      	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
      	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
      	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
      	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
      	... 13 more
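      Both traces fail at the same boundary condition: the seek target equals the exclusive end of one of the buffered ranges. A minimal sketch of how such a range check can fail (a hypothetical simplification for illustration, not the actual Hive source):

          // Hypothetical simplification of the range search in
          // InStream.CompressedStream.seek; offsets[i] is the file offset of
          // buffered range i and bytes[i].remaining() is its length.
          void seek(long desired) throws IOException {
            for (int i = 0; i < bytes.length; ++i) {
              // The strict '<' rejects desired == offsets[i] + bytes[i].remaining(),
              // i.e. a seek landing exactly on the exclusive end of range i,
              // even though the next read could legitimately start there.
              if (offsets[i] <= desired
                  && desired < offsets[i] + bytes[i].remaining()) {
                currentRange = i;
                bytes[i].position((int) (desired - offsets[i]));
                return;
              }
            }
            throw new IOException("Seek outside of data in compressed stream " + this);
          }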
      

        Activity

        Puneet Gupta added a comment -

        From what I know, 0.12.0 does not have vectorization support, so that cannot be the issue. Also, this happens only on seeking while predicate push-down is enabled; normal iteration is fine.
        Puneet Gupta added a comment -

        Hi Prasanth,

        I am using the Hive binary from "hive-0.12.0-bin.tar.gz":
        http://apache.claz.org/hive/hive-0.12.0/

        I am using only the ORC file format part to store my data; it is not used along with Hive.
        Prasanth J added a comment -

        Hi Puneet,

        The issue might be related to https://issues.apache.org/jira/browse/HIVE-6320 or https://issues.apache.org/jira/browse/HIVE-6287, depending on whether you have enabled vectorization. Is this issue happening in trunk?
        Puneet Gupta added a comment -

        I got a similar exception (on seeking to row 9,103,258):
        java.io.IOException: Seek outside of data in compressed stream Stream for column 65 kind DATA position: 1572882 length: 2116178 range: 1 offset: 1048588 limit: 1048588 range 0 = 0 to 0; range 1 = 524294 to 1048588; range 2 = 1835029 to 262147 uncompressed: 1048588 to 1048588 to 1572882
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:277)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:153)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:197)
        at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
        at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:161)
        at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:54)
        at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.skip(RunLengthIntegerReaderV2.java:318)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.skipRows(RecordReaderImpl.java:427)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.skipRows(RecordReaderImpl.java:1181)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:2183)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.seekToRow(RecordReaderImpl.java:2284)

        Some observations:

        1. I used Snappy for compression.

        2. There are 75 columns in the file (mostly numbers: int, long, byte, and short, plus a few strings). The exception always happens for column 65, which is an int. If I remove this column from the include-column list, seek works fine.

        3. The issue happens only when seeking to a row using RecordReader.seekToRow(long). In this flow the RecordReader is created using Reader.rows(long, long, boolean[], SearchArgument, String[]). The SearchArgument uses an "IN" construct with 200 long values, which are the row numbers I want to retrieve (SearchArgument.FACTORY.newBuilder().startOr().in(colName, 200 long values).end().build()). The exception happens on seeking to row 9103258 (the file has about 13 million rows). I tried a SearchArgument with just one IN value of 9103258 and got the same exception. The problem can be reproduced for any row seek between 9103258 and 9103279; rows after this seem to work fine. (See the sketch after this list.)

        4. I see no exceptions if the RecordReader is created using Reader.rows(null) and the entire file is iterated using RecordReader.hasNext() and RecordReader.next().

        5. I see no exceptions if the RecordReader is created using Reader.rows(long, long, boolean[], SearchArgument, String[]) with the SearchArgument passed as null, and the required data (about 200 rows) is then retrieved using RecordReader.seekToRow(long) and RecordReader.next().

        6. The obvious workaround is not to use predicate push-down. In my case, since I know the row numbers to seek to, the performance penalty is not very drastic:
        Read/seek 167 rows in 3609 ms: existing usage with predicate push-down in ORC
        Read/seek 167 rows in 4626 ms: workaround without predicate/SearchArgument push-down
        Difference of 1017 ms, approximately 6 ms per row (around 80% of the values are fetched from different strides)
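        A minimal sketch of the failing path from observation 3, assuming the Hive 0.12 ORC API described there; the file path, column name, include-array layout, and length argument are placeholder assumptions, not from the report:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hive.ql.io.orc.OrcFile;
            import org.apache.hadoop.hive.ql.io.orc.Reader;
            import org.apache.hadoop.hive.ql.io.orc.RecordReader;
            import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;

            public class OrcSeekRepro {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Path path = new Path("/data/file.orc");            // placeholder path
                Reader reader = OrcFile.createReader(path.getFileSystem(conf), path);

                // Predicate push-down: a single IN value is enough to reproduce.
                SearchArgument sarg = SearchArgument.FACTORY.newBuilder()
                    .startOr()
                    .in("col65", 9103258L)                         // placeholder column name
                    .end()
                    .build();

                // Include the root struct plus the int column that fails
                // (assumes a flat schema: entry 0 is the root, entries 1-75 the columns).
                boolean[] include = new boolean[76];
                include[0] = true;
                include[65] = true;

                RecordReader rows = reader.rows(0, Long.MAX_VALUE, include, sarg, null);
                rows.seekToRow(9103258L);  // throws "Seek outside of data in compressed stream"
              }
            }

        The workaround in observations 5 and 6 is the same sequence with the SearchArgument passed as null before calling seekToRow.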

        Yin Huai added a comment -

        For the first trace, the desired position is 21496054 and the matching range is "range 2 = 20447466 to 1048588". For the second trace, the desired position is 20447466 and the matching range is "range 6 = 18612437 to 1835029".

        When I turned off predicate pushdown, or used predicate pushdown with uncompressed data, I did not see this problem.
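        Written out, both failures land exactly on the exclusive end of a range, which matches the condition in the issue title; a quick standalone check with the values copied from the traces:

            // Values copied from the two stack traces above.
            long firstDesired  = 21496054L;                // trace 1 seek target
            long secondDesired = 20447466L;                // trace 2 seek target
            assert firstDesired  == 20447466L + 1048588L;  // end of "range 2"
            assert secondDesired == 18612437L + 1835029L;  // end of "range 6"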

          People

          • Assignee: Unassigned
          • Reporter: Yin Huai
          • Votes: 0
          • Watchers: 3
