  1. ORC
  2. ORC-626

Reading Struct Column Having Multiple Fields With Same Name Causes java.io.EOFException


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.11, 1.6.4, 1.7.0
    • Component/s: None
    • Labels: None

    Description

      Steps To Repro In Hive:

      set hive.fetch.task.conversion=none;
      set orc.force.positional.evolution=true;
      
      create table complex_orc(device struct<a:string,a:string,b:string>) stored as orc;
      insert into complex_orc select named_struct("a","123","a","823","b","23");
      select * from complex_orc;
      

      Fails with the following exception:

      Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind LENGTH position: 6 length: 6 range: 0 offset: 16 limit: 16 range 0 = 0 to 6 uncompressed: 3 to 3
      	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
      	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
      	at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:369)
      	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1299)
      	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:1336)
      	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:1434)
      	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1280)
      	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:1836)
      	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1818)
      	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1149)
      

      This is caused by ORC-54, which changed schema evolution to match struct fields by name rather than by index. Setting orc.force.positional.evolution forces positional schema evolution, but the positional level is hardcoded to 1 (for non-ACID). Even though it doesn't make sense to have multiple fields with the same name in a struct, this breaks backward compatibility with Hive 1.2 / Hive 2.1.
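      To illustrate why duplicate field names are a problem for name-based matching, here is a minimal, self-contained sketch. It is not ORC's actual code: the class FieldMatchingSketch and the methods matchByName and matchByPosition are hypothetical names, and the lookup map is only an assumption about how a name-based resolver might behave.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the ORC code base.
class FieldMatchingSketch {

    // Name-based matching: each reader field is resolved through a name -> index map.
    // With duplicate names, the later file field overwrites the earlier one in the map.
    static int[] matchByName(List<String> fileFields, List<String> readerFields) {
        Map<String, Integer> byName = new HashMap<>();
        for (int i = 0; i < fileFields.size(); i++) {
            byName.put(fileFields.get(i), i);
        }
        int[] mapping = new int[readerFields.size()];
        for (int i = 0; i < readerFields.size(); i++) {
            mapping[i] = byName.getOrDefault(readerFields.get(i), -1);
        }
        return mapping;
    }

    // Positional matching: reader field i is resolved to file field i (or -1 if absent).
    static int[] matchByPosition(List<String> fileFields, List<String> readerFields) {
        int[] mapping = new int[readerFields.size()];
        for (int i = 0; i < readerFields.size(); i++) {
            mapping[i] = i < fileFields.size() ? i : -1;
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Field names of struct<a:string,a:string,b:string> from the repro.
        List<String> fields = Arrays.asList("a", "a", "b");
        System.out.println(Arrays.toString(matchByName(fields, fields)));     // [1, 1, 2]
        System.out.println(Arrays.toString(matchByPosition(fields, fields))); // [0, 1, 2]
    }
}
```

      In this sketch, both reader fields named a resolve to the same file field under name-based matching, so one file field is never read through its own reader. Positional matching keeps a one-to-one mapping.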

      omalley Can you please share the idea behind setting the positional level to 1? Is it really required when orc.force.positional.evolution is set? I mean, can't we just do positional schema evolution for all levels when orc.force.positional.evolution is set?
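      The question about the positional level can be made concrete with a small model. The sketch below assumes that a positional level of N means the first N levels of nested structs are matched by position and deeper levels fall back to matching by name. That reading, the Node class, and the resolve methods are all assumptions for illustration and are not taken from the ORC source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a struct schema tree; not an ORC class.
class PositionalLevelSketch {

    static final class Node {
        final String name;
        final List<Node> children;
        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Records, for every reader field, which file child index it resolved to.
    // Assumption: levels with positionalLevels > 0 match by position, others by name.
    static void resolve(Node file, Node reader, int positionalLevels,
                        String path, List<String> out) {
        Map<String, Integer> byName = new HashMap<>();
        for (int i = 0; i < file.children.size(); i++) {
            byName.put(file.children.get(i).name, i);
        }
        for (int i = 0; i < reader.children.size(); i++) {
            Node readerChild = reader.children.get(i);
            int fileIndex;
            if (positionalLevels > 0) {
                fileIndex = i < file.children.size() ? i : -1;
            } else {
                fileIndex = byName.getOrDefault(readerChild.name, -1);
            }
            String childPath = path + "/" + readerChild.name + "#" + i;
            out.add(childPath + "->" + fileIndex);
            if (fileIndex >= 0) {
                resolve(file.children.get(fileIndex), readerChild,
                        Math.max(positionalLevels - 1, 0), childPath, out);
            }
        }
    }

    // Resolves a schema against itself, as in the repro where nothing has evolved.
    static List<String> resolve(Node schema, int positionalLevels) {
        List<String> out = new ArrayList<>();
        resolve(schema, schema, positionalLevels, "", out);
        return out;
    }

    // Schema of the repro table: one column, device struct<a,a,b>.
    static Node repro() {
        return new Node("row",
            new Node("device", new Node("a"), new Node("a"), new Node("b")));
    }

    public static void main(String[] args) {
        // Level 1: only the top-level columns are positional.
        System.out.println(resolve(repro(), 1));
        // "All levels": every nested struct is positional.
        System.out.println(resolve(repro(), Integer.MAX_VALUE));
    }
}
```

      Under these assumptions, level 1 resolves the device column by position, but its fields are still matched by name, so both a fields map to the same file field. Applying positional matching at every level gives each field its own file field.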

            People

              Assignee: srahman Syed Shameerur Rahman
              Reporter: srahman Syed Shameerur Rahman