Hive / HIVE-6382

PATCHED_BLOB encoding in ORC will corrupt data in some cases

    Details

      Description

      In PATCHED_BLOB encoding (added in HIVE-4123), gapVsPatchList is an array of long in which each entry stores the gap (g) between patched values together with the patch value (p). The gap can be as large as 511 and is encoded in 8 bits, while a patch value can take more than 56 bits. When a patch value takes more than 56 bits, p and g together need more than 64 bits and cannot be packed into a single long. This results in data corruption whenever patch values are wider than 56 bits.

      The stack trace looks like:

      Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
      at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.preparePatchedBlob(RunLengthIntegerWriterV2.java:593)
      at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.determineEncoding(RunLengthIntegerWriterV2.java:541)
      at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerWriterV2.write(RunLengthIntegerWriterV2.java:746)
      at org.apache.hadoop.hive.ql.io.orc.WriterImpl$IntegerTreeWriter.write(WriterImpl.java:744)
      at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1320)
      at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:1849)
      at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:75)
      at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:638)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
      at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
      at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
      at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:249)
      ... 7 more
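
The packing problem described above can be illustrated with a minimal sketch. It assumes the gap is stored in the bits directly above the patch value inside one long; that layout, the class name PatchPackingSketch, and all method names are hypothetical and are not taken from RunLengthIntegerWriterV2. With a 56-bit patch the 8-bit gap round-trips; with a 60-bit patch the upper gap bits are shifted out and the decoded gap is wrong.

```java
// Illustrative only: not the Hive implementation.
final class PatchPackingSketch {
    // Gap width taken from the issue description (8 bits).
    static final int GAP_BITS = 8;

    // Assumed layout: gap in the high bits, patch in the low patchWidth bits.
    static long pack(long gap, long patch, int patchWidth) {
        return (gap << patchWidth) | patch;
    }

    // Recover the gap with an unsigned shift so the sign bit is not smeared.
    static long unpackGap(long packed, int patchWidth) {
        return packed >>> patchWidth;
    }

    // Recover the patch by masking the low patchWidth bits.
    static long unpackPatch(long packed, int patchWidth) {
        long mask = (patchWidth >= Long.SIZE) ? -1L : (1L << patchWidth) - 1;
        return packed & mask;
    }

    // True when gap and patch together fit in a 64-bit long.
    static boolean fitsInLong(int patchWidth) {
        return patchWidth + GAP_BITS <= Long.SIZE;
    }

    public static void main(String[] args) {
        long gap = 200L;

        // 56-bit patch: 56 + 8 = 64 bits, the gap survives.
        long p56 = (1L << 56) - 1;
        long packed56 = pack(gap, p56, 56);
        System.out.println("width 56, fits=" + fitsInLong(56)
                + ", gap=" + unpackGap(packed56, 56));

        // 60-bit patch: 60 + 8 = 68 bits, the high gap bits are lost.
        long p60 = (1L << 60) - 1;
        long packed60 = pack(gap, p60, 60);
        System.out.println("width 60, fits=" + fitsInLong(60)
                + ", gap=" + unpackGap(packed60, 60));
    }
}
```

A check such as fitsInLong before choosing the encoding is one way a writer could avoid the lossy case; whether the attached patches take that approach is not stated in this issue.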
      
      1. HIVE-6382.6.patch
        61 kB
        Prasanth Jayachandran
      2. HIVE-6382.5.patch
        33 kB
        Prasanth Jayachandran
      3. HIVE-6382.4.patch
        33 kB
        Prasanth Jayachandran
      4. HIVE-6382.3.patch
        33 kB
        Prasanth Jayachandran
      5. HIVE-6382.2.patch
        13 kB
        Prasanth Jayachandran
      6. HIVE-6382.1.patch
        12 kB
        Prasanth Jayachandran

          Activity

          Prasanth Jayachandran created issue -
          Prasanth Jayachandran made changes -
          Field: Summary
          Original Value: PATCHED_BLOB encoding in ORC will corrupt the data in some cases
          New Value: PATCHED_BLOB encoding in ORC will corrupt data in some cases
          Prasanth Jayachandran made changes -
          Attachment HIVE-6382.1.patch [ 12627290 ]
          Prasanth Jayachandran made changes -
          Link This issue is blocked by HIVE-6347 [ HIVE-6347 ]
          Prasanth Jayachandran made changes -
          Link This issue is related to HIVE-6369 [ HIVE-6369 ]
          Prasanth Jayachandran made changes -
          Remote Link This issue links to "Review Board (Web Link)" [ 14061 ]
          Prasanth Jayachandran made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-6382.2.patch [ 12627461 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-6382.3.patch [ 12629886 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-6382.4.patch [ 12629934 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-6382.5.patch [ 12629940 ]
          Prasanth Jayachandran made changes -
          Attachment HIVE-6382.6.patch [ 12629962 ]
          Gunther Hagleitner made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.13.0 [ 12324986 ]
          Resolution Fixed [ 1 ]
          Prasanth Jayachandran made changes -
          Link This issue is related to HIVE-5970 [ HIVE-5970 ]

            People

            • Assignee: Prasanth Jayachandran
            • Reporter: Prasanth Jayachandran
            • Votes: 0
            • Watchers: 4
