  1. ORC
  2. ORC-616

In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte


Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 2.0.0
    • 1.4.6, 1.5.10, 1.6.3
    • Java

    Description

      In Patched Base encoding, the top three bits of headerThirdByte store the base value width in bytes, minus one. If Math.abs(min) is greater than or equal to 1L << 56, baseBytes is 9, so bb = (9 - 1) << 5 = 256, which is beyond the range of a byte.

      final boolean isNegative = min < 0 ? true : false;
      if (isNegative) {
        min = -min;
      }
      // find the number of bytes required for base and shift it by 5 bits
      // to accommodate patch width. The additional bit is used to store the sign
      // of the base value.
      final int baseWidth = utils.findClosestNumBits(min) + 1;
      final int baseBytes = baseWidth % 8 == 0 ? baseWidth / 8 : (baseWidth / 8) + 1;
      final int bb = (baseBytes - 1) << 5;
      
      // if the base value is negative then set MSB to 1
      if (isNegative) {
        min |= (1L << ((baseBytes * 8) - 1));
      }
      
      // third byte contains 3 bits for number of bytes occupied by base
      // and 5 bits for patchWidth
      final int headerThirdByte = bb | utils.encodeBitWidth(patchWidth);
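
The width arithmetic in the excerpt above can be tried in isolation. This is a minimal sketch, not the ORC code: closestNumBits is a hypothetical stand-in for utils.findClosestNumBits, and its rounding table (up to 24 bits kept as is, then 26/28/30/32, then 40/48/56, then 64) is an assumption chosen to match the behaviour described in this issue.

```java
// Illustrative sketch only; closestNumBits is a hypothetical stand-in for
// utils.findClosestNumBits and is an assumption, not the ORC implementation.
class PatchedBaseHeaderDemo {

    // Assumed rounding: widths up to 24 are kept, then rounded up to
    // 26/28/30/32, then to 40/48/56, and anything wider becomes 64.
    static int closestNumBits(long nonNegativeValue) {
        int bits = 64 - Long.numberOfLeadingZeros(nonNegativeValue);
        if (bits == 0) return 1;
        if (bits <= 24) return bits;
        if (bits <= 32) return (bits + 1) / 2 * 2;
        if (bits <= 56) return (bits + 7) / 8 * 8;
        return 64;
    }

    // Same arithmetic as the writer excerpt: one extra bit for the sign.
    static int baseBytes(long absMin) {
        int baseWidth = closestNumBits(absMin) + 1;
        return baseWidth % 8 == 0 ? baseWidth / 8 : (baseWidth / 8) + 1;
    }

    // The value OR-ed into headerThirdByte above the patch width bits.
    static int bb(long absMin) {
        return (baseBytes(absMin) - 1) << 5;
    }

    public static void main(String[] args) {
        long justBelow = (1L << 56) - 1;
        long atLimit = 1L << 56;
        System.out.println(baseBytes(justBelow) + " bytes, bb = " + bb(justBelow));
        System.out.println(baseBytes(atLimit) + " bytes, bb = " + bb(atLimit));
        // Only the low eight bits survive when the header byte is written.
        System.out.println("low 8 bits of bb at the limit = " + (bb(atLimit) & 0xff));
    }
}
```

Under the assumed rounding, (1L << 56) - 1 needs 8 base bytes and gives bb = 224, while 1L << 56 needs 9 base bytes and gives bb = 256, whose low eight bits are all zero.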
      

      Only the eight low-order bits of headerThirdByte are written, so RunLengthIntegerReaderV2 reads back an incorrect base width and the decoded column data is wrong.

      // extract the number of bytes occupied by base
      int thirdByte = input.read();
      int bw = (thirdByte >>> 5) & 0x07;
      // base width is one off
      bw += 1;
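
The mismatch between writer and reader can be sketched as a round trip through a single byte. This is a hedged illustration rather than ORC code: the class and method names are hypothetical, and encodedPatchWidth is an arbitrary 5-bit value standing in for the result of utils.encodeBitWidth(patchWidth).

```java
// Illustrative sketch only; names are hypothetical and not part of ORC.
class ThirdByteRoundTrip {

    // Writer side: (baseBytes - 1) goes above the 5-bit patch width code.
    static int headerThirdByte(int baseBytes, int encodedPatchWidth) {
        return ((baseBytes - 1) << 5) | encodedPatchWidth;
    }

    // Only the eight low-order bits of the int reach the stream.
    static int asWrittenByte(int header) {
        return header & 0xff;
    }

    // Reader side, mirroring the excerpt: three bits, then add one.
    static int decodeBaseBytes(int thirdByte) {
        int bw = (thirdByte >>> 5) & 0x07;
        return bw + 1;
    }

    public static void main(String[] args) {
        int encodedPatchWidth = 0x0b; // arbitrary 5-bit placeholder (assumption)
        for (int baseBytes = 8; baseBytes <= 9; baseBytes++) {
            int header = headerThirdByte(baseBytes, encodedPatchWidth);
            int decoded = decodeBaseBytes(asWrittenByte(header));
            System.out.println("written baseBytes = " + baseBytes
                + ", header = " + header
                + ", decoded baseBytes = " + decoded);
        }
    }
}
```

A three-bit field can describe at most 8 base bytes, so in this sketch a base of 9 bytes comes back as 1 and the reader then consumes the wrong number of bytes for the base value.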
      

      In some cases, RunLengthIntegerReaderV2 fails with an EOFException.

      Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 2 kind DATA position: 3213835 length: 3213835 range: 0 offset: 3217373 limit: 3217373 range 0 = 0 to 3213835 uncompressed: 184478 to 184478
              at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
              at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
              at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:369)
              at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:587)
              at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
              at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
              ... 20 more
      

      For example, consider the following sequence:

      long[] data = {-9007199254740992L, -8725724278030337L, -1125762467889153L, -1L, -9007199254740992L, -9007199254740992L, -497L, 127L, -1L, -72057594037927936L, -4194304L, -9007199254740992L, -4503599593816065L, -4194304L, -8936830510563329L, -9007199254740992L, -1L, -70334384439312L, -4063233L, -6755399441973249L};
      

      The min value is -72057594037927936, that is -(1L << 56). RLEv2 writes this sequence with Patched Base encoding, and the data read back by RunLengthIntegerReaderV2 is:

      [281474976710656, 36275087623585792, 247390116249599, 72053196528287743, 72057594037927935, 72022409665839104, 246290604621824, -71776119061217282, 4222124650659840, 36028797018963967, 71776119061217280, 281474976694272, 246290604621824, 263882790797311, 72057594037911552, 246565482528767, 72022409665839104, 281474976710655, 72057319294238719, 67835469387252223]
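
The statement about the minimum of the sample sequence can be checked directly. This is a small sketch with hypothetical class and method names. It only inspects the input data and does not call any ORC writer or reader.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and not part of ORC.
class SampleMinCheck {

    static final long[] DATA = {
        -9007199254740992L, -8725724278030337L, -1125762467889153L, -1L,
        -9007199254740992L, -9007199254740992L, -497L, 127L, -1L,
        -72057594037927936L, -4194304L, -9007199254740992L,
        -4503599593816065L, -4194304L, -8936830510563329L,
        -9007199254740992L, -1L, -70334384439312L, -4063233L,
        -6755399441973249L
    };

    static long min(long[] values) {
        return Arrays.stream(values).min().getAsLong();
    }

    // True when the magnitude of the base reaches 1L << 56, the threshold
    // named in this issue. Long.MIN_VALUE is handled separately because
    // Math.abs(Long.MIN_VALUE) is still negative.
    static boolean reachesThreshold(long min) {
        return min == Long.MIN_VALUE || Math.abs(min) >= (1L << 56);
    }

    public static void main(String[] args) {
        long min = min(DATA);
        System.out.println("min = " + min);
        System.out.println("min == -(1L << 56): " + (min == -(1L << 56)));
        System.out.println("reaches threshold: " + reachesThreshold(min));
        // With int operands the shift distance is taken modulo 32,
        // so -1 << 56 is not the same value as -(1L << 56).
        System.out.println("int -1 << 56 = " + (-1 << 56));
    }
}
```

The long literal matters when reproducing this: with int operands Java masks the shift distance to five bits, so the threshold has to be written as 1L << 56.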
      

            People

              zortsou Ruochen Zou
              zortsou Ruochen Zou