  1. Hive
  2. HIVE-18216

When Text is corrupted, processInput() hangs indefinitely


Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.3.2
    • None
    • None
    • None

    Description

      When the Text is corrupted, the following loop becomes infinite.
      This is because in hadoop.io.Text.bytesToCodePoint(), when extraBytesToRead == -1, the position of the ByteBuffer is not advanced, so ByteBuffer.remaining() is always > 0 and hasRemaining() never returns false.
      If deletionSet.contains(-1) is also true, the loop hits the continue statement on every iteration and never terminates.

        private String processInput(Text input) {
          StringBuilder resultBuilder = new StringBuilder();
          // Obtain the byte buffer from the input string so we can traverse it code point by code point
          ByteBuffer inputBytes = ByteBuffer.wrap(input.getBytes(), 0, input.getLength());
          // Traverse the byte buffer containing the input string one code point at a time
          while (inputBytes.hasRemaining()) {
            int inputCodePoint = Text.bytesToCodePoint(inputBytes);
            // If the code point exists in deletion set, no need to emit out anything for this code point.
            // Continue on to the next code point
            if (deletionSet.contains(inputCodePoint)) {
              continue;
            }
      
            Integer replacementCodePoint = replacementMap.get(inputCodePoint);
            // If a replacement exists for this code point, emit out the replacement and append it to the
            // output string. If no such replacement exists, emit out the original input code point
            char[] charArray = Character.toChars((replacementCodePoint != null) ? replacementCodePoint
                : inputCodePoint);
            resultBuilder.append(charArray);
          }
          String resultString = resultBuilder.toString();
          return resultString;
        }
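
One possible guard for the loop above, as a minimal sketch rather than a proposed patch: when the decoder reports -1, consume one byte explicitly so the loop always makes progress. The class GuardedLoopSketch and the method decodeOrMinusOne() are hypothetical. The latter is a simplified stand-in for Text.bytesToCodePoint() that treats every non-ASCII byte as malformed, and the deletion set and replacement map are left out for brevity.

```java
import java.nio.ByteBuffer;

public class GuardedLoopSketch {
    // Hypothetical, simplified stand-in for Text.bytesToCodePoint():
    // decodes ASCII only and returns -1 WITHOUT consuming the byte otherwise.
    static int decodeOrMinusOne(ByteBuffer bytes) {
        bytes.mark();
        byte b = bytes.get();
        if ((b & 0x80) != 0) {
            bytes.reset(); // position restored, nothing consumed
            return -1;
        }
        return b;
    }

    // Same loop shape as processInput(), with a progress guard added.
    static String process(byte[] data) {
        StringBuilder out = new StringBuilder();
        ByteBuffer in = ByteBuffer.wrap(data);
        while (in.hasRemaining()) {
            int codePoint = decodeOrMinusOne(in);
            if (codePoint < 0) {
                in.get(); // guard: skip the offending byte so the loop advances
                continue;
            }
            out.appendCodePoint(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x80 is a lone UTF-8 continuation byte between two valid characters.
        System.out.println(process(new byte[] {'a', (byte) 0x80, 'b'})); // prints "ab"
    }
}
```

Silently dropping the bad byte is only one policy. Throwing an exception or emitting U+FFFD would also end the loop, and may be preferable if corrupted input should not pass unnoticed.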
      

      Here is the hadoop.io.Text.bytesToCodePoint() function.

        public static int bytesToCodePoint(ByteBuffer bytes) {
          bytes.mark();
          byte b = bytes.get();
          bytes.reset();
          int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
          if (extraBytesToRead < 0) return -1; // trailing byte!
          int ch = 0;
      
          switch (extraBytesToRead) {
          case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
          case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
          case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
          case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
          case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
          case 0: ch += (bytes.get() & 0xFF);
          }
          ch -= offsetsFromUTF8[extraBytesToRead];
      
          return ch;
        }
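
The early return can be illustrated without Hadoop on the classpath. The sketch below mirrors only the mark()/get()/reset() prologue quoted above. StuckPositionDemo and positionAfterPeek() are hypothetical names, and the input assumes that a lone continuation byte such as 0x80 maps to a negative entry in bytesFromUTF8, as the "trailing byte!" comment suggests.

```java
import java.nio.ByteBuffer;

public class StuckPositionDemo {
    // Repeats the prologue of bytesToCodePoint() and reports where the
    // buffer position ends up if the method returns right after reset().
    static int positionAfterPeek(byte[] data) {
        ByteBuffer bytes = ByteBuffer.wrap(data);
        bytes.mark();  // remember position 0
        bytes.get();   // position moves to 1
        bytes.reset(); // position moves back to 0
        return bytes.position();
    }

    public static void main(String[] args) {
        byte[] corrupted = {(byte) 0x80}; // a lone UTF-8 continuation byte
        System.out.println(positionAfterPeek(corrupted)); // prints 0
    }
}
```

Because the position is still 0, a caller that loops on hasRemaining() sees exactly the same buffer state on the next call.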
      

          People

            Unassigned Unassigned
            dustinday John Doe