Hadoop Common · HADOOP-14525

org.apache.hadoop.io.Text Truncate


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: io
    • Labels: None

    Description

      For Apache Hive, VARCHAR fields are much slower than STRING fields when a precision (a cap on string length) is specified. Keep in mind that this precision is the number of characters (code points) in the string, not the number of bytes in its UTF-8 encoding.

      The general procedure is:

      1. Load an entire byte buffer into a Text object
      2. Convert it to a String
      3. Count off the first N character code points
      4. Substring the String at the correct place
      5. Convert the String back into a byte array and populate the Text object
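      The steps above can be sketched as follows. This is an illustrative stand-in, not Hive's actual code; a plain byte[] represents the Text object's UTF-8 buffer, and the helper name is hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class TruncateViaString {
    // Steps 1-5 as done today: decode the UTF-8 bytes to a String, walk code
    // points, substring, then re-encode -- two full copies of the data.
    static byte[] truncate(byte[] utf8, int maxChars) {
        String s = new String(utf8, StandardCharsets.UTF_8);         // copy in
        int n = Math.min(maxChars, s.codePointCount(0, s.length()));
        int end = s.offsetByCodePoints(0, n);  // char index after n code points
        return s.substring(0, end).getBytes(StandardCharsets.UTF_8); // copy out
    }

    public static void main(String[] args) {
        byte[] in = "héllo \uD83D\uDE00 world".getBytes(StandardCharsets.UTF_8);
        // First 7 characters; the surrogate pair (emoji) counts as one.
        byte[] out = truncate(in, 7);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```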

      It would be great if the Text object could offer a truncate/substring method based on character count that did not require copying data around. Along the same lines, a "getCharacterLength()" method may also be useful to determine if the precision has been exceeded.
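      A copy-free truncate could instead walk the UTF-8 bytes directly: find the byte offset where the first N code points end, then simply shrink the Text object's internal length to that offset. A minimal sketch of the byte-walking logic, under the assumption that the buffer holds valid UTF-8 (helper names are hypothetical; a real patch would live inside Text):

```java
public class Utf8Walk {
    // What a copy-free truncate(maxChars) could compute: the byte offset at
    // which the first maxChars code points end. No decoding, no copies --
    // UTF-8 lead bytes encode each sequence's length, so whole characters
    // can be skipped without inspecting continuation bytes.
    static int byteLengthOfChars(byte[] utf8, int len, int maxChars) {
        int i = 0, chars = 0;
        while (i < len && chars < maxChars) {
            int b = utf8[i] & 0xFF;
            if (b < 0x80)      i += 1;  // 1-byte sequence (ASCII)
            else if (b < 0xE0) i += 2;  // 2-byte sequence
            else if (b < 0xF0) i += 3;  // 3-byte sequence
            else               i += 4;  // 4-byte sequence
            chars++;
        }
        return Math.min(i, len);
    }

    // Sketch of the suggested getCharacterLength(): count every byte that
    // starts a new code point, i.e. every non-continuation byte.
    static int characterLength(byte[] utf8, int len) {
        int chars = 0;
        for (int i = 0; i < len; i++) {
            if ((utf8[i] & 0xC0) != 0x80) chars++;
        }
        return chars;
    }
}
```

      With helpers like these, checking whether a VARCHAR precision is exceeded and enforcing it become single passes over the bytes with no allocation, versus the two full copies of the String round-trip.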


People

    • Assignee: Unassigned
    • Reporter: David Mollitor (belugabehr)
    • Votes: 0
    • Watchers: 1

Dates

    • Created:
    • Updated: