Hadoop Common / HADOOP-14525

org.apache.hadoop.io.Text Truncate


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: io
    • Labels: None

      Description

      For Apache Hive, VARCHAR fields are much slower than STRING fields when a precision (a cap on the string length) is specified. Keep in mind that this precision is the number of UTF-8 characters (code points) in the string, not the number of bytes.

      The general procedure is (sketched in code after this list):

      1. Load an entire byte buffer into a Text object
      2. Convert it to a String
      3. Count N number of character code points
      4. Substring the String at the correct place
      5. Convert the String back into a byte array and populate the Text object
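
      A minimal sketch of that round trip, assuming a Hive-side helper (the name enforcePrecision is hypothetical, not existing Hive or Hadoop API). The String decode and the re-encode back to UTF-8 are exactly the copies this issue would like to avoid:

      {code:java}
      import java.nio.charset.StandardCharsets;

      import org.apache.hadoop.io.Text;

      public class VarcharPrecisionRoundTrip {

        // Hypothetical helper illustrating steps 2-5 above: decode, count code
        // points, cut, re-encode.
        public static void enforcePrecision(Text value, int maxChars) {
          String s = value.toString();                       // step 2: decode UTF-8 bytes to a String
          int totalChars = s.codePointCount(0, s.length());  // step 3: count code points
          if (totalChars <= maxChars) {
            return;                                          // already within precision
          }
          int end = s.offsetByCodePoints(0, maxChars);       // step 4: char index of the cut point
          byte[] truncated = s.substring(0, end).getBytes(StandardCharsets.UTF_8);
          value.set(truncated);                              // step 5: re-encode into the Text
        }
      }
      {code}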

      It would be great if the Text object could offer a truncate/substring method based on character count that did not require copying data around. Along the same lines, a "getCharacterLength()" method may also be useful to determine if the precision has been exceeded.
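
      A hedged sketch of what such helpers might look like, working directly on the UTF-8 bytes that Text already exposes through getBytes() and getLength(). The names characterLength and truncate are illustrative only, not existing Text API; a method added to Text itself could simply shorten the internal length field, while this external sketch has to fall back to set():

      {code:java}
      import org.apache.hadoop.io.Text;

      public final class TextTruncateSketch {

        // Hypothetical "getCharacterLength()": count code points by walking the
        // UTF-8 lead bytes of the backing array, without decoding to a String.
        public static int characterLength(Text t) {
          byte[] buf = t.getBytes();   // backing array, valid up to getLength()
          int len = t.getLength();
          int chars = 0;
          for (int i = 0; i < len; chars++) {
            i += utf8SequenceLength(buf[i]);
          }
          return chars;
        }

        // Hypothetical truncate: keep at most maxChars code points. The cut
        // point is found on the raw bytes; a real Text.truncate() could just
        // adjust the internal length and skip even the set() call below.
        public static void truncate(Text t, int maxChars) {
          byte[] buf = t.getBytes();
          int len = t.getLength();
          int i = 0;
          for (int chars = 0; chars < maxChars && i < len; chars++) {
            i += utf8SequenceLength(buf[i]);
          }
          if (i < len) {
            t.set(buf, 0, i);          // retain only the first i bytes
          }
        }

        // Byte length of the UTF-8 sequence starting at a well-formed lead byte.
        private static int utf8SequenceLength(byte lead) {
          int b = lead & 0xFF;
          if (b < 0x80) return 1;      // 0xxxxxxx, ASCII
          if (b < 0xE0) return 2;      // 110xxxxx
          if (b < 0xF0) return 3;      // 1110xxxx
          return 4;                    // 11110xxx
        }
      }
      {code}

      With something along these lines, a VARCHAR precision check becomes a single pass over the bytes already held by the Text object, instead of the five-step round trip above.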

    Attachments

    Issue Links

    Activity

    People

    • Assignee: Unassigned
    • Reporter: David Mollitor (belugabehr)

    Dates

    • Created:
    • Updated:
