Hadoop Common / HADOOP-14525

org.apache.hadoop.io.Text Truncate


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: io
    • Labels: None

      Description

      In Apache Hive, VARCHAR fields are much slower than STRING fields because a precision (a cap on string length) must be enforced on every value. Keep in mind that this precision is measured in UTF-8 characters, not bytes.

      The general procedure is:

      1. Load an entire byte buffer into a Text object
      2. Convert it to a String
      3. Count off the first N character code points
      4. Take the substring of the String at that code-point boundary
      5. Convert the substring back into a byte array and repopulate the Text object
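      The round-trip above can be sketched as follows (a plain byte[] stands in for the Text object's backing buffer; the method name is hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class VarcharTruncate {
    // Sketch of the five-step round-trip described above. Note the two
    // full copies of the data: bytes -> String, then String -> bytes.
    static byte[] truncateViaString(byte[] utf8, int maxChars) {
        // 2. Convert the bytes to a String (first copy)
        String s = new String(utf8, StandardCharsets.UTF_8);
        // 3. Count code points (not chars: a surrogate pair counts once)
        int codePoints = s.codePointCount(0, s.length());
        if (codePoints <= maxChars) {
            return utf8; // within the precision cap; nothing to do
        }
        // 4. Substring at the correct code-point boundary
        int endIndex = s.offsetByCodePoints(0, maxChars);
        String truncated = s.substring(0, endIndex);
        // 5. Convert back into a byte array (second copy)
        return truncated.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] out = truncateViaString(in, 3);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```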

      It would be great if the Text object could offer a truncate/substring method based on character count that did not require copying data around. Along the same lines, a "getCharacterLength()" method may also be useful for determining whether the precision has been exceeded.
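      A minimal sketch of how such methods could work internally (hypothetical helper names; Text does not currently expose them): because every UTF-8 lead byte encodes its sequence length, the byte offset of the N-th code point can be found by scanning the buffer, and a truncate would then only need to shrink the Text's length to that offset, with no decoding or copying.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Ops {
    // Byte offset of the n-th code point, assuming well-formed UTF-8
    // (pos always lands on a lead byte, never a continuation byte).
    static int offsetOfCodePoint(byte[] buf, int len, int n) {
        int pos = 0;
        while (n > 0 && pos < len) {
            int b = buf[pos] & 0xFF;
            if (b < 0x80)      pos += 1; // 1-byte (ASCII)
            else if (b < 0xE0) pos += 2; // 2-byte sequence
            else if (b < 0xF0) pos += 3; // 3-byte sequence
            else               pos += 4; // 4-byte sequence
            n--;
        }
        return Math.min(pos, len);
    }

    // Code-point count: count every byte that is not a
    // continuation byte (continuation bytes look like 10xxxxxx).
    static int characterLength(byte[] buf, int len) {
        int count = 0;
        for (int i = 0; i < len; i++) {
            if ((buf[i] & 0xC0) != 0x80) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] b = "héllo".getBytes(StandardCharsets.UTF_8); // 6 bytes, 5 chars
        System.out.println(characterLength(b, b.length));
        int cut = offsetOfCodePoint(b, b.length, 3);
        System.out.println(new String(b, 0, cut, StandardCharsets.UTF_8));
    }
}
```

      With this, a Text.truncate(n) would reduce to one linear scan over at most the first n code points, leaving the backing array untouched.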


    People

    • Assignee: Unassigned
    • Reporter: belugabehr (David Mollitor)
    • Votes: 0
    • Watchers: 1
