HBase / HBASE-2432

enhance hbase.util.Bytes.toBytes() with length limit

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: util
    • Labels:
      None

      Description

      The following stack trace is seen in our hadoop log:

      java.lang.IllegalArgumentException: Row > 32767
      at org.apache.hadoop.hbase.KeyValue.createByteArray(KeyValue.java:437)
      at org.apache.hadoop.hbase.KeyValue.<init>(KeyValue.java:405)
      at org.apache.hadoop.hbase.KeyValue.<init>(KeyValue.java:374)
      at org.apache.hadoop.hbase.KeyValue.<init>(KeyValue.java:353)
      at org.apache.hadoop.hbase.client.Put.add(Put.java:137)
      at org.apache.hadoop.hbase.client.Put.add(Put.java:108)
      at org.apache.nutch.scoring.webgraph.ScoreUpdater$ScoreUpdaterReducer.reduce(ScoreUpdater.java:170)
      at org.apache.nutch.scoring.webgraph.ScoreUpdater$ScoreUpdaterReducer.reduce(ScoreUpdater.java:127)
      at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
      at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:563)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
      at org.apache.hadoop.mapred.Child.main(Child.java:170)

      Bytes.toBytes(Float.valueOf(score).toString()) may return an array longer than 32767 bytes.

      We should enhance Bytes.toBytes() to accept a length limit:
      public static byte[] toBytes(String s, int length)

      String.getBytes() has no length limit.
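
As a rough illustration of the problem described above, the sketch below encodes a string and rejects it before it reaches Put.add() if it exceeds the 32767-byte limit shown in the stack trace. The class and method names (RowLengthCheck, toRowBytes) are hypothetical and are not part of HBase; the only assumption taken from the issue is the limit itself, which equals Short.MAX_VALUE.

```java
import java.nio.charset.StandardCharsets;

final class RowLengthCheck {
    // 32767 is the limit reported in the stack trace; it equals Short.MAX_VALUE.
    static final int MAX_ROW_LENGTH = Short.MAX_VALUE;

    /** Encodes s as UTF-8 and fails early if the result exceeds the row limit. */
    static byte[] toRowBytes(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > MAX_ROW_LENGTH) {
            throw new IllegalArgumentException(
                "Row is " + bytes.length + " bytes; limit is " + MAX_ROW_LENGTH);
        }
        return bytes;
    }
}
```

Failing at the call site like this keeps the data intact and points at the oversized value, whereas truncation silently changes the key.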

        Activity

        Ted Yu added a comment -

        The motivation was the row size limit of 32767 bytes.

        The initial patch targeted a String that represents a Float, so it is not suitable for strings that must be restored to arbitrary data types. For multibyte character strings, I checked
        StringCoding.encode(String charsetName, char[] ca, int off, int len), which resorts to StringEncoder, a private class.

        So more work is needed for multibyte character strings.
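
One possible way to handle multibyte strings without touching the private StringEncoder is the public java.nio.charset.CharsetEncoder API, encoding into a buffer capped at the byte limit. The sketch below is only an illustration under that assumption; the names BoundedUtf8 and encodeUpTo are hypothetical and were not part of any patch on this issue.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BoundedUtf8 {
    /** Encodes as many whole characters of s as fit into maxBytes of UTF-8. */
    static byte[] encodeUpTo(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        // When the next character does not fit, encode() stops and reports
        // OVERFLOW; the bytes already written end on a character boundary.
        encoder.encode(CharBuffer.wrap(s), out, true);
        // The UTF-8 encoder keeps no pending output, so no flush() is needed here.
        return Arrays.copyOf(out.array(), out.position());
    }
}
```

For example, encodeUpTo("h\u00e9llo", 2) keeps only "h", because the two-byte "é" would not fit in the remaining byte.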

        stack added a comment -

        What are you trying to achieve, Ted? You want to make a row that is < 32k? (You realize that the maximum row size is 32k in HBase?)

        Regarding your patch, you are cutting the byte array at an arbitrary point. Do you want to do that? What if your cut happens in the middle of a multibyte character? You'll have difficulty making a String of it on the other side (presuming you want to do such a thing, which seems likely given your source is a String).
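
To make the concern about cutting inside a multibyte character concrete, here is a minimal sketch that trims an already encoded UTF-8 array and backs up to the nearest character boundary. It relies on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx and assumes the input is valid UTF-8. Utf8Truncate and truncate are hypothetical names, not HBase API.

```java
import java.util.Arrays;

final class Utf8Truncate {
    /** Returns at most maxBytes bytes of utf8 without splitting a character. */
    static byte[] truncate(byte[] utf8, int maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        if (utf8.length <= maxBytes) {
            return utf8;
        }
        int end = maxBytes;
        // If the first dropped byte is a continuation byte (10xxxxxx), the cut
        // is inside a character; step back until it lands on a lead byte.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        return Arrays.copyOf(utf8, end);
    }
}
```

The result can be shorter than maxBytes by up to three bytes, but it always decodes back to a prefix of the original string.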

        Ted Yu added a comment -

        Added a test in TestBytes for the new method.

        Ted Yu added a comment -

        Added
        public static byte[] toBytes(String s, int len) {

        stack added a comment -

        Please attach a patch Ted. Include a unit test and it'll make it more likely your patch will be committed. Thanks.

        Ted Yu added a comment -

        String.format() can be used to limit the length of the String representation of a float or double.
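
A small sketch of that idea follows, assuming a fixed-precision scientific format is acceptable for the score. The class name ScoreFormat and the choice of "%.6e" are illustrative assumptions, not something specified in this issue.

```java
import java.util.Locale;

final class ScoreFormat {
    /**
     * Formats a score with six digits after the decimal point in scientific
     * notation, so the text length stays small for any finite float.
     */
    static String format(float score) {
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        return String.format(Locale.ROOT, "%.6e", score);
    }
}
```

With this format, ScoreFormat.format(0.5f) yields "5.000000e-01", and the extreme finite float values stay within 13 characters.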

        Ted Yu added a comment -

        Here is one implementation:

        /**
         * Converts a string to a UTF-8 byte array with limited length.
         * The cut is made at a byte offset, so it can fall inside a multibyte character.
         * @param s the string
         * @param len the length limit, in bytes
         * @return the byte array, truncated to at most len bytes
         */
        public static byte[] toBytes(String s, int len) {
          if (s == null) {
            throw new IllegalArgumentException("string cannot be null");
          }
          if (len <= 0) {
            throw new IllegalArgumentException("length limit should be positive");
          }
          byte[] ary;
          try {
            ary = s.getBytes(HConstants.UTF8_ENCODING);
          } catch (UnsupportedEncodingException e) {
            // Fail here instead of hitting a NullPointerException on ary below.
            throw new IllegalStateException(e);
          }
          if (ary.length <= len) {
            return ary;
          }
          byte[] result = new byte[len];
          System.arraycopy(ary, 0, result, 0, len);
          return result;
        }


          People

          • Assignee: Unassigned
          • Reporter: Ted Yu
          • Votes: 0
          • Watchers: 1
