  1. HBase
  2. HBASE-25357

Allow specifying a binary row key range to pre-split regions


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spark
    • Labels: None

    Description

      Currently, the Spark HBase connector uses a `String` to specify regionStart and regionEnd, but row keys are often serialized binary data. I made a small patch at https://github.com/apache/hbase-connectors/pull/72/files that always treats the `String` as ISO_8859_1 encoded, so raw bytes can be put into the String object and retrieved unchanged.

      This has a drawback: if your row key is a Unicode string containing characters outside the ISO_8859_1 charset, you must first convert it to UTF-8 encoded bytes and then wrap those bytes in an ISO_8859_1 string. This is a limitation of the Spark option interface, which accepts only a string-to-string map.

      import java.nio.charset.StandardCharsets;
      import org.apache.spark.sql.SaveMode;

      df.write()
        .format("org.apache.hadoop.hbase.spark")
        .option(HBaseTableCatalog.tableCatalog(), catalog)
        // create a new table pre-split into 5 regions
        .option(HBaseTableCatalog.newTable(), 5)
        // Unicode keys: encode to UTF-8 bytes, then wrap the bytes in an ISO_8859_1 string
        .option(HBaseTableCatalog.regionStart(), new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
        .option(HBaseTableCatalog.regionEnd(), new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
        .mode(SaveMode.Append)
        .save();
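
The byte-preserving property that the patch relies on can be checked in isolation. The following is a minimal sketch, not code from the patch: the class `BinaryRowKeyOption` and its `encode`/`decode` helpers are hypothetical names, and `decode` only illustrates what the connector is assumed to do when it reads the option value back as ISO_8859_1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRowKeyOption {
    // Hypothetical helper: pack raw row-key bytes into a String.
    // ISO_8859_1 maps every byte 0x00..0xFF to exactly one char, so nothing is lost.
    public static String encode(byte[] rowKey) {
        return new String(rowKey, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: recover the raw bytes from the option value
    // (assumed to mirror what the patched connector does on its side).
    public static byte[] decode(String optionValue) {
        return optionValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A binary key containing bytes that are not valid UTF-8.
        byte[] binaryKey = {(byte) 0x00, (byte) 0x7F, (byte) 0x80, (byte) 0xFF};
        System.out.println(Arrays.equals(binaryKey, decode(encode(binaryKey)))); // true

        // For contrast: decoding the same bytes as UTF-8 replaces the invalid
        // bytes with U+FFFD, so the original key cannot be recovered.
        String lossy = new String(binaryKey, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(binaryKey, lossy.getBytes(StandardCharsets.UTF_8))); // false

        // Unicode keys: UTF-8 bytes wrapped in an ISO_8859_1 string also round-trip.
        byte[] utf8Key = "你好".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8Key, decode(encode(utf8Key)))); // true
    }
}
```

With helpers like these, a binary key could be passed as `.option(HBaseTableCatalog.regionStart(), BinaryRowKeyOption.encode(keyBytes))`, where `keyBytes` is whatever serialized row key the application produces.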
      

            People

              Assignee: Unassigned
              Reporter: Yubao Liu (liuyb@yahoo-inc.com)