[HBASE-25357] allow specifying binary row key range to pre-split regions - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: spark
Labels: None

Description

Currently, the Spark HBase connector uses `String` values to specify regionStart and regionEnd, but row keys are often serialized binary data. I made a small patch at https://github.com/apache/hbase-connectors/pull/72/files that always interprets these `String` values as ISO_8859_1, so raw bytes can be put into a String object and recovered unchanged.
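
To illustrate why ISO_8859_1 is a safe carrier for raw bytes, here is a minimal sketch. The class `Latin1Keys` and its two methods are hypothetical names used only for illustration; they are not part of the connector or of the patch. It checks that every possible byte value survives the trip through a `String`, and that the same trip through UTF-8 does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of the connector) for packing raw bytes into option strings. */
final class Latin1Keys {
    private Latin1Keys() {}

    /** Maps each byte 0x00..0xFF to the char U+0000..U+00FF, one char per byte. */
    static String toOptionValue(byte[] rowKey) {
        return new String(rowKey, StandardCharsets.ISO_8859_1);
    }

    /** Inverse of toOptionValue; only lossless for chars in U+0000..U+00FF. */
    static byte[] fromOptionValue(String value) {
        return value.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build an array containing every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        // ISO_8859_1 maps bytes to chars one-to-one, so nothing is lost.
        byte[] viaLatin1 = fromOptionValue(toOptionValue(all));
        System.out.println("ISO_8859_1 round trip intact: " + Arrays.equals(all, viaLatin1));

        // UTF-8 decoding replaces invalid sequences (e.g. a lone 0x80) with U+FFFD,
        // so arbitrary binary data does not survive.
        byte[] viaUtf8 = new String(all, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip intact: " + Arrays.equals(all, viaUtf8));
    }
}
```

The first check should report `true` and the second `false`, which is the reason for picking ISO_8859_1 rather than UTF-8 as the charset for the option value.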

This has a drawback: if your row key really is a Unicode string containing characters outside the ISO_8859_1 charset, you must first convert it to UTF-8 encoded bytes and then wrap those bytes in an ISO_8859_1 string. This is a limitation of the Spark option interface, which only accepts a string-to-string map.

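import java.nio.charset.StandardCharsets;
import org.apache.spark.sql.SaveMode;

// Wrap the UTF-8 bytes of each key in an ISO_8859_1 string so they pass through the string-only option map.
// The start key must sort before the end key in byte order: "世界" (e4 b8 96 ...) precedes "你好" (e4 bd a0 ...).
df.write()
  .format("org.apache.hadoop.hbase.spark")
  .option(HBaseTableCatalog.tableCatalog(), catalog)
  .option(HBaseTableCatalog.newTable(), 5)
  .option(HBaseTableCatalog.regionStart(), new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
  .option(HBaseTableCatalog.regionEnd(), new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
  .mode(SaveMode.Append)
  .save();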
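
As a sanity check on what such option values carry, the sketch below wraps two Unicode keys the same way as the example, decodes them as ISO_8859_1 again, and compares the results as unsigned bytes, the order HBase uses for row keys. The class `SplitKeyCheck` and its helpers are hypothetical and exist only for this illustration. The string literals use Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical checker (not part of the connector) for wrapped split keys. */
final class SplitKeyCheck {
    private SplitKeyCheck() {}

    /** Wraps the UTF-8 bytes of a Unicode key in an ISO_8859_1 string. */
    static String wrapUtf8(String key) {
        return new String(key.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Recovers the raw bytes from an option value decoded as ISO_8859_1. */
    static byte[] unwrap(String optionValue) {
        return optionValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Lexicographic comparison treating each byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Lower-case hex dump without separators. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String niHao = "\u4f60\u597d";  // U+4F60 U+597D
        String shiJie = "\u4e16\u754c"; // U+4E16 U+754C

        byte[] niHaoKey = unwrap(wrapUtf8(niHao));
        byte[] shiJieKey = unwrap(wrapUtf8(shiJie));

        System.out.println(hex(niHaoKey));  // e4bda0e5a5bd
        System.out.println(hex(shiJieKey)); // e4b896e7958c
        System.out.println(compareUnsigned(shiJieKey, niHaoKey) < 0); // true
    }
}
```

Each wrapped string has six chars, one per UTF-8 byte, and unwrapping returns exactly the UTF-8 encoding of the original key. In unsigned byte order the 世界 key sorts before the 你好 key, which matters because the region start key is expected to sort before the region end key.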

Issue Links

links to GitHub Pull Request #72

People

Assignee: Unassigned
Reporter: Yubao Liu
Votes: 0
Watchers: 1

Dates

Created: 04/Dec/20 04:13
Updated: 31/Jan/22 19:32