Hive
HIVE-806

Hive with HBase as data store to support MapReduce and direct query

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: HBase Handler
    • Labels:
    • Tags:
      Data Store, Indexing

      Description

      Currently, Hive uses only HDFS as the underlying data store; it can query and analyze files in HDFS via MapReduce.
      But in some engineering cases, our data are stored/organized/indexed in HBase or other data stores. This JIRA issue will extend Hive to use HBase as a data store. Besides supporting MapReduce over HBase, we will support direct queries on HBase.

      This is a sibling JIRA issue of HIVE-705 (Let Hive can analyse hbase's tables, https://issues.apache.org/jira/browse/HIVE-705). Because this implementation and its use cases differ somewhat from HIVE-705, this issue was created to avoid confusion. It may be possible to combine the two issues in the future.

      Initial developers: Kula Liao, Stephen Xie, Tao Jiang and Schubert Zhang.

        Issue Links

          Activity

          Zheng Shao added a comment -

          Any update on this?

          Schubert Zhang added a comment -

          @Zheng, we are designing and coding now, and we had a talk with Samuel a few days ago. Because this work is part of one of our ongoing projects, I am sorry the updates will not be quick.
          I describe some of our considerations below, and will update when we complete our implementation and verification.

          1. A new HBaseInputFormat.

          The current TableInputFormat always scans the whole HBase HTable, which is usually unnecessary, especially when we know one or more row ranges.
          A new HBaseInputFormat will be implemented to provide more parameters to control the behavior of the HTable scan, e.g.:
          (1) row ranges (one or more startRow/endRow pairs)
          (2) column list (sometimes we need not read all columns; HBase is a column-oriented store)
          (3) filter tree (predicate pushdown; filter rows/columns at the region server)
          (4) optionally, some computation could be done on the region server
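As a rough illustration of item (1), here is a minimal, self-contained Java sketch of the core membership test a row-range-aware input format would perform when restricting a scan to configured ranges. The class and method names are hypothetical, not part of any real HBase or Hive API; only the row-key ordering (unsigned lexicographic byte comparison) mirrors HBase's actual behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating row-range checks for a range-aware
// HBaseInputFormat; not an actual HBase or Hive class.
public class RowRanges {
    // A half-open range [startRow, endRow), as HBase scans use.
    public static final class Range {
        final byte[] start;
        final byte[] end;
        public Range(byte[] start, byte[] end) { this.start = start; this.end = end; }
    }

    // Unsigned, byte-wise lexicographic comparison (HBase row-key ordering).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True if rowKey falls in any configured range; a scan restricted to
    // these ranges would visit exactly the rows for which this holds.
    public static boolean inAnyRange(byte[] rowKey, List<Range> ranges) {
        for (Range r : ranges) {
            if (compare(rowKey, r.start) >= 0 && compare(rowKey, r.end) < 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
            new Range("a".getBytes(), "c".getBytes()),
            new Range("m".getBytes(), "p".getBytes()));
        System.out.println(inAnyRange("b1".getBytes(), ranges)); // inside [a, c)
        System.out.println(inAnyRange("z".getBytes(), ranges));  // outside both ranges
    }
}
```

In a real input format, each such range would typically become one or more InputSplits intersected with region boundaries, so mappers only touch the regions the ranges overlap.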

          2. SerDe

          We use a more flexible SerDe for engineering practice.
          (1) We will support the MAP data type to map to HBase's (sparse) column family:column qualifiers. This is a rigid mapping between the Hive table schema and the HTable schema, and it is sometimes not so effective for structured data.
          (2) We use a nested SerDe to implement the codec of the RowKey and Columns. Usually, the rowkey in an HTable is a combination of more than one Hive column. We also support storing a list of columns in an HTable column family without using HBase's column-qualifier feature; instead, the columns within a column family are self-encoded (e.g., with a comma delimiter).
          RowSerDe

          { RowKeySerDe, ColumnSerDe }

          Here is an example of the above SerDe design:

          CREATE TABLE t1(rowkey1 int, rowkey2 string, value1 string, value2 int, value3 bigint, value4 string)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'
          WITH SERDEPROPERTIES (

          "rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe", -- a built-in SerDe for the rowkey
          "rowkey.columns"="rowkey2,rowkey1", -- the rowkey in the HTable is a combination of two Hive columns
          "rowkey.column.lengths"="12,2", -- the lengths of the two Hive columns in the rowkey
          "rowkey.column.delimiter"=",", -- the delimiter in the rowkey (may be omitted if not defined)

          "column.families"="cf1:(value1,value2); cf2:(value3,value4)", -- two column families in the HTable; cf1 and cf2 each hold two columns
          "column.family.serde.class"="cf1:org.apache.hadoop.hive.serde2.hbase.SimpleColumnSerDe;
          cf2:org.apache.hadoop.hive.serde2.hbase.ColumnSerDe1", -- cf1 and cf2 can use different SerDes
          "column.family.cf1.delimiter"=","

          ) STORED AS HBASETABLE;

          (Note: we have completed and verified the above code.)
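To make the fixed-length rowkey composition in the SERDEPROPERTIES above concrete, here is a small self-contained Java sketch of encoding and decoding a rowkey from fields of declared lengths joined by a delimiter (mirroring rowkey.columns, rowkey.column.lengths, and rowkey.column.delimiter). The class is hypothetical and is not the proposed SimpleRowkeySerDe; the padding convention (right-pad with spaces) is an assumption for illustration.

```java
// Hypothetical sketch of the fixed-length rowkey codec implied by the
// rowkey.* SERDEPROPERTIES above; not the real SimpleRowkeySerDe.
public class RowkeyCodec {
    // Right-pad each field with spaces to its declared length, then join
    // with the delimiter, e.g. lengths {12, 2} and delimiter ','.
    public static String encode(String[] fields, int[] lengths, char delim) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(delim);
            sb.append(String.format("%-" + lengths[i] + "s", fields[i]));
        }
        return sb.toString();
    }

    // Split the rowkey back into trimmed fields using the declared lengths,
    // skipping one delimiter character between consecutive fields.
    public static String[] decode(String rowkey, int[] lengths, char delim) {
        String[] out = new String[lengths.length];
        int pos = 0;
        for (int i = 0; i < lengths.length; i++) {
            out[i] = rowkey.substring(pos, pos + lengths[i]).trim();
            pos += lengths[i] + 1; // field width plus the delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        int[] lengths = {12, 2};
        String key = encode(new String[]{"user-0042", "7"}, lengths, ',');
        System.out.println("[" + key + "]");         // padded key, 15 chars total
        String[] back = decode(key, lengths, ',');
        System.out.println(back[0] + "|" + back[1]); // fields recovered
    }
}
```

Fixed-length fields keep the composite rowkey's lexicographic order consistent with the order of the leading field, which matters because HBase sorts rows by key bytes.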

          We shall also support the rigid mapping (MAP) like HIVE-705, e.g.

          CREATE TABLE hbase_table_1(rowkey1 int, rowkey2 string, value1 string, value2 int, abcd MAP<string, string>)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'
          WITH SERDEPROPERTIES (

          "rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe",
          "rowkey.columns"="rowkey2,rowkey1",
          "rowkey.column.lengths"="12,2",
          "rowkey.column.delimiter"=",",

          "column.families"="cf1:(value1,value2); cf2:=abcd",
          "column.family.serde.class"="cf1:org.apache.hadoop.hive.serde2.hbase.SimpleColumnSerDe;
          cf2:org.apache.hadoop.hive.serde2.hbase.QualiferColumnSerDe",
          "column.family.cf1.delimiter"=","

          ) STORED AS HBASETABLE;

          3. To support direct query (scan or get) from HBase HTable

          Some straightforward queries targeting an HTable need not use MapReduce; we can directly scan or get rows from the HTable, since an HTable is a globally indexed store. We can use several HBase features to improve performance:
          (1) rowkey or rowkey ranges
          (2) column list
          (3) filter tree (predicate pushdown)
          (4) .....

          (Note: we have completed and verified the above code.)
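The decision of when a query can bypass MapReduce might be sketched as follows. This is an illustrative, self-contained Java fragment, not Hive internals; the enum and method names are invented for this example. It chooses a direct Get when equality predicates bind the full rowkey, a range Scan when they bind a key prefix, and falls back to a full scan (or a MapReduce job) otherwise.

```java
import java.util.Set;

// Hypothetical access-path chooser for "direct query" on an HTable;
// not part of Hive. Illustrates the Get / range-Scan / full-scan split.
public class DirectQueryPlanner {
    public enum Access { GET, RANGE_SCAN, FULL_SCAN }

    // rowkeyColumns: the Hive columns composing the rowkey, in order.
    // boundColumns: columns the WHERE clause binds with equality predicates.
    public static Access choose(String[] rowkeyColumns, Set<String> boundColumns) {
        int boundPrefix = 0;
        for (String col : rowkeyColumns) {
            if (!boundColumns.contains(col)) break;
            boundPrefix++;
        }
        if (boundPrefix == rowkeyColumns.length) return Access.GET; // full key known
        if (boundPrefix > 0) return Access.RANGE_SCAN;              // key prefix known
        return Access.FULL_SCAN;                                    // fall back (or MapReduce)
    }

    public static void main(String[] args) {
        String[] key = {"rowkey2", "rowkey1"};
        System.out.println(choose(key, Set.of("rowkey2", "rowkey1"))); // GET
        System.out.println(choose(key, Set.of("rowkey2")));            // RANGE_SCAN
        System.out.println(choose(key, Set.of("value1")));             // FULL_SCAN
    }
}
```

Only a *prefix* of the rowkey helps, because HBase rows are sorted by the whole key: binding rowkey1 alone (the second key component here) cannot narrow the scan range.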

          4. other...

          Namit Jain added a comment -

          Any updates on this?

          Namit Jain added a comment -

          Is someone working on this?

          He Yongqiang added a comment -

          I think Schubert is on vacation right now. Will try to contact him.

          Schubert Zhang added a comment -

          @yongqiang,
          I am on vacation now; I will try to contact someone to update it.

          Guangxian,

          Could you please do something about this issue to contribute it to Hive when
          you have time?

          Sent from my iPhone

          On 2009-11-19, at 16:17, "He Yongqiang (JIRA)" <jira@apache.org> wrote:

          John Sichi added a comment -

          Marking this one incomplete. If there's still interest in any of the material here, please create new JIRA issue(s) with the details.


            People

            • Assignee:
              Unassigned
              Reporter:
              Schubert Zhang
            • Votes:
              2
              Watchers:
              20

              Dates

              • Created:
                Updated:
                Resolved:

                Development