HIVE-705: Hive HBase Integration (umbrella)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: HBase Handler
    • Labels:
    • Hadoop Flags:
      Reviewed

      Description

      Add a SerDe over HBase tables, so that Hive can easily analyze the data stored in HBase.

      1. hbase-0.19.3.jar
        1.07 MB
        Sijie Guo
      2. hbase-0.19.3-test.jar
        1.31 MB
        Sijie Guo
      3. hbase-0.20.3.jar
        1.49 MB
        John Sichi
      4. hbase-0.20.3-test.jar
        1.90 MB
        John Sichi
      5. HIVE-705_draft.patch
        101 kB
        Sijie Guo
      6. HIVE-705_revision806905.patch
        136 kB
        Sijie Guo
      7. HIVE-705_revision883033.patch
        145 kB
        Sijie Guo
      8. HIVE-705.1.patch
        147 kB
        John Sichi
      9. HIVE-705.2.patch
        194 kB
        John Sichi
      10. HIVE-705.3.patch
        200 kB
        John Sichi
      11. HIVE-705.4.patch
        200 kB
        John Sichi
      12. HIVE-705.5.patch
        202 kB
        John Sichi
      13. HIVE-705.6.patch
        202 kB
        John Sichi
      14. HIVE-705.7.patch
        208 kB
        John Sichi
      15. zookeeper-3.2.2.jar
        894 kB
        John Sichi

          Activity

          He Yongqiang added a comment -

          Do we need to add a new SerDe for this? Can you add more detail in the description?

          Sijie Guo added a comment -

          I will add more detail about this issue later.

          Ashish Thusoo added a comment -

          Also, it would be great if you could comment on how you plan to map the HBase data model to the SQL data model (i.e. tables, columns, etc.).

          This will be a cool contribution....

          SerDe would be the right way to go...

          Thanks,
          Ashish

          Sijie Guo added a comment -

          The key problem in letting Hive analyze HBase tables is how to map the HBase data model to Hive's SQL data model.

          As we know, HBase data is accessed by <key, column_family:column_name, timestamp>, so a metadata mapping should be recorded in Hive's metadata, as below:

          -------------------------------------------------------
          hbase's tablename -> hive's tablename
          hbase's columns -> hive's columns
          hbase's key -> hive's first column
          hbase's timestamp -> hive's second column
          -------------------------------------------------------

          The key and timestamp of the HBase table will be mapped automatically to the first two default columns of the Hive table. So an HBase-backed Hive table will look like <.key, .timestamp, ..., other columns defined by users>.

          For example, an HBase table 'webpages' has columns <contents:page_content, anchors:>. There are two column families, "contents" and "anchors". The content of table 'webpages' is stored in the column 'contents:page_content', where the data is dense. The anchors of a given page vary from page to page, so the data in 'anchors:' will be sparse.
          The columns of the HBase table will be mapped manually by programmers: we can map a full column <column_family:column_name> in HBase to a primitive-type column in Hive, while mapping a column family <column_family:> in HBase to a map-type column in Hive. So the HBase table webpages' Hive schema will be (.key, .timestamp, page_content, anchors).

          Having set up the schema mapping between the HBase table and the Hive table, we need to consider how to record that mapping, serialize Hive objects into the HBase table, and deserialize HBase data back into Hive objects.

          The proposal is to add a new HBaseSerDe that records the schema mapping in its SerDe properties, so the SerDe can use the mapping to serialize Hive objects to the HBase table and deserialize HBase data to Hive objects.

          The properties in HBaseSerDe will be:
          1) "hbase.key.type" : the type of the .key column in the Hive table, defining how to deserialize the .key field from the HBase key (the HBase key is a byte array).
          2) "hbase.columns.mapping" : a comma-separated string defining the schema mapping. The columns are mapped in order, one by one.

          These properties should be provided when creating an HBase-backed Hive table. If "hbase.key.type" is not defined, we treat the key as a string. But if "hbase.columns.mapping" is not defined, we should fail the table creation, because we would not know how to deserialize Hive objects from HBase's raw bytes.
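          The property handling described above can be sketched roughly as follows. This is only an illustration of the proposal, not code from the patch; the function name and return shapes are hypothetical.

```python
# Hypothetical sketch of how HBaseSerDe might interpret its properties,
# per the proposal above. All names are illustrative.

DEFAULT_KEY_TYPE = "string"  # used when "hbase.key.type" is absent

def parse_serde_properties(props):
    """Validate SerDe properties and split the mapping string.

    A full column "cf:qualifier" maps to a primitive Hive column;
    a bare family "cf:" maps to a Hive MAP-type column.
    """
    key_type = props.get("hbase.key.type", DEFAULT_KEY_TYPE)
    mapping = props.get("hbase.columns.mapping")
    if mapping is None:
        # per the proposal, table creation must fail without a mapping
        raise ValueError("hbase.columns.mapping is required")
    columns = []
    for entry in mapping.split(","):
        family, _, qualifier = entry.partition(":")
        kind = "primitive" if qualifier else "map"
        columns.append((family, qualifier, kind))
    return key_type, columns

key_type, cols = parse_serde_properties(
    {"hbase.columns.mapping": "contents:page_content,anchors:"})
print(key_type)  # string
print(cols)      # [('contents', 'page_content', 'primitive'), ('anchors', '', 'map')]
```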

          The operations on an HBase-backed Hive table are shown below:

          1. Using an existing HBase table as an external table in Hive

          The 'create' command will be as below:

          -----------------------------

          CREATE EXTERNAL TABLE webpages(page_content STRING, anchors MAP<STRING, STRING>)
          COMMENT 'This is the pages table'
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
          WITH SERDEPROPERTIES (
          "hbase.key.type" = "string",
          "hbase.columns.mapping" = "contents:page_content,anchors:",
          )
          STORED AS HBASETABLE
          LOCATION '<hbase_table_location>'

          -----------------------------
          Here the hbase_table_location will identify the location of hbase and the hbase table name, such as "hbase:/hbase_master:port/hbase_tablename".

          And after creating an external table over an existing HBase table, we can run analysis over it like a normal Hive table.

          A. Get all the URLs and their pages that were added after a specified time t1.

          SELECT .key, page_content FROM webpages WHERE .timestamp > t1;

          B. Get the revisions of a specified url <www.apache.org> from a specified time t1 to a specified time t2.

          SELECT page_content FROM webpages WHERE .timestamp > t1 AND .timestamp < t2 AND .key = 'www.apache.org';

          2. Creating a new hbase table as a hive table.

          The 'create' command will be as below:

          -----------------------------

          CREATE TABLE webpages(page_content STRING, anchors MAP<STRING, STRING>)
          COMMENT 'This is the pages table'
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
          WITH SERDEPROPERTIES (
          "hbase.key.type" = "string",
          "hbase.columns.mapping" = "contents:page_content,anchors:",
          )
          STORED AS HBASETABLE
          LOCATION '<hbase_table_location>'

          -----------------------------

          After invoking the 'create' command, the Hive client will also create an HBase table in the specified HBase cluster. The created HBase table will have the two column families defined in the HBaseSerDe properties, "contents:" and "anchors:".

          3. Loading data into tables.

          Since we have two default hidden columns (.key, .timestamp) in an HBase-backed Hive table, we must account for these two columns when inserting data.
          We can load data into an HBase-backed Hive table either by inserting data from other tables or by loading data from the local filesystem.

          A. Inserting data from other tables.

          For example, suppose we have a 'crawled_pages' table collecting all the pages crawled from the internet. The 'crawled_pages' schema is simple: <url, crawled_date, page_content>.

          I. If we want to load all this data into the 'webpages' table, we will invoke the command as below:

          FROM crawled_pages cp
          INSERT TABLE webpages
          SELECT cp.url, cp.crawled_date, cp.page_content, null;

          II. If we do not want to specify the time when inserting the data, we can simply set the .timestamp column to 'null', as below:

          FROM crawled_pages cp
          INSERT TABLE webpages
          SELECT cp.url, null, cp.page_content, null;

          III. Finally, if the .key column provided is null, should we raise an error to the client or just skip the bad records?

          FROM crawled_pages cp
          INSERT TABLE webpages
          SELECT null, null, cp.page_content, null;

          B. Loading data from local filesystem (or hdfs)

          Currently Hive just copies/moves the files into the specified directory of a Hive table, but this should be forbidden when loading data into an HBase-backed Hive table.

          If we want to load data from files in the local filesystem (or HDFS) into HBase-backed Hive tables, we can do it as below:

          I. create a temp external table for the original data(files).
          II. load data into the hbased-hive table using 'insert' from the temp external table.

          4. Performance Improvements
          Some improvements may be considered when analyzing HBase tables. For example, the HBase key is an index for data access that could be used to accelerate Hive. Not clear yet.

          -----------------------------

          Forgive my poor English; comments are welcome.

          Sijie Guo added a comment -

          Attaching my patch.

          It differs a little from my previous proposal.

          Creating a table will be:

          -------------------------------------
          CREATE EXTERNAL TABLE webpages(pageURL STRING, page_content STRING, anchors MAP<STRING, STRING>)
          COMMENT 'This is the pages table'
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
          WITH SERDEPROPERTIES (
          "hbase.columns.mapping" = "contents:page_content,anchors:",
          )
          STORED AS HBASETABLE
          LOCATION '<hbase_table_location>'
          --------------------------------------

          The first field defined in the Hive table is mapped to the HBase table key, and the remaining fields are mapped to the HBase columns specified in the SerDe property "hbase.columns.mapping".

          And the timestamp field is not added for now; I just retrieve the latest version of each HBase cell.
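          The revised mapping above can be sketched like this: the first Hive field becomes the HBase row key, and the remaining fields pair up positionally with the entries in "hbase.columns.mapping". This is an illustrative sketch, not patch code; the function name and the ":key" pseudo-target are assumptions.

```python
# Illustrative sketch of the revised positional mapping. All names are
# hypothetical; ":key" is just a placeholder target for the row key.

def map_fields(hive_fields, columns_mapping):
    """Pair Hive fields with HBase targets; the first field is the row key."""
    mapped = columns_mapping.split(",")
    if len(hive_fields) != len(mapped) + 1:
        raise ValueError("field count must be mapping entries + 1 (the key)")
    result = {hive_fields[0]: ":key"}
    for field, target in zip(hive_fields[1:], mapped):
        result[field] = target
    return result

print(map_fields(["pageURL", "page_content", "anchors"],
                 "contents:page_content,anchors:"))
# {'pageURL': ':key', 'page_content': 'contents:page_content', 'anchors': 'anchors:'}
```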

          Schubert Zhang added a comment -

          Hi Samuel,

          Thanks for your great work.
          In your patch, we found that many Java files are modified; it is really a big change. I don't know if there is any way to avoid such a large modification.

          Regarding the schema mapping between the HBase table and the Hive SQL table, I have the following considerations:
          1. We just want to use HBase as a scalable structured data store, or key-value store.
          2. In our past experience, performance was not good when we mapped SQL columns to HBase columns. For example, with a table of 20 columns, each read or write of a row comprises 20 key-value operations. It is inefficient.

          How about considering a more flexible schema mapping:
          1. One HBase column can map to multiple Hive SQL columns with a SerDe, e.g. cf1:q1 => {(col1, col2, col3), Default SerDe}
          2. One HBase column family can map to multiple Hive SQL columns with a SerDe, e.g. cf2: => {(col3, col5, col6), Default SerDe}
          3. Your MAP column (in the Hive table) for a sparse column family. [Optional] Since Hive is a structured data analysis front-end, we can omit this feature at the beginning.

          For example:

          CREATE EXTERNAL TABLE hive_table (pkey STRING, col1 STRING, col2 INT, col3 INT, col4 STRING)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MyHBaseSerDe'
          WITH SERDEPROPERTIES (
          "hbase.columns.mapping" = "cf1:(col1,col2,col3) with DefaultSerDe, cf2:c1 (col4) with DefaultSerDe"
          )
          STORED AS HBASETABLE
          LOCATION '<hbase_table_location>'

          Usually, we want a more advanced data store backend than HDFS, to achieve more flexible data placement and indexing. HBase's data model is very good for meeting this requirement, but we may not need the full features of HBase here.


          Looking forward to having more communication with you in Chinese, at your convenience.

          Schubert

          He Yongqiang added a comment -

          Samuel, I am in Shanghai attending a meeting right now. I will talk with you on the phone as soon as I get back. Thanks for the quick fix.

          Sijie Guo added a comment -

          @schubert,

          Thank you for your comment.

          >> In your patch, we found that many Java files are modified; it is really a big change. I don't know if there is any way to avoid such a large modification.

          An HBase table is quite different from a file in HDFS, and the original Hive code is based on files. For example, when outputting reduce results to the target table, Hive uses a FileSinkOperator to write the results to temp files in HDFS, and then a MoveTask to rename the temp files into the target table's directory. But when the target table is backed by an HBase table, we do not need these file operations; we just output to the target HBase table.

          The modification of the original Java files is to teach Hive to handle an HBase table in a different way.

          I will try to look into the code and find a way to avoid the modification.

          >> 2. In our past experience, performance was not good when we mapped SQL columns to HBase columns. For example, with a table of 20 columns, each read or write of a row comprises 20 key-value operations. It is inefficient.

          A good point. The schema mapping does not affect performance when creating a Hive table; performance suffers only if we fetch all the mapped columns from the HBase table during an actual query. Some code will be added to do column pruning during the HBase table scan.

          For example, suppose an HBase table (cf1:(col1, col2, col3), cf2:(col4, col5, col6), ... , cfn:(colk, colj, coll)) maps to a Hive table (column1, column2, column3, column4, ... , column n).
          If the query "select column3, column4 from hbasedhivetable" is issued, we should not let HBase scan out all the columns. We know all the Hive columns used in the query, map them back to HBase columns, and get the scan list "cf1:col3 cf2:col4". We set that scan list in the HBaseInputFormat so that HBase scans out only the useful columns.

          The code will be added in the new patch.
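          The column-pruning step described above can be sketched as follows. This is a simplified illustration of the idea, not code from the patch; the function name and mapping table are hypothetical.

```python
# Sketch of the column-pruning idea: given the Hive columns referenced by a
# query, map them back to HBase columns and build the scan list handed to
# the input format. All names here are illustrative.

def build_scan_list(query_columns, hive_to_hbase):
    """Return the space-separated HBase column list for the columns a query uses."""
    missing = [c for c in query_columns if c not in hive_to_hbase]
    if missing:
        raise KeyError("unmapped hive columns: %s" % ", ".join(missing))
    return " ".join(hive_to_hbase[c] for c in query_columns)

hive_to_hbase = {"column3": "cf1:col3", "column4": "cf2:col4"}
# SELECT column3, column4 FROM hbasedhivetable
print(build_scan_list(["column3", "column4"], hive_to_hbase))  # cf1:col3 cf2:col4
```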

          >> cf2: => {(col3, col5, col6), Default SerDe}

          Cool. Let different SerDes work on different HBase columns. I will try it in the new patch.

          >> Look forward to have more communication with you in Chinese, by your convenience.
          My Gtalk is : sijie0413@gmail.com

          Ashish Thusoo added a comment -

          The data model mapping works. I have one suggestion, though: can we infer the column list of the Hive table from the HBase table instead of explicitly stating it in the create command? My concern is that adding a column family in HBase will require an ALTER TABLE in Hive, and it would be great if we could avoid that.

          Sijie Guo added a comment -

          @Ashish

          Thank you for your comment.
          It is difficult to infer the column list from a sparse-column HBase table; we do not know exactly how many columns a given HBase table has, only its column families.
          Also, the data in HBase are all raw bytes. If we do not explicitly state the schema mapping, we lose the information about how to serialize/deserialize the data from those raw bytes.

          Kula Liao added a comment -

          Hi Samuel,

          Thanks for your great work.
          I found some errors when testing your patch.

          The SQL statements are from the file "ql/src/test/queries/clienthbase/hbase_queries.q".
          I created a table named "hbase_table_1" using the following statement:

          CREATE TABLE hbase_table_1(key int, value string)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseSerDe'
          WITH SERDEPROPERTIES (
          "hbase.columns.mapping" = "cf:string"
          ) STORED AS HBASETABLE;

          OK. Then I inserted data into "hbase_table_1".

          hive> FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;
          Total MapReduce jobs = 1
          Number of reduce tasks is set to 0 since there's no reduce operator
          Starting Job = job_200908131113_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200908131113_0002
          Kill Command = /home/stephen/hadoop-0.19.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200908131113_0002
          2009-08-13 11:17:07,162 map = 0%, reduce =0%
          2009-08-13 11:17:14,200 map = 50%, reduce =0%
          2009-08-13 11:17:15,215 map = 100%, reduce =0%
          Ended Job = job_200908131113_0002
          500 Rows loaded to hbase_table_1
          OK

          When I tried to do some queries, I got the following error:

          hive> select * from hbase_table_1 where value > '0';
          Total MapReduce jobs = 1
          Number of reduce tasks is set to 0 since there's no reduce operator
          Starting Job = job_200908131113_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200908131113_0003
          Kill Command = /home/stephen/hadoop-0.19.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200908131113_0003
          2009-08-13 11:18:24,019 map = 0%, reduce =0%
          2009-08-13 11:18:42,146 map = 100%, reduce =100%
          Ended Job = job_200908131113_0003 with errors
          FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

          The following message is found in the mapreduce log:

          java.lang.NullPointerException
          at org.apache.hadoop.hbase.mapred.TableInputFormat.configure(TableInputFormat.java:52)
          at org.apache.hadoop.hive.ql.io.HiveHBaseTableInputFormat.configure(HiveHBaseTableInputFormat.java:36)
          at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
          at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputFormatFromCache(HiveInputFormat.java:184)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:211)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
          at org.apache.hadoop.mapred.Child.main(Child.java:158)

          Another query returned nothing:
          hive> select * from hbase_table_1;
          OK
          Time taken: 2.952 seconds

          Hide
          stephen xie added a comment -

          Hi Samuel,

          Also, I found the same problem as Kula.
          I changed one line in the method HiveInputFormat::getSplits,

- newjob.set(TableInputFormat.COLUMN_LIST, hbaseColumns);
+ job.set(TableInputFormat.COLUMN_LIST, hbaseColumns);

Then the Java exception above disappeared, and select works OK.

But when I tested a table with more than two columns, the queries returned nothing.

          CREATE TABLE hbase_table_2(key int, value1 string, value2 int)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseSerDe'
          WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:value1, cf:value2"
          ) STORED AS HBASETABLE;
          FROM src2 INSERT OVERWRITE TABLE hbase_table_2 SELECT *;

The following two queries both returned nothing:
          select * from hbase_table_2 where value > '0';
          select * from hbase_table2;

          Sijie Guo added a comment -

          @kula @stephen
          Thank you all for your comments.

1) As stephen mentioned, the NullPointerException is thrown because the COLUMN_LIST is set on the wrong job configuration.
I will fix it in the new patch.

2) It seems the "select *" statement is buggy now. I will find the problem and fix it.

          Sijie Guo added a comment -

          Attach a new patch.

1) moved the hbase-related code to the contrib package, since HBase is just an optional storage backend for Hive, not a required one.
I have tried to avoid modifying the original Hive code and just add an HBase serde to connect Hive with HBase. But the HBase storage model is quite different from the file storage model. For example, a loadwork is used to rename/copy files from a temp dir to the target table's dir when a query's target is a Hive table. But an HBase-backed Hive table can't be renamed now. So it's hard to make an HBase-backed Hive table follow the logic of a normal file-based Hive table. Instead, I added some code (HiveFormatUtils) to distinguish a file-based table from a non-file-based table.

2) fixed some bugs in the draft patch, such as "select *" returning nothing.

          ----------------------------------------------------------------------------------------------

How to use HBase as Hive's storage:

1) remember to add the contrib jar and the hbase jar to hive's auxPath, so M/R can distribute the necessary hbase-related jars to the whole Hadoop M/R cluster.

> $HIVE_HOME/bin/hive -auxPath ${contrib_jar},${hbase_jar}

          2) modify the configuration to add the following configuration parameters.

          "hbase.master" : pointer to the hbase's master.
          "hive.othermetadata.handlers" : "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"

          "hive.othermetadata.handlers" collects the metadata handlers to handle the other metadata operations in the not-file-based hive tables. Take hbase as an example. HBaseMetadataHandler will create the neccessary hbase table and its family columns when we create a hbased hive table from hive's client. It also drop the hbase table when we drop the hive table.

The metastore reads the registered handlers map from the configuration file during initialization. The registered handlers map is formatted as "table_format_classname:table_metadata_handler_classname,table_format_classname:table_metadata_handler_classname,...".
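For illustration, parsing that handlers-map format takes only a few lines of Java. This is a sketch, not code from the patch; the class and method names here are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser (not from the patch) for the registered-handlers map:
// "format_class:handler_class,format_class:handler_class,..."
public class HandlerMapParser {
    public static Map<String, String> parse(String conf) {
        Map<String, String> handlers = new LinkedHashMap<>();
        if (conf == null || conf.trim().isEmpty()) {
            return handlers;
        }
        for (String entry : conf.split(",")) {
            // Split only on the first ':' so the handler class name is kept whole.
            String[] pair = entry.trim().split(":", 2);
            if (pair.length != 2 || pair[0].isEmpty() || pair[1].isEmpty()) {
                throw new IllegalArgumentException("malformed handler entry: " + entry);
            }
            handlers.put(pair[0].trim(), pair[1].trim());
        }
        return handlers;
    }
}
```

A LinkedHashMap preserves the order in which handlers were declared, which keeps lookup behavior predictable if two entries ever map the same format class.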

          3) enjoy "hive over hbase"!

          ------------------------------------------------------------------------

          Other problems.

1) Altering an HBase-backed Hive table is not supported now.
Renaming a table in HBase is not supported now, so I simply do not support the rename operation. (Maybe if we rename a Hive table, we do not need to rename the underlying HBase table.)

Adding/replacing columns:
Now we need to specify the schema mapping in the SerDe properties explicitly. If we want to add columns, we need to call 'alter' twice: once to change the serde properties and once to change the Hive columns. Changing either the serde properties first or the Hive columns first will fail now, because we validate the schema mapping during SerDe initialization. One of the HBase serde validations checks that the count of Hive columns matches the count of HBase mapping columns. If we change the Hive columns first, there will be more Hive columns than HBase mapping columns, and the HBaseSerDe initialization will fail the alter operation. (Maybe we need to move the validation code out of HBaseSerDe initialization and do it elsewhere?)

          2) more flexible schema mapping?
As Schubert mentioned before, a more flexible schema mapping will be useful for users. This feature will be added later.

Comments are welcome~

          stephen xie added a comment -

          Hi, Samuel

Thanks very much for your new patch.
There were some problems when I used it, as follows:

          1. create table src(key int, value string);
          ok
          2. LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE src;
          ok
          3. CREATE TABLE hbase_table_1(key int, value string)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.hbase.HBaseSerDe'
          WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:string"
          ) STORED AS HBASETABLE;
          ok
          4.FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;
          FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

I found an error in the M/R map process, as follows:

          java.lang.RuntimeException: Map operator initialization failed
          at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:110)
          at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
          at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
          at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
          at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
          at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
          at org.apache.hadoop.mapred.Child.main(Child.java:158)
          Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
          at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:165)
          at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
          at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:345)
          at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:330)
          at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58)
          at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
          at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:345)
          at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:330)
          at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:316)
          at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
          at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:289)
          at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
          at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:82)
          ... 7 more
          Caused by: java.lang.NullPointerException
          at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:88)
          ... 19 more

          Sijie Guo added a comment -

          @stephen:

          Did you set the configuration parameter ' "hive.othermetadata.handlers" : "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler" '?

I am sorry that I have other things to handle these days. I will fix the bug as soon as I have time.

          stephen xie added a comment -

          Hi, Samuel

Before testing, I had set the configuration parameter "hive.othermetadata.handlers" exactly as you said.
          Thanks.

          Sijie Guo added a comment -

          @stephen

          I have run the patch on my notebook. But I did not encounter the NullPointerException mentioned in your comment.
          Can you send me the hive log and the userlogs of the mr job 'FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;' ?

          Thanks.

          stephen xie added a comment -

Thanks very much for Samuel's help.
The issue above has been resolved.
In the distributed test environment, the hive command must be run with the parameter --auxpath hive_contrib.jar,hbase.jar.

          Sijie Guo added a comment -

          Attach a new patch for trunk.

          John Sichi added a comment -

          I'm going to start working on getting this ready for submission against latest trunk.

          John Sichi added a comment -

          Here's the result of rebasing the old patch to apply against latest trunk. This is NOT intended for submission; it's just a checkpoint of the rebasing work for anyone who needs it. For the real submission, I'll be doing quite a bit of refactoring to generalize the concept of plugging in external storage, and possibly other concepts from HIVE-1133 based on pending discussions.

          Notable changes from the old patch:

          • update to require HBase 0.20.3, resulting in new zookeeper dependency (I'm testing with zookeeper 3.2.2)
          • eliminated parser changes; they'll probably come back in a more general form something like STORED BY 'storage-handler-class' which will encapsulate the combination of inputformat, outputformat, metastore hooks, and optimizer interaction such as filter/predicate pushdown
          Jonathan Ellis added a comment -

          ISTM that merging the HBase columnfamilies into a single Hive table is the wrong approach and could lead to poor performance; rather, each HBase CF should be its own Hive table, which may of course be joined with others as necessary. (I think using the word "table" for HBase's "collection of CFs" is unfortunate in the first place since they are different animals; fundamentally, the basic unit of data access in HBase is the CF.)

          I'm interested because Cassandra is also looking at adding Hive support, and we also implement a ColumnFamily data model.

          John Sichi added a comment -

          Jonathan, thanks for the input. I think we should be able to come up with a mapping feature which encompasses what you've proposed plus what's in HIVE-806 so that it will be up to the user to decide how to map a particular set of HBase tables into Hive.

          We can do this by allowing the HBase table name to be specified as part of mapping it into Hive. That way, you can have

          Hive t1(c1, c2) -> HBase t.cf1(c1, c2)
          Hive t2(c3, c4) -> HBase t.cf2(c3, c4)

          or

          Hive t(c1,c2,c3,c4) -> HBase t(cf1(c1, c2), cf2(c3, c4))

          or

          Hive t(cf1map, cf2map) -> HBase t(cf1, cf2)

          or variations. I'm going to write up a proposal in the Hive wiki and solicit feedback.
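To make the first variation concrete, a per-column-family mapping (Hive t1 -> HBase t.cf1) might be declared roughly like this under the STORED BY syntax described above. This is a hedged sketch of the proposal, not final syntax; the handler class name and property keys are assumptions modeled on the existing "hbase.columns.mapping" convention.

```sql
-- Hypothetical sketch: map Hive table t1 onto column family cf1 of HBase table t.
-- Handler class and property names are assumed, not final syntax.
CREATE TABLE t1(key STRING, c1 STRING, c2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:c1,cf1:c2")
TBLPROPERTIES ("hbase.table.name" = "t");
```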

          John Sichi added a comment -

          BTW, the new STORED BY 'storage-handler-class' should make it easy to plug in Cassandra.

          John Sichi added a comment -

          First draft of the patch ready for review. Reviewers, please read these two accompanying docs:

          http://wiki.apache.org/hadoop/Hive/HBaseIntegration
          http://wiki.apache.org/hadoop/Hive/StorageHandlers

Note that for this to be committed, it needs the accompanying jars, which I have also attached:

          hbase-0.20.3.jar
          hbase-0.20.3-test.jar
          zookeeper-3.2.2.jar

          These should be committed to trunk/hbase-handler/lib

          John Sichi added a comment -

          HIVE-705.2.patch

          Prasad Chakka added a comment -

          John, Why are pre, commit, rollback functions needed in MetaHook? Isn't it enough just to drop table as a rollback for create, and do the drop table after hive drop table? With the current definition the MetaHook implementation needs to keep state around which Hive itself doesn't do.

Also, alter table on external tables should be allowed, since the underlying storage format for external tables is not managed by Hive itself. In such cases alter table just changes metadata inside Hive.

          John Sichi added a comment -

          Prasad, the MetaHook interface is defined that way so that if a handler wants to, it can carry out the operation in a stateful fashion (e.g. if its underlying catalog supports transactions), but there is no requirement for it to keep state, and in fact the HBaseStorageHandler implementation is itself stateless (and has a NOP for three of its method implementations).

          Alter table: yes, I'm planning to create a followup task for this. The original patch had alter table support in the meta hook interface too, but I trimmed it down for now to limit the scope of the first commit.
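The pre/commit/rollback sequencing can be sketched in a few lines of Java. This is an illustrative model, not the actual Hive interface; the interface and method names are made up, and the driver-side sequencing shown is an assumption about how such a hook would be invoked around the metastore write.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the real Hive API) of a stateless meta-hook:
// the driver calls pre* before the metastore write, then commit* on success
// or rollback* on failure. A stateless handler (like the HBase one) can undo
// its pre* work in rollback* without remembering anything in between.
public class MetaHookDemo {
    interface TableMetaHook {
        void preCreateTable(String table);      // e.g. create the HBase table
        void commitCreateTable(String table);   // NOP for a stateless handler
        void rollbackCreateTable(String table); // e.g. drop the HBase table again
    }

    // Records the callback order so the sequencing is observable.
    static class RecordingHook implements TableMetaHook {
        final List<String> calls = new ArrayList<>();
        public void preCreateTable(String t) { calls.add("pre:" + t); }
        public void commitCreateTable(String t) { calls.add("commit:" + t); }
        public void rollbackCreateTable(String t) { calls.add("rollback:" + t); }
    }

    // Driver-side sequencing around the metastore write (simulated by a flag).
    static void createTable(TableMetaHook hook, String table, boolean metastoreFails) {
        hook.preCreateTable(table);
        if (metastoreFails) {
            hook.rollbackCreateTable(table);
        } else {
            hook.commitCreateTable(table);
        }
    }
}
```

The point of the three-phase shape is exactly what John describes: a handler whose catalog supports transactions can carry state across the calls, while a stateless handler implements rollback as a plain drop and leaves the other callbacks as NOPs.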

          John Sichi added a comment -

          While testing, found a few bugs in HBaseSerDe.serialize for the case where a Hive map is being converted into an HBase column family; I'll fix these together with whatever comes out of review.

          John Sichi added a comment -

          HIVE-705.3.patch resolves a conflict with trunk, fixes some serde bugs, and adds more tests.

          John Sichi added a comment -

          HIVE-705.4.patch fixes hbase-handler/ivy.xml

          John Sichi added a comment -

          HIVE-705.5.patch: fix conflict with latest trunk, and do some more cleanup.

          Jonathan Ellis added a comment -

          Thanks John, I read your wiki notes and it does look like this will work fine for Cassandra at least at the conceptual level.

          Is HIVE-806 redundant w/ your latest patchset now?

          Namit Jain added a comment -

John, can you file the follow-up jiras?

          John Sichi added a comment -

          @Jonathan: I haven't seen any patch uploaded for HIVE-806. The comments indicate that they have a way to customize the serialization per column in HBase, which could be interesting, but it's non-essential. Once HIVE-705 gets committed, I'll post a comment on HIVE-806 and ask whether they want to keep it open or abandon it.

          @Namit: will do.

          John Sichi added a comment -

          Followup JIRA issues have been logged and linked to this one as related.

          Namit Jain added a comment -

[ivy:retrieve] :: problems summary ::
[ivy:retrieve] :::: WARNINGS
[ivy:retrieve] module not found: hadoop#hbase;${hbase.version}
[ivy:retrieve] ==== hadoop-source: tried
[ivy:retrieve] -- artifact hadoop#hbase;${hbase.version}!hbase.tar.gz(source):
[ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hbase-${hbase.version}/hbase-${hbase.version}.tar.gz
[ivy:retrieve] ==== apache-snapshot: tried
[ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.pom
[ivy:retrieve] -- artifact hadoop#hbase;${hbase.version}!hbase.tar.gz(source):
[ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.tar.gz
[ivy:retrieve] ==== maven2: tried
[ivy:retrieve] http://repo1.maven.org/maven2/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.pom
[ivy:retrieve] -- artifact hadoop#hbase;${hbase.version}!hbase.tar.gz(source):
[ivy:retrieve] http://repo1.maven.org/maven2/hadoop/hbase/${hbase.version}/hbase-${hbase.version}.tar.gz
[ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:retrieve] :: UNRESOLVED DEPENDENCIES ::
[ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:retrieve] :: hadoop#hbase;${hbase.version}: not found
[ivy:retrieve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:retrieve]
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

          I am getting the following errors when I compile
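          The literal, unexpanded `${hbase.version}` in every URL of the log above means Ivy never received a value for the `hbase.version` Ant property, so it substituted the placeholder verbatim into each repository path. A minimal sketch of a workaround, assuming the build reads an optional `build.properties` file and that 0.20.3 is the desired HBase version (both the file convention and the version value are illustrative assumptions, not taken from the patch):

          ```shell
          # Supply the missing property so Ivy can expand ${hbase.version}.
          # The property name comes from the error log; the value and the
          # build.properties convention are illustrative assumptions.
          echo "hbase.version=0.20.3" > build.properties

          # Alternatively, pass it directly on the command line, e.g.:
          #   ant -Dhbase.version=0.20.3 package
          grep '^hbase.version=' build.properties   # confirm the property is defined
          ```

          Either way, the point is that the property must be bound before Ivy resolution runs; otherwise every resolver tries a URL containing the raw placeholder and fails, exactly as in the log.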

          John Sichi added a comment -

          Oops, sorry, when I regenerated the last patch, the bad ivy dependency crept back in accidentally. Here's a fixed version.

          John Sichi added a comment -

          Use HIVE-705.6.patch.

          John Sichi added a comment -

          Latest patch hits a test failure with latest trunk. I'll upload a new patch soon to fix it.

          John Sichi added a comment -

          OK, HIVE-705.7.patch should run through tests cleanly.

          Namit Jain added a comment -

          +1

          will commit if the tests pass

          Namit Jain added a comment -

          Committed. Thanks John


            People

            • Assignee:
              John Sichi
              Reporter:
              Sijie Guo
            • Votes:
              6
              Watchers:
              32
