Hive
  1. Hive
  2. HIVE-1257

joins between HBase tables and other tables (whether HBase or not) are broken

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: HBase Handler
    • Labels:
    • Hadoop Flags:
      Reviewed
    1. HIVE-1257.prelim.patch
      2 kB
      John Sichi
    2. HIVE-1257.1.patch
      11 kB
      John Sichi
    3. HIVE-1257.2.patch
      14 kB
      John Sichi

      Issue Links

        Activity

        Hide
        Namit Jain added a comment -

        Committed. Thanks John

        Show
        Namit Jain added a comment - Committed. Thanks John
        Hide
        Namit Jain added a comment -

        +1

        will commit if the tests pass

        Show
        Namit Jain added a comment - +1 will commit if the tests pass
        Hide
        John Sichi added a comment -

        HIVE-1257.2.patch repeats the test with CombineHiveInputFormat

        Show
        John Sichi added a comment - HIVE-1257 .2.patch repeats the test with CombineHiveInputFormat
        Hide
        Namit Jain added a comment -

        Can you add a test (or run the sane test) with CombineHiveInputFormat also ?

        Show
        Namit Jain added a comment - Can you add a test (or run the sane test) with CombineHiveInputFormat also ?
        Hide
        John Sichi added a comment -

        HIVE-1257.1.patch is the real fix.

        Per HIVE-1236, I made it work both with and without CombineHiveInputFormat, but only by making CombineHiveInputFormat punt in the case where it sees a non-native table. This is not a good long-term solution, so I'm opening a separate issue HIVE-1267 for that.

        Show
        John Sichi added a comment - HIVE-1257 .1.patch is the real fix. Per HIVE-1236 , I made it work both with and without CombineHiveInputFormat, but only by making CombineHiveInputFormat punt in the case where it sees a non-native table. This is not a good long-term solution, so I'm opening a separate issue HIVE-1267 for that.
        Hide
        John Sichi added a comment -

        I also need to add new tests for the real patch submission.

        Show
        John Sichi added a comment - I also need to add new tests for the real patch submission.
        Hide
        John Sichi added a comment -

        Preliminary patch which fixes the test queries; I haven't run through regression tests yet.

        Show
        John Sichi added a comment - Preliminary patch which fixes the test queries; I haven't run through regression tests yet.
        Hide
        John Sichi added a comment -

        Copying details from mailing list thread:

        From "Daniel Pruckler" <Daniel.Pruck...@t8webware.com>
        Subject Problem when joining hbase tables in hive
        Date Thu, 18 Mar 2010 16:50:22 GMT

        We're just starting with the HBaseIntegration, and are really excited about it, great work
        on this. Thanks

        We are running into a problem when joining two hbase tables. It seems to usually work when
        joining one hbase table with one hive table, but not always. Let me know if I'm doing something
        wrong.

        My workaround for right now is to create new tables in hive and overwrite that table with
        the external hbase table, and then join from the new tables. In our situation this is working
        as the one large table we've got seems to work just fine in hbase, we just need to overwrite
        the smaller join tables before doing a join. I have one join example at the end of this email
        that fails when joining an hbase and hive table.

        Here's what I did:

        Create the HBase tables:

        create 'users', 'info'

        put 'users', 'user1', 'info:state', 'IA'

        put 'users', 'user1', 'info:country', 'USA'

        put 'users', 'user1', 'info:country_id', '0'

        create 'states', 'state'

        put 'states', 'IA', 'state:name', 'Iowa'

        create 'countries', 'info'

        put 'countries', 'USA', 'info:name', 'United States'

        put 'countries', 'USA', 'info:country', 'USA'

        put 'countries', 'USA', 'info:country_id', '1'

        Create the hive tables:

        CREATE EXTERNAL TABLE users(key string, state string, country string, country_id int)

        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

        WITH SERDEPROPERTIES (

        "hbase.key.type" = "string",

        "hbase.columns.mapping" = "info:state,info:country,info:country_id",

        "hbase.table.name" = "users"

        );

        CREATE EXTERNAL TABLE states(key string, name string)

        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

        WITH SERDEPROPERTIES (

        "hbase.key.type" = "string",

        "hbase.columns.mapping" = "state:name",

        "hbase.table.name" = "states"

        );

        CREATE EXTERNAL TABLE countries(key string, name string, country string, country_id int)

        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

        WITH SERDEPROPERTIES (

        "hbase.key.type" = "string",

        "hbase.columns.mapping" = "info:name,info:country,info:country_id",

        "hbase.table.name" = "countries"

        );

        Then try to join, this returns no results:

        SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.key);

        OK

        Time taken: 26.75 seconds

        This returns results, but not what's expected, it seems like it joins the user table with
        itself

        SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.country);

        OK

        user1 USA NULL user1

        Time taken: 24.867 seconds

        This also returns results (it should not as the country_ids do not match)

        SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country_id = c.country_id);

        OK

        user1 USA NULL user1

        Time taken: 22.868 seconds

        I would guess this is part of the same issue, trying to join anything in state with user results
        in an error (state does not have an 'info' column family)

        hive> SELECT u.key, u.state, s.name FROM users u JOIN states s ON (u.state = s.key);

        Total MapReduce jobs = 1

        Launching Job 1 out of 1

        Number of reduce tasks not specified. Estimated from input data size: 1

        In order to change the average load for a reducer (in bytes):

        set hive.exec.reducers.bytes.per.reducer=<number>

        In order to limit the maximum number of reducers:

        set hive.exec.reducers.max=<number>

        In order to set a constant number of reducers:

        set mapred.reduce.tasks=<number>

        Starting Job = job_201003150937_0096, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201003150937_0096

        Kill Command = /usr/lib/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001
        -kill job_201003150937_0096

        2010-03-18 10:43:28,730 Stage-1 map = 0%, reduce = 0%

        2010-03-18 10:43:34,908 Stage-1 map = 50%, reduce = 0%

        2010-03-18 10:43:44,020 Stage-1 map = 50%, reduce = 17%

        2010-03-18 10:44:02,146 Stage-1 map = 100%, reduce = 100%

        Ended Job = job_201003150937_0096 with errors

        Task with the most failures(4):


        Task ID:

        task_201003150937_0096_m_000001

        URL:

        http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0096&tipid=task_201003150937_0096_m_000001


        FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

        org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException:
        Column family state does not exist in region users,,1268926579188 in table {NAME => 'users',
        FAMILIES => [

        {NAME => 'info', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}

        ]}

        at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2476)

        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1107)

        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1095)

        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1846)

        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

        at java.lang.reflect.Method.invoke(Method.java:597)

        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)

        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)

        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1041)

        at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1935)

        at org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1855)

        at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:376)

        at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.restart(TableInputFormatBase.java:129)

        at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.init(TableInputFormatBase.java:139)

        at org.apache.hadoop.hbase.mapred.TableInputFormatBase.getRecordReader(TableInputFormatBase.java:260)

        at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:119)

        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)

        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)

        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

        at org.apache.hadoop.mapred.Child.main(Child.java:170)

        Here's my workaround that allows me to be able to join.

        CREATE TABLE hive_users(key string, state string, country string, country_id int);

        INSERT OVERWRITE TABLE hive_users SELECT key, state, country, country_id FROM users;

        CREATE TABLE hive_states(key string, name string);

        INSERT OVERWRITE TABLE hive_states SELECT key, name FROM states;

        CREATE TABLE hive_countries(key string, name string, country string, country_id int);

        INSERT OVERWRITE TABLE hive_countries SELECT key, name, country, country_id FROM countries;

        Any of my selects work just fine:

        SELECT u.key, u.state, s.name FROM hive_users u JOIN hive_states s ON (u.state = s.key);

        SELECT u.key, u.country, c.name FROM hive_users u JOIN hive_countries c ON (u.country = c.key);

        When joining one hbase table it will work with some tables, and give errors with others, I'm
        not seeing anything I'm doing wrong with the select.

        This works:

        SELECT u.key, u.country, c.name FROM users u JOIN hive_countries c ON (u.country = c.key);

        However while trying other things, I found that this query does not work, and cannot figure
        out why it'd be any different, maybe I'm doing something wrong.

        SELECT u.key, u.country, c.name FROM hive_users u JOIN countries c ON (u.country = c.key);

        Ended Job = job_201003150937_0118 with errors

        Task with the most failures(4):


        Task ID:

        task_201003150937_0118_m_000001

        URL:

        http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0118&tipid=task_201003150937_0118_m_000001


        FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver

        java.lang.IllegalArgumentException: string cannot be null

        at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:390)

        at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:68)

        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229)

        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)

        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

        at org.apache.hadoop.mapred.Child.main(Child.java:170)

        Thanks,

        Daniel

        Show
        John Sichi added a comment - Copying details from mailing list thread: From "Daniel Pruckler" <Daniel.Pruck...@t8webware.com> Subject Problem when joining hbase tables in hive Date Thu, 18 Mar 2010 16:50:22 GMT We're just starting with the HBaseIntegration, and are really excited about it, great work on this. Thanks We are running into a problem when joining two hbase tables. It seems to usually work when joining one hbase table with one hive table, but not always. Let me know if I'm doing something wrong. My workaround for right now is to create new tables in hive and overwrite that table with the external hbase table, and then join from the new tables. In our situation this is working as the one large table we've got seems to work just fine in hbase, we just need to overwrite the smaller join tables before doing a join. I have one join example at the end of this email that fails when joining an hbase and hive table. Here's what I did: Create the HBase tables: create 'users', 'info' put 'users', 'user1', 'info:state', 'IA' put 'users', 'user1', 'info:country', 'USA' put 'users', 'user1', 'info:country_id', '0' create 'states', 'state' put 'states', 'IA', 'state:name', 'Iowa' create 'countries', 'info' put 'countries', 'USA', 'info:name', 'United States' put 'countries', 'USA', 'info:country', 'USA' put 'countries', 'USA', 'info:country_id', '1' Create the hive tables: CREATE EXTERNAL TABLE users(key string, state string, country string, country_id int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.key.type" = "string", "hbase.columns.mapping" = "info:state,info:country,info:country_id", "hbase.table.name" = "users" ); CREATE EXTERNAL TABLE states(key string, name string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.key.type" = "string", "hbase.columns.mapping" = "state:name", "hbase.table.name" = "states" ); CREATE EXTERNAL TABLE countries(key string, name string, country string, country_id int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.key.type" = "string", "hbase.columns.mapping" = "info:name,info:country,info:country_id", "hbase.table.name" = "countries" ); Then try to join, this returns no results: SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.key); OK Time taken: 26.75 seconds This returns results, but not what's expected, it seems like it joins the user table with itself SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country = c.country); OK user1 USA NULL user1 Time taken: 24.867 seconds This also returns results (it should not as the country_ids do not match) SELECT u.key, u.country, c.name, c.key FROM users u JOIN countries c ON (u.country_id = c.country_id); OK user1 USA NULL user1 Time taken: 22.868 seconds I would guess this is part of the same issue, trying to join anything in state with user results in an error (state does not have an 'info' column family) hive> SELECT u.key, u.state, s.name FROM users u JOIN states s ON (u.state = s.key); Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Starting Job = job_201003150937_0096, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201003150937_0096 Kill Command = /usr/lib/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201003150937_0096 2010-03-18 10:43:28,730 Stage-1 map = 0%, reduce = 0% 2010-03-18 10:43:34,908 Stage-1 map = 50%, reduce = 0% 2010-03-18 10:43:44,020 Stage-1 map = 50%, reduce = 17% 2010-03-18 10:44:02,146 Stage-1 map = 100%, reduce = 100% Ended Job = job_201003150937_0096 with errors Task with the most failures(4): Task ID: task_201003150937_0096_m_000001 URL: http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0096&tipid=task_201003150937_0096_m_000001 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family state does not exist in region users,,1268926579188 in table {NAME => 'users', FAMILIES => [ {NAME => 'info', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} ]} at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2476) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1107) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1095) at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1846) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94) at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1041) at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1935) at org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1855) at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:376) at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.restart(TableInputFormatBase.java:129) at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.init(TableInputFormatBase.java:139) at org.apache.hadoop.hbase.mapred.TableInputFormatBase.getRecordReader(TableInputFormatBase.java:260) at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:119) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Here's my workaround that allows me to be able to join. CREATE TABLE hive_users(key string, state string, country string, country_id int); INSERT OVERWRITE TABLE hive_users SELECT key, state, country, country_id FROM users; CREATE TABLE hive_states(key string, name string); INSERT OVERWRITE TABLE hive_states SELECT key, name FROM states; CREATE TABLE hive_countries(key string, name string, country string, country_id int); INSERT OVERWRITE TABLE hive_countries SELECT key, name, country, country_id FROM countries; Any of my selects work just fine: SELECT u.key, u.state, s.name FROM hive_users u JOIN hive_states s ON (u.state = s.key); SELECT u.key, u.country, c.name FROM hive_users u JOIN hive_countries c ON (u.country = c.key); When joining one hbase table it will work with some tables, and give errors with others, I'm not seeing anything I'm doing wrong with the select. This works: SELECT u.key, u.country, c.name FROM users u JOIN hive_countries c ON (u.country = c.key); However while trying other things, I found that this query does not work, and cannot figure out why it'd be any different, maybe I'm doing something wrong. SELECT u.key, u.country, c.name FROM hive_users u JOIN countries c ON (u.country = c.key); Ended Job = job_201003150937_0118 with errors Task with the most failures(4): Task ID: task_201003150937_0118_m_000001 URL: http://localhost:50030/taskdetails.jsp?jobid=job_201003150937_0118&tipid=task_201003150937_0118_m_000001 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.ExecDriver java.lang.IllegalArgumentException: string cannot be null at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:390) at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getRecordReader(HiveHBaseTableInputFormat.java:68) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:229) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Thanks, Daniel

          People

          • Assignee:
            John Sichi
            Reporter:
            John Sichi
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development