Hive
  1. Hive
  2. HIVE-3179

HBase Handler doesn't handle NULLs properly

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 0.10.0
    • Fix Version/s: 0.11.0
    • Component/s: HBase Handler
    • Labels:
      None

      Description

      We found a quite severe issue in the HBase Handler which actually means that Hive potentially returns incorrect data if a column has NULL values in HBase (which means the cell doesn't even exist)

      In HBase Shell:

      create 'hive_hbase_test', 'test'
      put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
      put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
      put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
      put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
      

      In Hive:

      DROP TABLE IF EXISTS hive_hbase_test;
      CREATE EXTERNAL TABLE hive_hbase_test (
        id int,
        c1 string,
        c2 string,
        c3 string
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" =
      ":key#s,test:c1#s,test:c2#s,test:c3#s")
      TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");
      
      hive> select * from hive_hbase_test;
      OK
      1	c1-1	c2-1	c3-1
      2	c1-2	NULL	NULL
      
      hive> select c1 from hive_hbase_test;
      c1-1
      c1-2
      
      hive> select c1, c2 from hive_hbase_test;
      c1-1	c2-1
      c1-2	NULL
      

      So far everything is correct but now:

      hive> select c1, c2, c2 from hive_hbase_test;
      c1-1	c2-1	c2-1
      c1-2	NULL	c2-1
      

      Selecting c2 twice works the first time but the second time we
      actually get the value from the previous row.

      hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
      c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
      c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
      

      We've narrowed this down to an early initialization of fieldsInited[fieldID] = true in LazyHBaseRow#uncheckedGetField and we'll try to provide a patch which surely needs review.

        Issue Links

          Activity

          Lars Francke created issue -
          Lars Francke made changes -
          Field Original Value New Value
          Description We found a quite severe issue in the HBase Handler which actually means that Hive potentially returns incorrect data if a column has NULL values in HBase (which means the cell doesn't even exist)

          In HBase Shell:

          {noformat}
          create 'hive_hbase_test', 'test'
          put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
          put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
          put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
          put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
          {noformat}

          In Hive:

          {noformat}
          DROP TABLE IF EXISTS hive_hbase_test;
          CREATE EXTERNAL TABLE hive_hbase_test (
            id int,
            c1 string,
            c2 string,
            c3 string
          )
          STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
          WITH SERDEPROPERTIES ("hbase.columns.mapping" =
          ":key#s,test:c1#s,test:c2#s,test:c3#s")
          TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

          hive> select * from hive_hbase_test;
          OK
          1 c1-1 c2-1 c3-1
          2 c1-2 NULL NULL

          hive> select c1 from hive_hbase_test;
          c1-1
          c1-2

          hive> select c1, c2 from hive_hbase_test;
          c1-1 c2-1
          c1-2 NULL
          {noformat}

          So far everything is correct but now:

          {noformat}
          hive> select c1, c2, c2 from hive_hbase_test;
          c1-1 c2-1 c2-1
          c1-2 NULL c2-1
          {noformat}

          Selecting c2 twice works the first time but the second time we
          actually get the value from the previous row.

          {noformat}
          hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
          c1-1 c3-1 c2-1 c2-1 c3-1 c3-1 c1-1
          c1-2 NULL NULL c2-1 c3-1 c3-1 c1-2
          {noformat}

          We've narrowed this down to an early initialization of {{fieldsInited[fieldID] = true;}} in {{LazyHBaseRow#uncheckedGetField}} and we'll try to provide a patch which surely needs review.
          We found a quite severe issue in the HBase Handler which actually means that Hive potentially returns incorrect data if a column has NULL values in HBase (which means the cell doesn't even exist)

          In HBase Shell:

          {noformat}
          create 'hive_hbase_test', 'test'
          put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
          put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
          put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
          put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
          {noformat}

          In Hive:

          {noformat}
          DROP TABLE IF EXISTS hive_hbase_test;
          CREATE EXTERNAL TABLE hive_hbase_test (
            id int,
            c1 string,
            c2 string,
            c3 string
          )
          STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
          WITH SERDEPROPERTIES ("hbase.columns.mapping" =
          ":key#s,test:c1#s,test:c2#s,test:c3#s")
          TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

          hive> select * from hive_hbase_test;
          OK
          1 c1-1 c2-1 c3-1
          2 c1-2 NULL NULL

          hive> select c1 from hive_hbase_test;
          c1-1
          c1-2

          hive> select c1, c2 from hive_hbase_test;
          c1-1 c2-1
          c1-2 NULL
          {noformat}

          So far everything is correct but now:

          {noformat}
          hive> select c1, c2, c2 from hive_hbase_test;
          c1-1 c2-1 c2-1
          c1-2 NULL c2-1
          {noformat}

          Selecting c2 twice works the first time but the second time we
          actually get the value from the previous row.

          {noformat}
          hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
          c1-1 c3-1 c2-1 c2-1 c3-1 c3-1 c1-1
          c1-2 NULL NULL c2-1 c3-1 c3-1 c1-2
          {noformat}

          We've narrowed this down to an early initialization of {{fieldsInited\[fieldID] = true}} in {{LazyHBaseRow#uncheckedGetField}} and we'll try to provide a patch which surely needs review.
          Hide
          Lars Francke added a comment -

          The attached patch fixes the problem as well as changes a unit test that actually tests this behavior. The unit test fails if our fix to LazyHBaseRow is not applied.

          We're not sure if this is the best way to fix this problem as it circumvents the optimization being done by the fieldsInited field. Ideally instead of returning null on an empty HBase cell this would insert some kind of marker but adding an empty ByteArrayRef is not interpreted as NULL but as an empty value (which makes sense).

          In short: This fixes the bug at the cost of some performance for NULL (non-existing) fields in HBase.

          Show
          Lars Francke added a comment - The attached patch fixes the problem as well as changes a unit test that actually tests this behavior. The unit test fails if our fix to LazyHBaseRow is not applied. We're not sure if this is the best way to fix this problem as it circumvents the optimization being done by the fieldsInited field. Ideally instead of returning null on an empty HBase cell this would insert some kind of marker but adding an empty ByteArrayRef is not interpreted as NULL but as an empty value (which makes sense). In short: This fixes the bug at the cost of some performance for NULL (non-existing) fields in HBase.
          Lars Francke made changes -
          Attachment HIVE-3179.1.patch [ 12533045 ]
          Hide
          Lars Francke added a comment -

          We could add a second boolean array to go with fieldsInited that's called fieldsNull that caches those fields. Not sure if that's needed though.

          Thanks to my colleague Oliver Meyn who actually looked at the code and found the fix, I only packaged it up and added the unit test.

          Show
          Lars Francke added a comment - We could add a second boolean array to go with fieldsInited that's called fieldsNull that caches those fields. Not sure if that's needed though. Thanks to my colleague Oliver Meyn who actually looked at the code and found the fix, I only packaged it up and added the unit test.
          Lars Francke made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Carl Steinbach added a comment -

          @Lars: Please post a review request on reviews.apache.org. Thanks.

          Show
          Carl Steinbach added a comment - @Lars: Please post a review request on reviews.apache.org. Thanks.
          Hide
          Lars Francke added a comment -

          @Carl: Sure: https://reviews.apache.org/r/5542/ thanks for the reminder.

          Show
          Lars Francke added a comment - @Carl: Sure: https://reviews.apache.org/r/5542/ thanks for the reminder.
          Hide
          Lars Francke added a comment -

          As far as I can tell this is still an issue. Would anyone mind doing a review on this one?

          Show
          Lars Francke added a comment - As far as I can tell this is still an issue. Would anyone mind doing a review on this one?
          Brock Noland made changes -
          Remote Link This issue links to "Review Board (Web Link)" [ 12010 ]
          Hide
          Brock Noland added a comment -

          I have verified this is an issue with trunk, the patch applies, and the patch addresses the issue.

          hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
          Total MapReduce jobs = 1
          Launching Job 1 out of 1
          Number of reduce tasks is set to 0 since there's no reduce operator
          Starting Job = job_201302071609_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201302071609_0002
          Kill Command = /opt/local/hadoop-1.1.1/libexec/../bin/hadoop job  -kill job_201302071609_0002
          Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
          2013-02-07 16:10:31,826 Stage-1 map = 0%,  reduce = 0%
          2013-02-07 16:10:34,846 Stage-1 map = 100%,  reduce = 0%
          2013-02-07 16:10:36,861 Stage-1 map = 100%,  reduce = 100%
          Ended Job = job_201302071609_0002
          MapReduce Jobs Launched: 
          Job 0: Map: 1   HDFS Read: 260 HDFS Write: 60 SUCCESS
          Total MapReduce CPU Time Spent: 0 msec
          OK
          c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
          c1-2	NULL	NULL	NULL	NULL	NULL	c1-2
          Time taken: 10.702 seconds, Fetched: 2 row(s)
          hive> 
          
          Show
          Brock Noland added a comment - I have verified this is an issue with trunk, the patch applies, and the patch addresses the issue. hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201302071609_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201302071609_0002 Kill Command = /opt/local/hadoop-1.1.1/libexec/../bin/hadoop job -kill job_201302071609_0002 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2013-02-07 16:10:31,826 Stage-1 map = 0%, reduce = 0% 2013-02-07 16:10:34,846 Stage-1 map = 100%, reduce = 0% 2013-02-07 16:10:36,861 Stage-1 map = 100%, reduce = 100% Ended Job = job_201302071609_0002 MapReduce Jobs Launched: Job 0: Map: 1 HDFS Read: 260 HDFS Write: 60 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK c1-1 c3-1 c2-1 c2-1 c3-1 c3-1 c1-1 c1-2 NULL NULL NULL NULL NULL c1-2 Time taken: 10.702 seconds, Fetched: 2 row(s) hive>
          Brock Noland made changes -
          Affects Version/s 0.10.0 [ 12320745 ]
          Hide
          Shreepadma Venugopalan added a comment -

          +1.

          Show
          Shreepadma Venugopalan added a comment - +1.
          Hide
          Mark Grover added a comment -

          Running TestHBaseCliDriver tests...

          Show
          Mark Grover added a comment - Running TestHBaseCliDriver tests...
          Hide
          Brock Noland added a comment -

          Mark,

          How did the tests turn out?

          Show
          Brock Noland added a comment - Mark, How did the tests turn out?
          Hide
          Mark Grover added a comment -

          They timed out on my pseudo-distributed laptop but that most likely is an environment issue local to me. But looks like Lars mentioned that he had run the tests, so that should be ok.

          I will try to fix the environment, but don't wait on me.

          Show
          Mark Grover added a comment - They timed out on my pseudo-distributed laptop but that most likely is an environment issue local to me. But looks like Lars mentioned that he had run the tests, so that should be ok. I will try to fix the environment, but don't wait on me.
          Hide
          Brock Noland added a comment -

          Carl Steinbach do you have time to look at this patch?

          Show
          Brock Noland added a comment - Carl Steinbach do you have time to look at this patch?
          binlijin made changes -
          Link This issue is related too HIVE-4057 [ HIVE-4057 ]
          Hide
          Navis added a comment -

          Passed all tests and committed to trunk. Thanks Lars.

          Show
          Navis added a comment - Passed all tests and committed to trunk. Thanks Lars.
          Navis made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.12.0 [ 12324312 ]
          Resolution Fixed [ 1 ]
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #160 (See https://builds.apache.org/job/Hive-trunk-hadoop2/160/)
          HIVE-3179 HBase Handler doesn't handle NULLs properly (Lars Francke via Navis) (Revision 1467874)

          Result = FAILURE
          navis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467874
          Files :

          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/LazyHBaseRow.java
          • /hive/trunk/hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestLazyHBaseObject.java
          Show
          Hudson added a comment - Integrated in Hive-trunk-hadoop2 #160 (See https://builds.apache.org/job/Hive-trunk-hadoop2/160/ ) HIVE-3179 HBase Handler doesn't handle NULLs properly (Lars Francke via Navis) (Revision 1467874) Result = FAILURE navis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467874 Files : /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/LazyHBaseRow.java /hive/trunk/hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestLazyHBaseObject.java
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #2065 (See https://builds.apache.org/job/Hive-trunk-h0.21/2065/)
          HIVE-3179 HBase Handler doesn't handle NULLs properly (Lars Francke via Navis) (Revision 1467874)

          Result = FAILURE
          navis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467874
          Files :

          • /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/LazyHBaseRow.java
          • /hive/trunk/hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestLazyHBaseObject.java
          Show
          Hudson added a comment - Integrated in Hive-trunk-h0.21 #2065 (See https://builds.apache.org/job/Hive-trunk-h0.21/2065/ ) HIVE-3179 HBase Handler doesn't handle NULLs properly (Lars Francke via Navis) (Revision 1467874) Result = FAILURE navis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467874 Files : /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/LazyHBaseRow.java /hive/trunk/hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestLazyHBaseObject.java
          Ashutosh Chauhan made changes -
          Assignee Lars Francke [ lars_francke ]
          Ashutosh Chauhan made changes -
          Fix Version/s 0.11.0 [ 12323587 ]
          Fix Version/s 0.12.0 [ 12324312 ]
          Gavin made changes -
          Link This issue is related to HIVE-4057 [ HIVE-4057 ]
          Gavin made changes -
          Link This issue is related to HIVE-4057 [ HIVE-4057 ]
          Owen O'Malley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          2d 2h 28m 1 Lars Francke 24/Jun/12 15:54
          Patch Available Patch Available Resolved Resolved
          294d 16h 16m 1 Navis 15/Apr/13 08:10
          Resolved Resolved Closed Closed
          31d 14h 1 Owen O'Malley 16/May/13 22:10

            People

            • Assignee:
              Lars Francke
              Reporter:
              Lars Francke
            • Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development