HIVE-3179

HBase Handler doesn't handle NULLs properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 0.10.0
    • Fix Version/s: 0.11.0
    • Component/s: HBase Handler
    • Labels:
      None

      Description

      We found a severe issue in the HBase Handler: Hive can return incorrect data when a column is NULL in HBase (that is, when the cell does not exist for that row).

      In HBase Shell:

      create 'hive_hbase_test', 'test'
      put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
      put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
      put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
      put 'hive_hbase_test', '2', 'test:c1', 'c1-2'
      

      In Hive:

      DROP TABLE IF EXISTS hive_hbase_test;
      CREATE EXTERNAL TABLE hive_hbase_test (
        id int,
        c1 string,
        c2 string,
        c3 string
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" =
      ":key#s,test:c1#s,test:c2#s,test:c3#s")
      TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");
      
      hive> select * from hive_hbase_test;
      OK
      1	c1-1	c2-1	c3-1
      2	c1-2	NULL	NULL
      
      hive> select c1 from hive_hbase_test;
      c1-1
      c1-2
      
      hive> select c1, c2 from hive_hbase_test;
      c1-1	c2-1
      c1-2	NULL
      

      So far everything is correct, but now:

      hive> select c1, c2, c2 from hive_hbase_test;
      c1-1	c2-1	c2-1
      c1-2	NULL	c2-1
      

      When c2 is selected twice, the first occurrence is correct for row 2 (NULL),
      but the second returns the value from the previous row (c2-1).

      hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
      c1-1	c3-1	c2-1	c2-1	c3-1	c3-1	c1-1
      c1-2	NULL	NULL	c2-1	c3-1	c3-1	c1-2
      

      We've narrowed this down to fieldsInited[fieldID] being set to true too early in LazyHBaseRow#uncheckedGetField. We'll try to provide a patch, which will certainly need review.
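To make the suspected mechanism concrete, here is a minimal sketch of how a reused lazy row object can leak a value from the previous row when the "inited" flag is set before the cell is known to exist. This is a simplified model, not Hive's code: the class LazyRowSketch, its methods, and the clearOnMissing switch are hypothetical names chosen for illustration, and clearing the cached value is only one possible remedy, not necessarily what the patch does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of a lazy row object reused across rows. Not Hive's code. */
class LazyRowSketch {
    private final String[] columns;
    private final String[] fields;         // cached values; survive from row to row
    private final boolean[] fieldsInited;  // reset for every new row
    private final boolean clearOnMissing;  // false models the reported behavior
    private Map<String, String> cells = new HashMap<>();

    LazyRowSketch(String[] columns, boolean clearOnMissing) {
        this.columns = columns;
        this.fields = new String[columns.length];
        this.fieldsInited = new boolean[columns.length];
        this.clearOnMissing = clearOnMissing;
    }

    /** Points the reused object at the next row; only the flags are reset. */
    void init(Map<String, String> nextRow) {
        cells = nextRow;
        Arrays.fill(fieldsInited, false);
    }

    String getField(int id) {
        if (!fieldsInited[id]) {
            fieldsInited[id] = true;               // set before we know the cell exists
            String value = cells.get(columns[id]);
            if (value == null) {
                if (clearOnMissing) {
                    fields[id] = null;             // one possible remedy: drop the stale value
                }
                return null;                       // early return; without the reset, fields[id] stays stale
            }
            fields[id] = value;
        }
        return fields[id];                         // a repeated read trusts the cache
    }

    /** Reads the given column indexes in order, like a SELECT list, tab-separated. */
    String select(int... ids) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ids.length; i++) {
            String v = getField(ids[i]);
            if (i > 0) out.append('\t');
            out.append(v == null ? "NULL" : v);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row1 = new HashMap<>();
        row1.put("c1", "c1-1"); row1.put("c2", "c2-1"); row1.put("c3", "c3-1");
        Map<String, String> row2 = new HashMap<>();
        row2.put("c1", "c1-2");

        for (boolean clear : new boolean[] {false, true}) {
            LazyRowSketch row = new LazyRowSketch(new String[] {"c1", "c2", "c3"}, clear);
            System.out.println("clearOnMissing=" + clear);
            for (Map<String, String> cells : Arrays.asList(row1, row2)) {
                row.init(cells);
                System.out.println(row.select(0, 1, 1)); // select c1, c2, c2
            }
        }
    }
}
```

With clearOnMissing=false, the second row in this model comes out as c1-2, NULL, c2-1, the same pattern as the "select c1, c2, c2" output in the report. With clearOnMissing=true, both c2 reads return NULL.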

          Activity

          Owen O'Malley made changes -
          Status: Resolved [ 5 ] -> Closed [ 6 ]
          Gavin made changes -
          Link: This issue is related to HIVE-4057 [ HIVE-4057 ]
          Ashutosh Chauhan made changes -
          Fix Version/s: 0.11.0 [ 12323587 ]
          Fix Version/s: 0.12.0 [ 12324312 ]
          Ashutosh Chauhan made changes -
          Assignee: Lars Francke [ lars_francke ]
          Navis made changes -
          Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
          Fix Version/s: 0.12.0 [ 12324312 ]
          Resolution: Fixed [ 1 ]
          binlijin made changes -
          Link: This issue is related to HIVE-4057 [ HIVE-4057 ]
          Brock Noland made changes -
          Affects Version/s: 0.10.0 [ 12320745 ]
          Brock Noland made changes -
          Remote Link: This issue links to "Review Board (Web Link)" [ 12010 ]
          Lars Francke made changes -
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
          Lars Francke made changes -
          Attachment: HIVE-3179.1.patch [ 12533045 ]
          Lars Francke created issue -

            People

            • Assignee:
              Lars Francke
              Reporter:
              Lars Francke
            • Votes:
              2
              Watchers:
              10
