Hive
  1. Hive
  2. HIVE-6117

mapreduce.RecordReader instance needs to be initialized

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: HBase Handler
    • Labels:
      None

      Description

      The HBase storage handler makes use of a mapreduce.RecordReader instance but does not initialize it when consumed from local context. This results in a NPE for some queries, for instance

      create table hbase_1(key string, age int) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties ( "hbase.columns.mapping" = "info:age");
      insert overwrite table hbase_1 select name, SUM(age) from studenttab10k group by name;
      select * from hbase_1;
      

      The select statement throws the following exception

      13/12/18 01:30:32 ERROR CliDriver: Failed with exception java.io.IOException:java.lang.NullPointerException
      java.io.IOException: java.lang.NullPointerException
      at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:551)
      at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:489)
      at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
      at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1494)
      at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:348)
      at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:446)
      at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:456)
      at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:737)
      at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
      at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      Caused by: java.lang.NullPointerException
      at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:196)
      at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
      at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:234)
      at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:193)
      at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
      
      1. HIVE-6117.0.patch
        0.8 kB
        Brock Noland
      2. 6117.00.patch
        0.8 kB
        Nick Dimiduk

        Issue Links

          Activity

          Nick Dimiduk created issue -
          Hide
          Nick Dimiduk added a comment -

          Attaching patch for trunk.

          Show
          Nick Dimiduk added a comment - Attaching patch for trunk.
          Nick Dimiduk made changes -
          Field Original Value New Value
          Attachment 6117.00.patch [ 12620641 ]
          Nick Dimiduk made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Brock Noland added a comment -

          Thanks Nick. Looks like this was caused by HBASE-9791. Was hive not correctly using the API or this an API change?

          Show
          Brock Noland added a comment - Thanks Nick. Looks like this was caused by HBASE-9791 . Was hive not correctly using the API or this an API change?
          Hide
          Nick Dimiduk added a comment -

          Oh, good catch Brock Noland. I think the responsibility lies with Hive because, in this case, the RecordReader is being consumed outside of the MapReduce Framework context. I don't know where Hive should initialize the RecordReader – my patch may result in a double-initialize for the mapreduce case, but corrects the problem for the local query. I think, somehow, processLocalCmd needs to indicate to HiveHBaseTableInputFormat that its RecordReader is being consumed outside of MapReduce. Perhaps you can advise?

          Show
          Nick Dimiduk added a comment - Oh, good catch Brock Noland . I think the responsibility lies with Hive because, in this case, the RecordReader is being consumed outside of the MapReduce Framework context. I don't know where Hive should initialize the RecordReader – my patch may result in a double-initialize for the mapreduce case, but corrects the problem for the local query. I think, somehow, processLocalCmd needs to indicate to HiveHBaseTableInputFormat that its RecordReader is being consumed outside of MapReduce. Perhaps you can advise?
          Hide
          Brock Noland added a comment -

          Hmm, does this same error not occur during a MR run?

          Show
          Brock Noland added a comment - Hmm, does this same error not occur during a MR run?
          Hide
          Nick Dimiduk added a comment -

          Not sure; can I force the SELECT statement to run as a job?

          Show
          Nick Dimiduk added a comment - Not sure; can I force the SELECT statement to run as a job?
          Hide
          Brock Noland added a comment -

          SELECT count(*) from table

          Show
          Brock Noland added a comment - SELECT count(*) from table
          Hide
          Nick Dimiduk added a comment -

          SELECT count FROM hbase_1; gave me a MR job. It ran without complaint.

          Show
          Nick Dimiduk added a comment - SELECT count FROM hbase_1; gave me a MR job. It ran without complaint.
          Hide
          Brock Noland added a comment -

          It ran correctly with or without the patch?

          Show
          Brock Noland added a comment - It ran correctly with or without the patch?
          Hide
          Nick Dimiduk added a comment -

          Ran correctly with the patch.

          BuildBot should verify as well, yes?

          Show
          Nick Dimiduk added a comment - Ran correctly with the patch. BuildBot should verify as well, yes?
          Hide
          Brock Noland added a comment -

          Thanks Nick! I have seen this error with MR + HBase trunk.

          As fair as I can tell we have not yet answered the question "Was hive not correctly using the API or this an API change?" I am fine making the change to Hive, but we should know if hive was simply using HBase incorrectly or if HBase is changing the API.

          Brock

          Show
          Brock Noland added a comment - Thanks Nick! I have seen this error with MR + HBase trunk. As fair as I can tell we have not yet answered the question "Was hive not correctly using the API or this an API change?" I am fine making the change to Hive, but we should know if hive was simply using HBase incorrectly or if HBase is changing the API. Brock
          Nick Dimiduk made changes -
          Link This issue relates to HBASE-9791 [ HBASE-9791 ]
          Hide
          Nick Dimiduk added a comment -

          Technically speaking, HBase changed the way it initializes this particular MapReduce component via a public API, thus HBase's API has changed. Hive is consuming that component without respecting its API – that is, Hadoop's mapreduce.RecordReader says the framework will call initialize, so Hive should be calling initialize when consuming it outside of the MapReduce Framework context. I think the answer is "both". Perhaps HBASE-9791 needs to add a release note.

          cc stack Jimmy Xiang.

          Show
          Nick Dimiduk added a comment - Technically speaking, HBase changed the way it initializes this particular MapReduce component via a public API, thus HBase's API has changed. Hive is consuming that component without respecting its API – that is, Hadoop's mapreduce.RecordReader says the framework will call initialize, so Hive should be calling initialize when consuming it outside of the MapReduce Framework context. I think the answer is "both". Perhaps HBASE-9791 needs to add a release note. cc stack Jimmy Xiang .
          Hide
          Brock Noland added a comment -

          Reuploading the exact same patch with a new file name so HiveQA will pick it up.

          Show
          Brock Noland added a comment - Reuploading the exact same patch with a new file name so HiveQA will pick it up.
          Brock Noland made changes -
          Attachment HIVE-6117.0.patch [ 12620732 ]
          Hide
          Brock Noland added a comment -

          Thank you for clarifying. I am +1 if the tests pass.

          Hadoop's mapreduce.RecordReader says the framework will call initialize,

          What's interesting is I have seen this same error occur within an MR job. Additionally we are doing some weird wrapping of RR here. Therefore I think it's likely that Hive should be calling initialize itself even in the MR case. I think this patch will in fact do this.

          Show
          Brock Noland added a comment - Thank you for clarifying. I am +1 if the tests pass. Hadoop's mapreduce.RecordReader says the framework will call initialize, What's interesting is I have seen this same error occur within an MR job. Additionally we are doing some weird wrapping of RR here. Therefore I think it's likely that Hive should be calling initialize itself even in the MR case. I think this patch will in fact do this.
          Hide
          Hive QA added a comment -

          Overall: +1 all checks pass

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12620732/HIVE-6117.0.patch

          SUCCESS: +1 4818 tests passed

          Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/762/testReport
          Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/762/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          

          This message is automatically generated.

          ATTACHMENT ID: 12620732

          Show
          Hive QA added a comment - Overall : +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12620732/HIVE-6117.0.patch SUCCESS: +1 4818 tests passed Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/762/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/762/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase This message is automatically generated. ATTACHMENT ID: 12620732
          Hide
          Jimmy Xiang added a comment -

          Nick Dimiduk, I think HBASE-9791 fixed a bug in HBase that does initialization twice. If some subclass needs to call initialization again, does this mean the fix in HBASE-9791 is not right? If it is right, why should HIVE need to call init again?

          Show
          Jimmy Xiang added a comment - Nick Dimiduk , I think HBASE-9791 fixed a bug in HBase that does initialization twice. If some subclass needs to call initialization again, does this mean the fix in HBASE-9791 is not right? If it is right, why should HIVE need to call init again?
          Hide
          Brock Noland added a comment -

          Hey guys,

          I think I see the issue here. o.a.h.mapreduce.RecordReader has an initialize method which is called by the mapper. o.a.h.mapred.RecordReader does not have an initialize method. Hive uses the mapred API so you can see here: https://github.com/apache/hive/blob/fb63a28cd5fddb5e5c974cab84cd9c3a4155e40d/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L178

          that we wrap the mapreduce RR from HBase in a mapred RR. Therefore it's hive's responsibility to call initialize which we are not doing. Therefore the patch looks correct and I will commit.

          Show
          Brock Noland added a comment - Hey guys, I think I see the issue here. o.a.h.mapreduce.RecordReader has an initialize method which is called by the mapper. o.a.h.mapred.RecordReader does not have an initialize method. Hive uses the mapred API so you can see here: https://github.com/apache/hive/blob/fb63a28cd5fddb5e5c974cab84cd9c3a4155e40d/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L178 that we wrap the mapreduce RR from HBase in a mapred RR. Therefore it's hive's responsibility to call initialize which we are not doing. Therefore the patch looks correct and I will commit.
          Hide
          Brock Noland added a comment -

          I also verified that this patch fixes the NPE I was seeing with both select * from table (no-MR job) and select count(*) from table (MR job).

          Show
          Brock Noland added a comment - I also verified that this patch fixes the NPE I was seeing with both select * from table (no-MR job) and select count(*) from table (MR job).
          Hide
          Brock Noland added a comment -

          Committed to trunk! Thank you Nick and Jimmy for your help on this one!

          Show
          Brock Noland added a comment - Committed to trunk! Thank you Nick and Jimmy for your help on this one!
          Brock Noland made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.13.0 [ 12324986 ]
          Resolution Fixed [ 1 ]
          Hide
          Nick Dimiduk added a comment -

          that we wrap the mapreduce RR from HBase in a mapred RR. Therefore it's hive's responsibility to call initialize which we are not doing. Therefore the patch looks correct and I will commit.

          Yes, that's my understanding too. Thanks Brock Noland, Jimmy Xiang for the context and quick response!

          Show
          Nick Dimiduk added a comment - that we wrap the mapreduce RR from HBase in a mapred RR. Therefore it's hive's responsibility to call initialize which we are not doing. Therefore the patch looks correct and I will commit. Yes, that's my understanding too. Thanks Brock Noland , Jimmy Xiang for the context and quick response!
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          8m 25s 1 Nick Dimiduk 27/Dec/13 19:54
          Patch Available Patch Available Resolved Resolved
          2d 18h 59m 1 Brock Noland 30/Dec/13 14:54

            People

            • Assignee:
              Nick Dimiduk
              Reporter:
              Nick Dimiduk
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development