HBASE-6660

Meta assignment to a region server caused continuous NPE loop during postOpenDeployTasks

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.92.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: CentOS release 5.7

    Description

      Recently I upgraded three data centers to our own checkout of 0.92.2; the last commit was:

      commit 5accb6a1be4776630126ac21d07adb652b74df95
      Author: Zhihong Yu <tedyu@apache.org>
      Date:   Mon Aug 20 18:19:45 2012 +0000
      HBASE-6608 Fix for HBASE-6160, META entries from daughters can be deleted before parent entries, shouldn't compare HRegionInfo's (Enis)
      

      Two upgrades went fine; the upgrade to one data center failed, in the sense that ROOT and META assignment took forever. Panic struck and I restarted the master and all region servers. I may have deleted the zookeeper node /hbase/root-region-server as well; don't ask me why.

      After this I managed to get ROOT assigned. But META assignment got stuck again.

      The log is here: https://raw.github.com/gist/3455435/adebd118b47aa3d715201010aa09e5eb8930033c/npe_rs_0.92.2.log

      Notice how the region server was stuck in a loop of NPEs (grep for processBatchCallback). There is one more NPE related to the zookeeper constructor.

        Activity

        Enis Soztutar added a comment -

        From the log, and the code:

        2012-08-22 20:26:28,045 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=1028785192
        java.lang.NullPointerException
        	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1527)
        	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1411)
        	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:915)
        	at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:792)
        	at org.apache.hadoop.hbase.client.HTable.put(HTable.java:767)
        	at org.apache.hadoop.hbase.catalog.MetaEditor.put(MetaEditor.java:99)
        	at org.apache.hadoop.hbase.catalog.MetaEditor.putToCatalogTable(MetaEditor.java:89)
        	at org.apache.hadoop.hbase.catalog.MetaEditor.updateLocation(MetaEditor.java:260)
        	at org.apache.hadoop.hbase.catalog.MetaEditor.updateMetaLocation(MetaEditor.java:222)
        	at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1494)
        	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$PostOpenDeployTasksThread.run(OpenRegionHandler.java:242)
        

        It seems that HConnectionManager.locateRegion() returned null, since RootRegionTracker.waitRootRegionLocation() returned null:

            private HRegionLocation locateRegion(final byte [] tableName,
              final byte [] row, boolean useCache)
            throws IOException {
              if (this.closed) throw new IOException(toString() + " closed");
              if (tableName == null || tableName.length == 0) {
                throw new IllegalArgumentException(
                    "table name cannot be null or zero length");
              }
        
              if (Bytes.equals(tableName, HConstants.ROOT_TABLE_NAME)) {
                try {
                  ServerName servername =
                    this.rootRegionTracker.waitRootRegionLocation(this.rpcTimeout);
                  LOG.debug("Lookedup root region location, connection=" + this +
                    "; serverName=" + ((servername == null)? "": servername.toString()));
                  if (servername == null) return null;
        ...
        

        From my understanding, the root location in ZK seems corrupted. Do you have a chance to post the master and the region server (the one that deployed ROOT) logs?
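
        A minimal, self-contained sketch of that failure pattern (hypothetical stand-in types, not the actual HBase classes): when the lookup returns null instead of throwing, the first dereference in the batch path raises the NPE, and the retrying caller loops on it.

        // Hypothetical sketch, not HBase code: stand-ins for the classes above.
        public class LocateRegionNpeSketch {

            static class RegionLocation {
                private final String hostnamePort;
                RegionLocation(String hostnamePort) { this.hostnamePort = hostnamePort; }
                String getHostnamePort() { return hostnamePort; }
            }

            // Stand-in for HConnectionImplementation.locateRegion(): like the
            // snippet above, it returns null when waitRootRegionLocation()
            // comes back empty instead of throwing an exception.
            static RegionLocation locateRegion(boolean rootLocationKnownInZk) {
                return rootLocationKnownInZk
                    ? new RegionLocation("ine-48.rfiserve.net:60020")
                    : null;
            }

            public static void main(String[] args) {
                RegionLocation loc = locateRegion(false);
                // Stand-in for the grouping step in processBatchCallback: the
                // location is used without a null check, so this line throws
                // the NullPointerException from the stack trace, and the
                // retrying PostOpenDeployTasksThread hits it again and again.
                System.out.println("put goes to " + loc.getHostnamePort());
            }
        }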

        Shrijeet Paliwal added a comment -

        Enis,
        Did you notice I mentioned that, in an attempt to get things going again, I not only did a complete restart but also deleted the zookeeper node /hbase/root-region-server? I did that delete when all region servers and the master were down.

        Let me get the other two logs.
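
        For reference, a hypothetical sketch of that inspect-and-delete using the plain ZooKeeper client (the quorum address is an assumption; the znode path is the one from the report). As noted above, this is only safe with the master and all region servers down.

        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        public class RootZnodeCheck {
            public static void main(String[] args) throws Exception {
                // Assumed quorum address; adjust to the real ensemble.
                ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
                try {
                    String znode = "/hbase/root-region-server";
                    Stat stat = zk.exists(znode, false);
                    if (stat == null) {
                        System.out.println(znode + " does not exist");
                    } else {
                        byte[] data = zk.getData(znode, false, stat);
                        System.out.println(znode + " = " + new String(data));
                        // Remove the (possibly stale) ROOT location; the master
                        // republishes it on the next successful assignment.
                        zk.delete(znode, stat.getVersion());
                    }
                } finally {
                    zk.close();
                }
            }
        }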

        Shrijeet Paliwal added a comment -

        Master log: https://raw.github.com/gist/3456346/defc781b356832e3a2bfc4264b28a3fa3ba2bd8b/master.log
        Root region server log (based on what I understood the master assigned ROOT to): https://raw.github.com/gist/3456292/38058d59fb147fd928b5b0a0107feab5fd8dc648/root.log
        Shrijeet Paliwal added a comment -

        Also, does anyone else find these log messages odd:

        2012-08-22 20:15:10,231 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at ine-60.rfiserve.net,60000,1345680901361
        2012-08-22 20:15:10,336 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at ine-51.rfiserve.net/172.22.2.51:60020
        

        ine-60 is the HMaster; ine-51 is the region server that printed these log messages.

        Shrijeet Paliwal added a comment - edited

        Opened HBASE-6662 for the incorrect address reporting issue.

        Edit: the JIRA URL wouldn't reflect the latest status.

        Enis Soztutar added a comment -

        I noticed only one odd thing; other than that, everything seems normal. The master, running at ine-60, deployed ROOT to ine-48 around 2012-08-22 20:15:30. Around 2012-08-22 20:24:33, the master seems to have been restarted, but not ine-48. After it comes up, it tries to assign META:

        2012-08-22 20:24:43,060 INFO org.apache.hadoop.hbase.master.HMaster: -ROOT- assigned=0, rit=false, location=ine-48.rfiserve.net,60020,1345680918003
        2012-08-22 20:24:43,325 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=ine-50.rfiserve.net,60020,1345680145127
        

        But the logs from the META RS seem to indicate that it has a hostname of ine-51, while the master thinks META is assigned to ine-50. Could you be having a DNS issue or a hostname mixup?
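
        One hypothetical way to check for that, using the hostnames from the logs, is a forward-then-reverse lookup from a cluster node:

        import java.net.InetAddress;

        // Hypothetical DNS sanity check for the suspected hostname mixup:
        // forward-resolve each server name, reverse-resolve the address
        // back, and flag any mismatch.
        public class DnsSanityCheck {
            public static void main(String[] args) throws Exception {
                String[] hosts = { "ine-50.rfiserve.net", "ine-51.rfiserve.net" };
                for (String host : hosts) {
                    InetAddress addr = InetAddress.getByName(host);
                    String reverse = addr.getCanonicalHostName();
                    System.out.printf("%s -> %s -> %s%n",
                        host, addr.getHostAddress(), reverse);
                    if (!reverse.equalsIgnoreCase(host)) {
                        System.out.println("  forward/reverse mismatch: possible DNS issue");
                    }
                }
            }
        }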

        stack added a comment -

        @Shrijeet Did you want to make this a blocker on 0.92.2? What do you think of Enis's question?

        Shrijeet Paliwal added a comment -

        Sorry for the late response.

        Response to Enis's question: a DNS issue or hostname mixup is unlikely. DNS is well set up here and has been working reliably for a long time.

        Stack, I do not have enough evidence or confidence to label this as a blocker. I did nasty things (killed processes out of order, deleted zookeeper data, etc.), so it could be a PEBKAC issue.

        I am planning to upgrade one more data center today or tomorrow, using the same HBase checkout as in this bug report. Up to you guys if you want to wait for my report.

        Thanks.

        stack added a comment -

        We can wait. There are a few unit tests yet to clean up.

        Ted Yu added a comment -

        @Shrijeet:
        We'll wait for your next upgrade.

        Thanks for trying out the new 0.92 build.

        Shrijeet Paliwal added a comment -

        The latest upgrade went smoothly. It's OK with me to make this 'not a blocker'.
        Thanks guys.

        Ted Yu added a comment -

        Thanks for the update.

        Waiting for this issue to be reproduced.


  People

  • Assignee: Unassigned
  • Reporter: Shrijeet Paliwal
  • Votes: 0
  • Watchers: 6