  1. HBase
  2. HBASE-8096

[replication] NPE while replicating a log that is acquiring a new block from HDFS

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.5
    • Fix Version/s: 0.98.0, 0.94.7, 0.95.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We're getting an NPE during replication, which causes replication for that RegionServer to stop until we restart it.

      2013-03-10 12:49:12,679 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.updateBlockInfo(DFSClient.java:1882)
              at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1855)
              at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1831)
              at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
              at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
              at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
              at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1495)
              at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
              at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1482)
              at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
              at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
              at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
              at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:308)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:69)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:505)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
      

      Some extra digging into the DataNode and NameNode logs suggests this is related to HBASE-7530 and HDFS-4380.

      Here are the relevant snippets from the RS, DN, and NN logs:

      RS 2013-03-10 12:49:12,618 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #hslave1177%2C60020%2C1362549511446.1362944946489 for position 59670826 in hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
      RS 2013-03-10 12:49:12,621 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
      RS 2013-03-10 12:49:12,623 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 31500300
      RS 2013-03-10 12:49:12,623 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication hslave1177%2C60020%2C1362549511446.1362944946489 at 59670826
      NN 2013-03-10 12:49:12,627 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489. blk_6905758215335505153_656717631
      RS 2013-03-10 12:49:12,679 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
      DN 2013-03-10 12:49:12,680 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6905758215335505153_656717631 src: /192.168.44.1:43503 dest: /192.168.44.1:50010
      NN 2013-03-10 12:49:12,804 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.fsync: file /hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489 for DFSClient_hb_rs_hslave1177,60020,1362549511446
      
      1. HBASE-8096.0.94.patch
        3 kB
        Dave Latham
      2. HBASE-8096.patch
        3 kB
        Dave Latham

          Activity

          hudson Hudson added a comment -

          Integrated in HBase-0.94-security #137 (See https://builds.apache.org/job/HBase-0.94-security/137/)
          HBASE-8096 [replication] NPE while replicating a log that is acquiring a new block from HDFS (Revision 1467660)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationHLogReaderManager.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #498 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/498/)
          HBASE-8096 [replication] NPE while replicating a log that is acquiring a new block from HDFS (Revision 1467662)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationHLogReaderManager.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
          hudson Hudson added a comment -

          Integrated in hbase-0.95-on-hadoop2 #68 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/68/)
          HBASE-8096 [replication] NPE while replicating a log that is acquiring a new block from HDFS (Revision 1467661)

          Result = FAILURE
          stack :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationHLogReaderManager.java
          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
          hudson Hudson added a comment -

          Integrated in hbase-0.95 #146 (See https://builds.apache.org/job/hbase-0.95/146/)
          HBASE-8096 [replication] NPE while replicating a log that is acquiring a new block from HDFS (Revision 1467661)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationHLogReaderManager.java
          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #4063 (See https://builds.apache.org/job/HBase-TRUNK/4063/)
          HBASE-8096 [replication] NPE while replicating a log that is acquiring a new block from HDFS (Revision 1467662)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationHLogReaderManager.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
          hudson Hudson added a comment -

          Integrated in HBase-0.94 #959 (See https://builds.apache.org/job/HBase-0.94/959/)
          HBASE-8096 [replication] NPE while replicating a log that is acquiring a new block from HDFS (Revision 1467660)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationHLogReaderManager.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
          stack stack added a comment -

          Committed to 0.94, 0.95 and trunk. Thanks for patch Dave.

          lhofhansl Lars Hofhansl added a comment -

          +1

          stack stack added a comment -

          Lars Hofhansl Any chance of committing this for 0.94.7? (See above)

          davelatham Dave Latham added a comment -

          Thanks guys for looking it over. Would love to see it committed for 0.94.7.

          ctrezzo Chris Trezzo added a comment -

          Looks good to me. ReplicationHLogReaderManager.openReader is the only place where replication tries to reset the HLogReader. I think it makes sense to sleep/retry when the reset is not successful, and I like relying on ReplicationSource.run for the retry.

          +1 on the patch for 0.94 and trunk/0.95.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12577922/HBASE-8096.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5236//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12577916/HBASE-8096.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5232//console

          This message is automatically generated.

          davelatham Dave Latham added a comment -

          First trunk patch omitted a LOG.warn that should be there. This one includes it.

          stack stack added a comment -

          +1 on patch. Waiting on hadoopqa run. Chris Trezzo or Jean-Daniel Cryans, you good w/ this?

          davelatham Dave Latham added a comment -

          Here's a patch for trunk / 0.95

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12577063/HBASE-8096.0.94.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5140//console

          This message is automatically generated.

          davelatham Dave Latham added a comment -

          I was able to regularly reproduce this problem with a few steps:

          • Revert HBASE-7530 (i.e. put back conf1.setInt("hbase.regionserver.hlog.blocksize", 1024*20) )
          • Introduce a Thread.sleep(500) at the beginning of DataXceiver.writeBlock
          • Run TestReplicationSmallTests.loadTesting (more likely to occur if you change NB_ROWS_IN_BIG_BATCH to something larger)

          Watching the test output brought some more insight. In production we only saw this happening on region servers when they were reusing an existing reader (on a new data block). In the test I could also see the NPE logged when a new reader was being used, but there it was caught and wrapped in an IOException by HLog.getReader, so it was retried by the existing logic in ReplicationSource.openReader and ReplicationSource.run.

          I'm attaching a patch for 0.94 which solves the problem for the test case I described. It updates ReplicationHLogReaderManager.openReader to catch the NPE and wrap it in an IOException so that both cases react the same way. Then, when an IOException is caught in ReplicationSource.openReader, it checks whether the cause is an NPE and, if so, allows ReplicationSource.run to retry the file.
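
A rough sketch of the wrap-and-retry idea described above, for readers who do not want to open the patch. This is not the committed code: NpeWrappingSketch, LogReader, resetReader and tryOpen are hypothetical stand-ins for the HBase classes and methods named in the comment, and the loop in main only imitates the sleep/retry that ReplicationSource.run is said to perform.

```java
import java.io.IOException;

// Illustrative sketch only; every name here is a hypothetical stand-in,
// not the actual HBase classes or the committed patch.
class NpeWrappingSketch {

    /** Stand-in for a WAL reader whose reset() may hit the HDFS NPE. */
    interface LogReader {
        void reset() throws IOException;
    }

    /** Idea attributed to ReplicationHLogReaderManager.openReader: wrap the NPE. */
    static void resetReader(LogReader reader) throws IOException {
        try {
            reader.reset();
        } catch (NullPointerException npe) {
            // Surface the NPE as an IOException so the reused-reader path
            // fails the same way as the new-reader path.
            throw new IOException("NPE while resetting the log reader", npe);
        }
    }

    /**
     * Idea attributed to ReplicationSource.openReader: returns true if the
     * reader was opened, false if the caller should sleep and retry.
     */
    static boolean tryOpen(LogReader reader) throws IOException {
        try {
            resetReader(reader);
            return true;
        } catch (IOException ioe) {
            if (ioe.getCause() instanceof NullPointerException) {
                return false; // treated as transient: let the run loop retry
            }
            throw ioe; // any other IOException is left to existing handling
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Throws an NPE on the first two calls, then succeeds, imitating a
        // block that becomes readable a moment later.
        LogReader flaky = () -> {
            if (calls[0]++ < 2) {
                throw new NullPointerException("block info not yet available");
            }
        };
        int retries = 0;
        while (!tryOpen(flaky)) {
            retries++;
            Thread.sleep(10); // stand-in for the run loop's sleep between retries
        }
        System.out.println("opened after " + retries + " retries");
    }
}
```

The point of the sketch is only that the NPE is converted once, close to where it is thrown, and that the decision to retry is made by inspecting the cause rather than by catching NullPointerException high up the call chain.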

          It would be great if one or two people could give it a look, especially Jean-Daniel Cryans if you have a moment.

          I'll look at a patch for 0.95 and trunk.

          ianfriedman Ian Friedman added a comment -

          The other alternative was to patch ReplicationSource.openReader to catch an NPE thrown from repLogReader.openReader and then retry by re-calling openReader, but I was concerned about catching something as serious as an NPE that far up the chain, and Dave Latham was concerned about potential infinite recursion.
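
For illustration only, the recursion concern can be seen by contrast with a bounded loop. The sketch below is not what was proposed or committed in this issue; BoundedRetrySketch and retryOnNpe are hypothetical names, and the attempt cap and sleep are arbitrary assumptions. It shows that a retry written as a loop with a limit cannot recurse without end, although it still catches NullPointerException directly, which is the other concern raised above.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HBase: retries an operation that may
// throw a NullPointerException, using a loop with an attempt cap instead
// of a method that re-calls itself.
class BoundedRetrySketch {

    static <T> T retryOnNpe(Callable<T> op, int maxAttempts, long sleepMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NullPointerException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (NullPointerException npe) {
                last = npe;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs); // back off before the next attempt
                }
            }
        }
        // All attempts failed with an NPE: give up and rethrow the last one.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = retryOnNpe(() -> {
            if (calls[0]++ < 2) {
                throw new NullPointerException("not ready yet");
            }
            return "opened";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```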

          ianfriedman Ian Friedman added a comment -

          One workaround we've come up with is to set hbase.regionserver.hlog.blocksize high, like 128M, and then set hbase.regionserver.logroll.multiplier to 50% (0.5).
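
A small arithmetic sketch of why this workaround should help, under one assumption that is not stated in this issue: that the region server rolls the log once it reaches blocksize × multiplier bytes. If that holds, a log configured this way is rolled at about half a block and should rarely need a second HDFS block while replication is reading it. LogRollSizeSketch and rollSizeBytes are hypothetical names used only for the calculation.

```java
// Assumption (not confirmed in this issue): the roll threshold is
// computed as blocksize * multiplier. Names here are hypothetical.
class LogRollSizeSketch {

    static long rollSizeBytes(long blockSizeBytes, float multiplier) {
        return (long) (blockSizeBytes * multiplier);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // hbase.regionserver.hlog.blocksize = 128M
        float multiplier = 0.5f;             // hbase.regionserver.logroll.multiplier = 50%
        long rollSize = rollSizeBytes(blockSize, multiplier);
        System.out.println("block size: " + blockSize + " bytes, roll threshold: " + rollSize + " bytes");
    }
}
```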


            People

            • Assignee:
              davelatham Dave Latham
              Reporter:
              ianfriedman Ian Friedman
            • Votes:
              0
              Watchers:
              7
