HBase / HBASE-5828

TestLogRolling fails in 0.90 branch, killing the test suite on Jenkins

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

        Activity

        Transition Open → Resolved: time in source status 993d 10h 1m, executed 1 time, last by Cosmin Lehene on 07/Jan/15 14:53
        Cosmin Lehene made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Cannot Reproduce [ 5 ]
        xufeng made changes -
        Attachment HBASE-5828_90_v1.patch [ 12526473 ]
        Attachment HBASE-5828_90_v1_100times_test_result.txt [ 12526474 ]
        xufeng added a comment -

        I created a patch and ran the test 100 times via a shell loop:

        for ((i=1; i<=100; i++)); do
           mvn clean -Dtest=TestLogRolling test >> HBASE-5828_90_v1_100times_test_result.txt
           mv target target_${i}
        done
        

        It failed 2 times (the 64th and 75th runs); I do not yet know why.

        -------------------------------------------------------------------------------
        Test set: org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
        -------------------------------------------------------------------------------
        Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1,219.296 sec <<< FAILURE!
        testLogRollOnPipelineRestart(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling)  Time elapsed: 952.159 sec  <<< ERROR!
        org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1336683442040
        	at org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:802)
        	at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:560)
        	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnPipelineRestart(TestLogRolling.java:476)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
        	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140)
        	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165)
        	at org.apache.maven.surefire.Surefire.run(Surefire.java:107)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289)
        	at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005)
        Caused by: java.io.IOException: Error Recovery for block blk_1933341499089292179_1016 failed  because recovery from primary datanode 127.0.0.1:50920 failed 6 times.  Pipeline was 127.0.0.1:50920. Aborting...
        	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
        	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
        	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)
        
        -------------------------------------------------------------------------------
        Test set: org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
        -------------------------------------------------------------------------------
        Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1,065.652 sec <<< FAILURE!
        testLogRollOnPipelineRestart(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling)  Time elapsed: 830.246 sec  <<< ERROR!
        org.apache.hadoop.hbase.DroppedSnapshotException: region: TestLogRolling,,1336675047757.bfce863e5843952e14649dc6fc8a53dd.
        	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1003)
        	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:905)
        	at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:857)
        	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnPipelineRestart(TestLogRolling.java:509)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
        	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140)
        	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165)
        	at org.apache.maven.surefire.Surefire.run(Surefire.java:107)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289)
        	at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005)
        Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/root/TestLogRolling/bfce863e5843952e14649dc6fc8a53dd/.tmp/8292661038971118318 could only be replicated to 0 nodes, instead of 1
        
        xufeng added a comment -

        Hi,
        can anyone give me a suggestion?

        xufeng added a comment -

        If no one doubts that this is a test case problem,
        I think we have two choices to fix it:
        1. Roll the log writer belonging to the RS that holds the meta region, before testLogRollOnPipelineRestart#deleteTable() executes.
        2. Use the @Before and @After tags to start a fresh cluster for every @Test.

        Please give me suggestions, thanks.
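        A rough sketch of option 2, assuming the JUnit 4 lifecycle annotations and the HBaseTestingUtility mini-cluster API on this branch (an illustration only, not the actual patch — method names and cluster sizing are assumptions):

```java
// Sketch only: give each test its own mini cluster so one test's damaged
// DFS pipeline cannot leak into the next test.
public class TestLogRolling {
  private HBaseTestingUtility TEST_UTIL;

  @Before
  public void setUp() throws Exception {
    TEST_UTIL = new HBaseTestingUtility();
    TEST_UTIL.startMiniCluster(2);   // fresh cluster, datanodes, pipeline per @Test
  }

  @After
  public void tearDown() throws Exception {
    TEST_UTIL.shutdownMiniCluster(); // discard the possibly-broken cluster
  }
}
```

        The cost is a cluster start/stop per test method, but no test can then see a pipeline that a previous test has killed.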

        xufeng added a comment -

        @Rama
        In fact, I have run this case many times:
        if the meta and user regions are on the same server, it is OK.

        ramkrishna.s.vasudevan added a comment -

        Found the meta region and the user region deployed on different RSs; that may cause the problem.

        Can we always ensure the meta and user regions are on the same server?

        xufeng added a comment -

        @Rama
        >>Do you think doing as in trunk, adding @Before and @After, will solve this problem?
        I think yes. In @Before, it will create a new cluster, new datanodes, and a new pipeline.

        >>Because the way the pipeline is formed may not be in our control, right?
        Yes, but in this problem testLogRollOnPipelineRestart kills the datanodes in the pipeline, so I think this is a test case problem.

        ramkrishna.s.vasudevan added a comment -

        Xufeng, I can understand the problem here to some extent.

        A: Because with the @Before and @After tags, every case will have its own separate test cluster.

        Do you think doing as in trunk, adding @Before and @After, will solve this problem?
        Because the way the pipeline is formed may not be in our control, right?

        xufeng made changes -
        xufeng added a comment -

        1. The reproduction result.
        2. The test file with some debug code added.

        xufeng made changes -
        Field Original Value New Value
        Assignee xufeng [ xufeng ]
        xufeng added a comment -

        My Conclusion:
        The log writer pipeline of the meta regionserver was destroyed by testLogRollOnDatanodeDeath();
        operating on the HLog of the meta regionserver afterwards therefore fails.

        I think this is not a bug, just a test case problem. As in the other versions, we can segregate the test clusters with the @Before and @After tags.

        My Analysis:
        1. I checked the log from https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/472/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestLogRolling-output.txt
        and found the meta region and the user region deployed on different RSs; that may cause the problem.

        	Line 226: 2012-04-15 09:19:11,325 INFO  [MASTER_OPEN_REGION-hemera.apache.org:45266-0] handler.OpenedRegionHandler(137): The master has opened the region -ROOT-,,0.70236052 that was online on serverName=hemera.apache.org,46368,1334481545408, load=(requests=0, regions=0, usedHeap=71, maxHeap=1244)
        	Line 288: 2012-04-15 09:19:11,379 INFO  [MASTER_OPEN_REGION-hemera.apache.org:45266-1] handler.OpenedRegionHandler(137): The master has opened the region .META.,,1.1028785192 that was online on serverName=hemera.apache.org,46368,1334481545408, load=(requests=0, regions=0, usedHeap=71, maxHeap=1244)
        	Line 504: 2012-04-15 09:19:21,513 DEBUG [MASTER_OPEN_REGION-hemera.apache.org:45266-2] handler.OpenedRegionHandler(139): The master has opened the region TestLogRolling,,1334481561377.369bae2d9411e5836a17df5e31b07b5e. that was online on serverName=hemera.apache.org,46368,1334481545408, load=(requests=0, regions=2, usedHeap=76, maxHeap=1244)
        	Line 8652: 2012-04-15 09:20:09,099 DEBUG [MASTER_OPEN_REGION-hemera.apache.org:45266-3] handler.OpenedRegionHandler(139): The master has opened the region TestLogRolling,,1334481609001.198321df4a9c90eebcf670a2218d29f0. that was online on serverName=hemera.apache.org,34725,1334481545432, load=(requests=0, regions=0, usedHeap=129, maxHeap=1244)
        

        2. Next we should find where in the test the problem happened.
        I reproduced this issue again; this time I added debug code in testLogRollOnPipelineRestart():
        if (admin.tableExists(tableName)) {
          admin.disableTable(tableName);
        + LOG.info("xufeng7-------->before delete table!");
          admin.deleteTable(tableName);
        + LOG.info("xufeng7-------->after delete table!");
        }

        3. The test was killed, and I found the "before delete table!" string but no "after delete table!" string in the log.
        This confirms that the problem happened in admin.deleteTable(tableName).
        This is the region assignment info:

        	Line 226: 2012-05-08 16:45:17,603 INFO  [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:42383-0] handler.OpenedRegionHandler(137): The master has opened the region -ROOT-,,0.70236052 that was online on serverName=HADOOP-CI-AGENT-A,49991,1336466711048, load=(requests=0, regions=0, usedHeap=49, maxHeap=1348)
        	Line 288: 2012-05-08 16:45:17,696 INFO  [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:42383-1] handler.OpenedRegionHandler(137): The master has opened the region .META.,,1.1028785192 that was online on serverName=HADOOP-CI-AGENT-A,49991,1336466711048, load=(requests=0, regions=0, usedHeap=49, maxHeap=1348)
        	Line 505: 2012-05-08 16:45:28,099 DEBUG [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:42383-2] handler.OpenedRegionHandler(139): The master has opened the region TestLogRolling,,1336466727648.b119d8cad80bc39cdd38b04421a67280. that was online on serverName=HADOOP-CI-AGENT-A,57227,1336466711078, load=(requests=0, regions=0, usedHeap=54, maxHeap=1348)
        	Line 8495: 2012-05-08 16:46:18,169 DEBUG [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:42383-3] handler.OpenedRegionHandler(139): The master has opened the region TestLogRolling,,1336466777847.0117c81b3e59fdad15c4a4390156a558. that was online on serverName=HADOOP-CI-AGENT-A,57227,1336466711078, load=(requests=0, regions=0, usedHeap=68, maxHeap=1348)
        

        This is my debug info:

        	Line 8526: 2012-05-08 16:46:28,138 INFO  [main] wal.TestLogRolling(339): xufeng1-------->cluster.getRegionServer(0)serverName=HADOOP-CI-AGENT-A,49991,1336466711048, load=(requests=0, regions=2, usedHeap=72, maxHeap=1348)
        	Line 8527: 2012-05-08 16:46:28,138 INFO  [main] wal.TestLogRolling(344): xufeng1-------->127.0.0.1:36932
        	Line 8528: 2012-05-08 16:46:28,138 INFO  [main] wal.TestLogRolling(345): xufeng1-------->127.0.0.1:42887
        	Line 8529: 2012-05-08 16:46:28,138 INFO  [main] wal.TestLogRolling(351): xufeng2-------->cluster.getRegionServer(1)serverName=HADOOP-CI-AGENT-A,57227,1336466711078, load=(requests=0, regions=1, usedHeap=72, maxHeap=1348)
        	Line 8530: 2012-05-08 16:46:28,138 INFO  [main] wal.TestLogRolling(356): xufeng2-------->127.0.0.1:42887
        	Line 8531: 2012-05-08 16:46:28,138 INFO  [main] wal.TestLogRolling(357): xufeng2-------->127.0.0.1:36932
        	Line 9795: 2012-05-08 16:47:06,982 INFO  [main] wal.TestLogRolling(420): xufeng3-------->cluster.getRegionServer(0)serverName=HADOOP-CI-AGENT-A,49991,1336466711048, load=(requests=0, regions=2, usedHeap=73, maxHeap=1348)
        	Line 9796: 2012-05-08 16:47:06,982 INFO  [main] wal.TestLogRolling(425): xufeng3-------->127.0.0.1:36932
        	Line 9797: 2012-05-08 16:47:06,982 INFO  [main] wal.TestLogRolling(426): xufeng3-------->127.0.0.1:42887
        	Line 9798: 2012-05-08 16:47:06,983 INFO  [main] wal.TestLogRolling(432): xufeng4-------->cluster.getRegionServer(1)serverName=HADOOP-CI-AGENT-A,57227,1336466711078, load=(requests=0, regions=1, usedHeap=73, maxHeap=1348)
        	Line 9799: 2012-05-08 16:47:06,983 INFO  [main] wal.TestLogRolling(437): xufeng4-------->127.0.0.1:45998
        	Line 9800: 2012-05-08 16:47:06,983 INFO  [main] wal.TestLogRolling(438): xufeng4-------->127.0.0.1:54936
        	Line 9803: 2012-05-08 16:47:06,985 INFO  [main] wal.TestLogRolling(468): xufeng5-------->cluster.getRegionServer(0)serverName=HADOOP-CI-AGENT-A,49991,1336466711048, load=(requests=0, regions=2, usedHeap=73, maxHeap=1348)
        	Line 9804: 2012-05-08 16:47:06,985 INFO  [main] wal.TestLogRolling(473): xufeng5-------->127.0.0.1:36932
        	Line 9805: 2012-05-08 16:47:06,985 INFO  [main] wal.TestLogRolling(474): xufeng5-------->127.0.0.1:42887
        	Line 9806: 2012-05-08 16:47:06,985 INFO  [main] wal.TestLogRolling(480): xufeng6-------->cluster.getRegionServer(1)serverName=HADOOP-CI-AGENT-A,57227,1336466711078, load=(requests=0, regions=1, usedHeap=73, maxHeap=1348)
        	Line 9807: 2012-05-08 16:47:06,985 INFO  [main] wal.TestLogRolling(485): xufeng6-------->127.0.0.1:45998
        	Line 9808: 2012-05-08 16:47:06,985 INFO  [main] wal.TestLogRolling(486): xufeng6-------->127.0.0.1:54936
        	Line 9956: 2012-05-08 16:47:17,032 INFO  [main] wal.TestLogRolling(494): xufeng7-------->before delete table!
        

        4. To see what happened in admin.deleteTable(tableName), I checked the log and found:

        2012-05-08 16:46:55,975 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #0 from primary datanode 127.0.0.1:42887
        2012-05-08 16:47:06,981 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #1 from primary datanode 127.0.0.1:42887
        2012-05-08 16:47:17,986 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #2 from primary datanode 127.0.0.1:42887
        2012-05-08 16:47:27,990 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #3 from primary datanode 127.0.0.1:42887
        2012-05-08 16:47:38,995 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #4 from primary datanode 127.0.0.1:42887
        2012-05-08 16:47:49,999 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #5 from primary datanode 127.0.0.1:42887
        2012-05-08 16:48:01,003 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #0 from primary datanode 127.0.0.1:36932
        2012-05-08 16:48:12,007 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #1 from primary datanode 127.0.0.1:36932
        2012-05-08 16:48:23,011 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #2 from primary datanode 127.0.0.1:36932
        2012-05-08 16:48:34,015 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #3 from primary datanode 127.0.0.1:36932
        2012-05-08 16:48:45,019 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #4 from primary datanode 127.0.0.1:36932
        2012-05-08 16:48:56,023 WARN  [DataStreamer for file /user/root/.logs/HADOOP-CI-AGENT-A,49991,1336466711048/HADOOP-CI-AGENT-A%3A49991.1336466712229 block blk_2409401836437026689_1008] hdfs.DFSClient$DFSOutputStream(2707): Failed recovery attempt #5 from primary datanode 127.0.0.1:36932
        

        5. This causes the HLog writer to roll, and if the table delete fails, HBaseAdmin#deleteTable will loop for a long time until its retries are exhausted.

        2012-05-08 16:48:56,025 DEBUG [RegionServer:0;HADOOP-CI-AGENT-A,49991,1336466711048.logRoller] regionserver.LogRoller(90): HLog roll manually triggered
        2012-05-08 16:48:56,027 ERROR [PRI IPC Server handler 8 on 49991] regionserver.HRegionServer(970): 
        java.io.IOException: Reflection
        	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
        	at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1088)
        	at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:1023)
        	at org.apache.hadoop.hbase.regionserver.HRegion.delete(HRegion.java:1265)
        	at org.apache.hadoop.hbase.regionserver.HRegion.delete(HRegion.java:1185)
        	at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1950)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:602)
        	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
        Caused by: java.lang.reflect.InvocationTargetException
        	at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
        	... 11 more
        Caused by: java.io.IOException: DFSOutputStream is closed
        	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3238)
        	at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
        	at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
        	... 15 more
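        The "loop until retries exhausted" behavior described above can be illustrated with a minimal, self-contained sketch. This is not HBase's actual HBaseAdmin code; the class, method names, and constants here are hypothetical stand-ins for a bounded retry loop of that general shape.

```java
// Minimal sketch of a bounded retry loop, similar in spirit to the
// HBaseAdmin#deleteTable retry behavior described above.
// All names and constants are hypothetical, for illustration only.
import java.util.function.BooleanSupplier;

public class RetrySketch {
    /**
     * Invoke the operation up to maxAttempts times, pausing between
     * attempts. Returns true on the first success, false once every
     * attempt is used up.
     */
    static boolean withRetries(BooleanSupplier op, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (op.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // An operation that always fails (like deleteTable while the meta
        // regionserver's WAL pipeline is dead) keeps the caller looping
        // until the attempt budget runs out.
        boolean ok = withRetries(() -> false, 5, 10);
        System.out.println(ok ? "succeeded" : "retries exhausted");
    }
}
```

        With a long pause and a large attempt budget, such a loop is exactly what makes the killed test hang for many minutes before the surefire timeout fires.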
        
        
        stack added a comment -

        Xufeng: If you want to experiment, add logging or whatever, just make a patch and I'll apply it to 0.90 for you. It's hard figuring this stuff out. You need all the clues you can get. If you want me to run a bunch of 0.90 builds up on Jenkins, just ask for it and I'll do it for you too. Thanks (looks like there are moves to make a 0.90.7 coming, so it would be cool to nail this flapping test soon). Thanks Xufeng.

        Ted Yu added a comment - https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/472/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestLogRolling-output.txt
        xufeng added a comment -

        @Ted
        Where can I get the log file?

        Ted Yu added a comment -

        That build is no longer on Jenkins.
        You can look at:
        https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/472/console

        xufeng added a comment -

        I cannot access https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/471/console
        Is the problem the same as the one I met?

        xufeng added a comment -

        Q: Why don't the other versions have this problem?
        A: Because, with the @Before and @After annotations, every test case gets its own isolated test cluster.
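        The per-test isolation that the @Before/@After pattern buys can be sketched in plain Java (no JUnit dependency here; FakeCluster is a hypothetical stand-in for the mini HBase/DFS cluster, and setUp/tearDown mirror what @Before/@After-annotated methods would do):

```java
// Sketch of per-test cluster isolation, mirroring what the @Before/@After
// pattern in the newer TestLogRolling versions achieves.
// FakeCluster and all method names are hypothetical stand-ins.
import java.util.ArrayList;
import java.util.List;

public class PerTestClusterSketch {
    /** Stand-in for a mini HBase/DFS test cluster. */
    static class FakeCluster {
        boolean running;
        void start() { running = true; }
        void shutdown() { running = false; }
    }

    static final List<FakeCluster> used = new ArrayList<>();
    static FakeCluster cluster;

    // Equivalent of @Before: a fresh cluster for every test case, so one
    // test killing its datanodes cannot poison the next test's pipeline.
    static void setUp() {
        cluster = new FakeCluster();
        cluster.start();
        used.add(cluster);
    }

    // Equivalent of @After: tear the cluster down unconditionally.
    static void tearDown() {
        cluster.shutdown();
    }

    static void runTest(Runnable body) {
        setUp();
        try {
            body.run();
        } finally {
            tearDown();
        }
    }

    public static void main(String[] args) {
        runTest(() -> { /* testLogRollOnDatanodeDeath would kill datanodes here */ });
        runTest(() -> { /* testLogRollOnPipelineRestart starts with a clean pipeline */ });
        System.out.println(used.size() + " clusters, all stopped");
    }
}
```

        The 0.90 branch instead starts one shared cluster for the whole class, which is why damage done by one test method leaks into the next.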

        xufeng added a comment -

        I added some debug logging and reproduced it in my test environment; I got the error where TestLogRolling is killed.
        1.The region assignment info:

        	Line 227: 2012-05-08 10:33:46,705 INFO  [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:56223-0] handler.OpenedRegionHandler(137): The master has opened the region -ROOT-,,0.70236052 that was online on serverName=HADOOP-CI-AGENT-A,36476,1336444420154, load=(requests=0, regions=0, usedHeap=49, maxHeap=1348)
        	Line 288: 2012-05-08 10:33:46,797 INFO  [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:56223-1] handler.OpenedRegionHandler(137): The master has opened the region .META.,,1.1028785192 that was online on serverName=HADOOP-CI-AGENT-A,36476,1336444420154, load=(requests=0, regions=0, usedHeap=49, maxHeap=1348)
        	Line 505: 2012-05-08 10:33:57,224 DEBUG [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:56223-2] handler.OpenedRegionHandler(139): The master has opened the region TestLogRolling,,1336444436745.7365edcbf8a8a078ce2c04f08a08e021. that was online on serverName=HADOOP-CI-AGENT-A,59995,1336444420124, load=(requests=0, regions=0, usedHeap=50, maxHeap=1348)
        	Line 8497: 2012-05-08 10:34:47,678 DEBUG [MASTER_OPEN_REGION-HADOOP-CI-AGENT-A:56223-3] handler.OpenedRegionHandler(139): The master has opened the region TestLogRolling,,1336444487331.d0ba082226efdac56a16272b51d1147d. that was online on serverName=HADOOP-CI-AGENT-A,59995,1336444420124, load=(requests=0, regions=0, usedHeap=51, maxHeap=1348)
        

        2. The pipeline info confirms that the datanodes in the pipeline of the RS holding the meta region did not change, which causes the problem.

        	Line 8528: 2012-05-08 10:34:57,648 INFO  [main] wal.TestLogRolling(339): xufeng1-------->cluster.getRegionServer(0)serverName=HADOOP-CI-AGENT-A,59995,1336444420124, load=(requests=0, regions=1, usedHeap=50, maxHeap=1348)
        	Line 8529: 2012-05-08 10:34:57,648 INFO  [main] wal.TestLogRolling(344): xufeng1-------->127.0.0.1:47107
        	Line 8530: 2012-05-08 10:34:57,648 INFO  [main] wal.TestLogRolling(345): xufeng1-------->127.0.0.1:33211
        	Line 8531: 2012-05-08 10:34:57,648 INFO  [main] wal.TestLogRolling(351): xufeng2-------->cluster.getRegionServer(1)serverName=HADOOP-CI-AGENT-A,36476,1336444420154, load=(requests=0, regions=2, usedHeap=50, maxHeap=1348)
        	Line 8532: 2012-05-08 10:34:57,648 INFO  [main] wal.TestLogRolling(356): xufeng2-------->127.0.0.1:33211
        	Line 8533: 2012-05-08 10:34:57,649 INFO  [main] wal.TestLogRolling(357): xufeng2-------->127.0.0.1:47107
        	Line 9536: 2012-05-08 10:35:35,470 INFO  [main] wal.TestLogRolling(420): xufeng3-------->cluster.getRegionServer(0)serverName=HADOOP-CI-AGENT-A,59995,1336444420124, load=(requests=9, regions=1, usedHeap=49, maxHeap=1348)
        	Line 9537: 2012-05-08 10:35:35,470 INFO  [main] wal.TestLogRolling(425): xufeng3-------->127.0.0.1:43393
        	Line 9538: 2012-05-08 10:35:35,471 INFO  [main] wal.TestLogRolling(426): xufeng3-------->127.0.0.1:41380
        	Line 9723: 2012-05-08 10:35:36,440 INFO  [main] wal.TestLogRolling(432): xufeng4-------->cluster.getRegionServer(1)serverName=HADOOP-CI-AGENT-A,36476,1336444420154, load=(requests=0, regions=2, usedHeap=52, maxHeap=1348)
        	Line 9724: 2012-05-08 10:35:36,440 INFO  [main] wal.TestLogRolling(437): xufeng4-------->127.0.0.1:33211
        	Line 9725: 2012-05-08 10:35:36,440 INFO  [main] wal.TestLogRolling(438): xufeng4-------->127.0.0.1:47107
        	Line 9729: 2012-05-08 10:35:36,442 INFO  [main] wal.TestLogRolling(468): xufeng5-------->cluster.getRegionServer(0)serverName=HADOOP-CI-AGENT-A,59995,1336444420124, load=(requests=9, regions=1, usedHeap=49, maxHeap=1348)
        	Line 9730: 2012-05-08 10:35:36,442 INFO  [main] wal.TestLogRolling(473): xufeng5-------->127.0.0.1:43393
        	Line 9731: 2012-05-08 10:35:36,442 INFO  [main] wal.TestLogRolling(474): xufeng5-------->127.0.0.1:41380
        	Line 9732: 2012-05-08 10:35:36,443 INFO  [main] wal.TestLogRolling(480): xufeng6-------->cluster.getRegionServer(1)serverName=HADOOP-CI-AGENT-A,36476,1336444420154, load=(requests=0, regions=2, usedHeap=52, maxHeap=1348)
        	Line 9733: 2012-05-08 10:35:36,443 INFO  [main] wal.TestLogRolling(485): xufeng6-------->127.0.0.1:33211
        	Line 9734: 2012-05-08 10:35:36,443 INFO  [main] wal.TestLogRolling(486): xufeng6-------->127.0.0.1:47107
        
        xufeng added a comment -

        I just reviewed the code and have a speculation about this issue:
        Suppose the region of the META table is assigned to regionserver_A,
        and the region of the user table is assigned to regionserver_B.

        At first, the pipelines of regionserver_A and regionserver_B contain the same datanodes.
        The test case testLogRollOnDatanodeDeath kills all the datanodes in the pipeline belonging to the HLog writer of regionserver_B.

        When the test case testLogRollOnPipelineRestart calls deleteTable, it must update the META data.
        Because the datanodes in the pipeline of regionserver_A have been stopped, this fails.

        The test then waits in HBaseAdmin#deleteTable() until its retries are exhausted.
        After a long time, the test is killed.
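        The speculated failure mode can be modeled as a toy: two regionservers' WAL writers share the same datanodes, so a test that kills the datanodes under regionserver_B's pipeline breaks regionserver_A's pipeline too. Everything here (DataNode, WalWriter) is a hypothetical stand-in, not HDFS code.

```java
// Toy model of the scenario described above: two WAL writers share the
// same two datanodes, so killing them for one pipeline kills both.
// All classes here are hypothetical stand-ins, not HDFS code.
import java.io.IOException;
import java.util.List;

public class SharedPipelineSketch {
    static class DataNode {
        boolean alive = true;
    }

    static class WalWriter {
        final List<DataNode> pipeline;
        WalWriter(List<DataNode> pipeline) { this.pipeline = pipeline; }

        // Sync fails if any datanode in the pipeline is dead, analogous to
        // the "DFSOutputStream is closed" error in the stack trace above.
        void sync() throws IOException {
            for (DataNode dn : pipeline) {
                if (!dn.alive) {
                    throw new IOException("DFSOutputStream is closed");
                }
            }
        }
    }

    public static void main(String[] args) {
        DataNode dn1 = new DataNode(), dn2 = new DataNode();
        WalWriter metaRs = new WalWriter(List.of(dn1, dn2)); // regionserver_A (META)
        WalWriter userRs = new WalWriter(List.of(dn1, dn2)); // regionserver_B (user table)

        // testLogRollOnDatanodeDeath kills every datanode in userRs's pipeline...
        userRs.pipeline.forEach(dn -> dn.alive = false);

        // ...and the META regionserver's next sync (e.g. the META edit done
        // by deleteTable) now fails too, stalling testLogRollOnPipelineRestart.
        try {
            metaRs.sync();
            System.out.println("meta sync ok");
        } catch (IOException e) {
            System.out.println("meta sync failed: " + e.getMessage());
        }
    }
}
```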

        stack created issue -

          People

          • Assignee:
            xufeng
          • Reporter:
            stack
          • Votes:
            0
          • Watchers:
            3