HBASE-5995: Fix and reenable TestLogRolling.testLogRollOnPipelineRestart
(Sub-task of HBASE-6891: Hadoop 2 unit test failures)

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.0, 0.95.1
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HBASE-5984 disabled this flaky test (see that issue for more). This issue is about getting it enabled again. Made a blocker on 0.96.0 so it gets attention.

      Attachments

      1. hbase-5995_v3.patch
        6 kB
        Enis Soztutar
      2. hbase-5995_v2.patch
        6 kB
        Enis Soztutar
      3. hbase-5995_v1.patch
        4 kB
        Enis Soztutar

          Activity

          Andrew Purtell added a comment -

          +1 for blocker status, thanks Stack. HBase should be able to survive a rolling DN restart and we need to test for that.

          Ted Yu added a comment -

          When I ran the test against hadoop 2.0, it failed with:

          testLogRollOnPipelineRestart(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling)  Time elapsed: 0.243 sec  <<< ERROR!
          java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1150895311-10.249.196.101-1349476630606:blk_7782056094701760427_1026; getBlockSize()=1472; corrupt=false; offset=0; locs=[127.0.0.1:44729, 127.0.0.1:38785]}
            at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:232)
            at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:177)
            at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:119)
            at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:112)
            at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:966)
            at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:212)
            at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:75)
            at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1768)
            at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:63)
            at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1688)
            at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1709)
            at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:56)
            at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:176)
            at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:82)
            at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnPipelineRestart(TestLogRolling.java:501)
          
          Andrew Purtell added a comment -

          This fails consistently against Hadoop 2.

          stack added a comment -

          Todd said he'd have a look... might have some advice for us.

          stack added a comment -

          I tried the test w/ hadoop 2.0.2-alpha and it fails same way.

          Todd Lipcon added a comment -

          I'm remembering now that this was due to HDFS-2288, which got closed as invalid. But I still think HDFS-2288 is valid, so I will do my best to revive it and convince the other HDFS developers of that.

          Sergey Shelukhin added a comment -

          This appears to be stuck.

          stack added a comment -

          Sergey Shelukhin A bunch has changed since then, what with fs op retries and the like. I have not tried the test lately; I would do that first. There is probably something to HDFS-2288 too...

          Enis Soztutar added a comment -

          Assigning this to myself to understand what is going on. Will update with results.

          Enis Soztutar added a comment -

          This still fails with 2.0.4-alpha. It seems that all the replicas are in state RWR (ReplicaWaitingToBeRecovered), and we cannot get the length of the block while the replicas are waiting to be recovered. Will inspect more and share the results.

          Enis Soztutar added a comment -

          Attaching a candidate patch for this. As per my initial analysis, it seems that we have to call recoverLease before opening the wal files for read.

          We do not crash the region servers in this test, so normally all log files should be closed and recoverLease() should not be necessary. However, we do restart all the datanodes, and when we then trigger a log roll, DFSOutputStream.close() receives an exception:

          2013-05-03 11:38:28,366 ERROR [RegionServer:1;10.11.3.18,51418,1367606292279.logRoller] wal.FSHLog(691): Failed close of HLog writer
          java.io.IOException: All datanodes 127.0.0.1:51404 are bad. Aborting...
            at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:941)
            at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:756)
            at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:425)
          

          We just ride over the close() exception, and thus should call recoverLease() afterwards.

          I have yet to check why on earth we are able to run this successfully with Hadoop 1.

          The test succeeds with the patch.
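
          For illustration, a minimal sketch of the recoverLease() pattern described above, run before opening a WAL file for read. This is a hedged sketch, not the committed patch: the helper name and the one-second polling interval are assumptions.

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.DistributedFileSystem;

            // Hedged sketch, not the committed patch: ask the namenode to recover
            // the lease on a WAL file so its last block is finalized before reading.
            public static void recoverWalLease(Configuration conf, Path walPath)
                throws IOException, InterruptedException {
              FileSystem fs = walPath.getFileSystem(conf);
              if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // recoverLease() returns true once the file is closed; poll until then.
                while (!dfs.recoverLease(walPath)) {
                  Thread.sleep(1000); // give block recovery time to complete
                }
              }
            }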

          Ted Yu added a comment -

          Under hadoop 1.0, testLogRollOnPipelineRestart took more than two minutes on a MacBook:

          Running org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
          2013-05-06 17:25:33.239 java[26563:1203] Unable to load realm info from SCDynamicStore
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 159.722 sec
          

          Can this test be separated out into its own test?

          Good job.

          stack added a comment -

          OK. Makes sense. The test always passes even though our recoverLease is currently kinda broken w/ its 4 second wait (my guess is the DNs in the test have much longer socket timeouts than this? We don't seem to change the default, looking at the head of the test).

          -1 on moving it to its own test. Why? (Or why would having the test standalone make it run faster?)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12582007/hbase-5995_v1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5572//console

          This message is automatically generated.

          Enis Soztutar added a comment -

          Here is a second attempt. This patch also fixes a condition where the hdfs output stream is closed because of errors, but we keep trying to hflush() before closing the stream even though it is already closed. FSDataOutputStream does not have an isClosed() kind of API, nor does it throw a special exception, so I had to parse the exception msg (which is ugly, I admit).

          In hadoop2, we check the GS's (generation stamps) of the block and replica to reason about the file length, which is why we get the error. In hadoop1, it seems that we do not do that, and accept the replica length without checking the GS (I might be wrong on this though, haven't confirmed it with hdfs folks).
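
          As a hedged sketch of the message-parsing workaround just described (the helper name is illustrative, not from the patch):

            import java.io.IOException;

            // Illustrative helper: FSDataOutputStream has no isClosed(), so match
            // on the exception text instead. Fragile, as admitted above.
            static boolean isClosedStreamException(IOException ioe) {
              String msg = ioe.getMessage();
              return msg != null && msg.contains("DFSOutputStream is closed");
            }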

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12582735/hbase-5995_v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5633//console

          This message is automatically generated.

          Sergey Shelukhin added a comment -
          -        closeBarrier.endOp();
          +        if (!closed) {
          +          closeBarrier.endOp();
          +        }
          

          The start code is:

                  boolean isClosed = closed;
                  if (isClosed || !closeBarrier.beginOp()) {
                    LOG.debug("HLog " + (isClosed ? "closed" : "closing") + ". Skipping rolling of writer");
                    return regionsToFlush;
                  }
          

          endOp() should be called only if "isClosed || !closeBarrier.beginOp()" evaluated to false. The saved isClosed needs to be checked (lest "closed" changed in the meantime), along with whether we actually managed to start the op.
          In fact, maybe it's better to move that if to before the "try"? (and keep logRollRunning in)

          +          try {
          +            sync();
          +          } catch (IOException ioe) {
          +            if (ioe.getMessage().contains("DFSOutputStream is closed")) {
          +              //do not throw this here, sync() failed we will just close the writer
          +            }
          +          }
          

          This "if" is pointless... should it throw in other cases? Also comment could be made more descriptive

          Enis Soztutar added a comment -

          Thanks Sergey for taking a look.

          End should not be called only if "isClosed || !closeBarrier.beginOp()" is false. isClosed needs to be checked lest "closed" changed

          Right. closed is not changed once it is set to true, but it might change from false to true.

          In fact maybe it's better to move that if before "try"? (and keep logRollRunning in)

          Then we have to replicate the log and return statements. This looked cleaner.

          This "if" is pointless... should it throw in other cases?

          Yes, it should definitely throw the exception in the else.
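
          Concretely, throwing in the else would reshape the quoted catch roughly like this (a hedged sketch; as noted a few comments later, this catch was ultimately removed altogether):

            try {
              sync();
            } catch (IOException ioe) {
              String msg = ioe.getMessage();
              if (msg != null && msg.contains("DFSOutputStream is closed")) {
                // sync() failed only because the stream is already closed;
                // fall through and close the writer.
              } else {
                throw ioe; // anything else must reach the normal close-failure handling
              }
            }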

          stack added a comment -

          If we do this:

          -          sync();
          +          try {
          +            sync();
          +          } catch (IOException ioe) {
          +            if (ioe.getMessage().contains("DFSOutputStream is closed")) {
          +              //do not throw this here, sync() failed we will just close the writer
          +            }
          +          }
          

          ... we will skip this:

                  LOG.error("Failed close of HLog writer", e);
                  int errors = closeErrorCount.incrementAndGet();
                  if (errors <= closeErrorsTolerated && !hasDeferredEntries()) {
                    LOG.warn("Riding over HLog close failure! error count="+errors);
                  } else {
                    if (hasDeferredEntries()) {
                      LOG.error("Aborting due to unflushed edits in HLog");
                    }
                    // Failed close of log file.  Means we're losing edits.  For now,
                    // shut ourselves down to minimize loss.  Alternative is to try and
                    // keep going.  See HBASE-930.
                    FailedLogCloseException flce =
                      new FailedLogCloseException("#" + currentfilenum);
                    flce.initCause(e);
                    throw flce;
                  }
          

          ... in particular, the abort of the server because we failed to sync out some deferred edits.

          Should the close go into a finally?

          Else looks good

          Enis Soztutar added a comment -

          we will skip this:.. in particular, the abort of the server because we failed sync out some deferred edits.

          Good catch. Although we are only catching the IOE in sync(), not in close(), we might still need the lines below for aborting the RS. I removed the exception catching there altogether. I had put it there because the logs were getting filled with hundreds of "DFSOutputStream is closed" exceptions, but with the sleep statement added in the LogSyncer exception handling, I think this is not the case anymore.

          Enis Soztutar added a comment -

          v3 patch addressing review comments.

          stack added a comment -

          +1 if test passes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12583273/hbase-5995_v3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:

          -1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.hbase.client.TestHCM.testDeleteForZKConnLeak(TestHCM.java:855)

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5696//console

          This message is automatically generated.

          Enis Soztutar added a comment -

          Failing test is possibly due to HBASE-8535. Will commit this now.

          Enis Soztutar added a comment -

          Finally committed this. Thanks for reviews.

          Hudson added a comment -

          Integrated in hbase-0.95 #196 (See https://builds.apache.org/job/hbase-0.95/196/)
          HBASE-5995 Fix and reenable TestLogRolling.testLogRollOnPipelineRestart (Revision 1483006)

          Result = FAILURE
          enis :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #4122 (See https://builds.apache.org/job/HBase-TRUNK/4122/)
          HBASE-5995 Fix and reenable TestLogRolling.testLogRollOnPipelineRestart (Revision 1483004)

          Result = FAILURE
          enis :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #532 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/532/)
          HBASE-5995 Fix and reenable TestLogRolling.testLogRollOnPipelineRestart (Revision 1483004)

          Result = FAILURE
          enis :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          Hudson added a comment -

          Integrated in hbase-0.95-on-hadoop2 #101 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/101/)
          HBASE-5995 Fix and reenable TestLogRolling.testLogRollOnPipelineRestart (Revision 1483006)

          Result = FAILURE
          enis :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
          stack added a comment -

          TestLogRolling.testLogRollOnPipelineRestart failed on hadoop1 here: https://builds.apache.org/job/hbase-0.95/352/consoleText It went zombie.

          In the double thread dump at the end:

          "pool-1-thread-1" prio=10 tid=0x73f9dc00 nid=0x3a34 in Object.wait() [0x7517d000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	- waiting on <0xcf624ad0> (a java.util.concurrent.atomic.AtomicLong)
          	at org.apache.hadoop.hbase.client.AsyncProcess.waitForNextTaskDone(AsyncProcess.java:634)
          	- locked <0xcf624ad0> (a java.util.concurrent.atomic.AtomicLong)
          	at org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:659)
          	at org.apache.hadoop.hbase.client.AsyncProcess.waitUntilDone(AsyncProcess.java:670)
          	at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:813)
          	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1170)
          	at org.apache.hadoop.hbase.client.HTable.put(HTable.java:753)
          	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.doPut(TestLogRolling.java:640)
          	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.writeData(TestLogRolling.java:248)
          	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnPipelineRestart(TestLogRolling.java:515)
          

          ... we are stuck here.

          The math looks like it could go wonky. But looking at the output for the test, it seems that when this test ran we got this:

          2013-07-23 01:23:29,560 INFO [pool-1-thread-1] hbase.HBaseTestingUtility(922): Minicluster is down
          2013-07-23 01:23:29,574 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart Thread=39 (was 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351 (was 368), ProcessCount=144 (was 142) - ProcessCount LEAK? -, AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)

          So we were up against the fd limit.

          stack added a comment -

          So we were up against the fd limit.

          When I say the above, I am wrong; I am misreading the output of the ResourceChecker. We are not up against the limit. This test went zombie on us. Let me open a new issue.


            People

            • Assignee: Enis Soztutar
            • Reporter: stack
            • Votes: 0
            • Watchers: 11
