HBASE-7728: deadlock occurs between hlog roller and hlog syncer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.94.2
    • Fix Version/s: 0.94.5, 0.95.0
    • Component/s: wal
    • Labels:
      None
    • Environment:

      Linux 2.6.18-164.el5 x86_64 GNU/Linux

    • Hadoop Flags:
      Reviewed

      Description

      The HLog roller thread and the HLog syncer thread may deadlock on the 'flushLock' and 'updateLock', which then causes all 'IPC Server handler' threads to block on HLog append. The jstack output is as follows:

      "regionserver60020.logRoller":
              at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1305)
              - waiting to lock <0x000000067bf88d58> (a java.lang.Object)
              at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
              at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
              at org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:876)
              at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:657)
              - locked <0x000000067d54ace0> (a java.lang.Object)
              at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
              at java.lang.Thread.run(Thread.java:662)
      "regionserver60020.logSyncer":
              at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1314)
              - waiting to lock <0x000000067d54ace0> (a java.lang.Object)
              - locked <0x000000067bf88d58> (a java.lang.Object)
              at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
              at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
              at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1235)
              at java.lang.Thread.run(Thread.java:662)
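
      For readers less familiar with this failure mode, here is a minimal standalone sketch of the inverted lock ordering the dump shows (this is not HBase code; the class and method names are illustrative only): one thread takes updateLock and then waits for flushLock, while the other holds flushLock and waits for updateLock.

      // Minimal illustration of the inverted lock ordering in the dump above.
      // NOT HBase code; names merely mirror the two HLog monitors.
      public class LockOrderDeadlockDemo {
        private final Object updateLock = new Object();
        private final Object flushLock = new Object();

        // Mirrors the roller: rollWriter() holds updateLock, then the cleanup sync needs flushLock.
        void roll() {
          synchronized (updateLock) {
            synchronized (flushLock) {   // blocks if the syncer already holds flushLock
              // flush pending edits before swapping the writer
            }
          }
        }

        // Mirrors the syncer's retry path: holds flushLock, then asks for updateLock.
        void sync() {
          synchronized (flushLock) {
            synchronized (updateLock) {  // blocks if the roller already holds updateLock
              // retry the flush against the (possibly new) writer
            }
          }
        }

        public static void main(String[] args) {
          LockOrderDeadlockDemo demo = new LockOrderDeadlockDemo();
          new Thread(demo::roll, "roller").start();
          new Thread(demo::sync, "syncer").start();
          // With unlucky timing the two threads block each other forever,
          // exactly the wait cycle shown in the jstack output.
        }
      }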
      Attachments

      1. 7728-0.94.txt
        2 kB
        Ted Yu
      2. 7728-0.94-simplified.txt
        2 kB
        Ted Yu
      3. 7728-0.94-v2.txt
        2 kB
        Ted Yu
      4. 7728-suggest.txt
        2 kB
        Lars Hofhansl
      5. 7728-suggest-0.96.txt
        2 kB
        Lars Hofhansl
      6. 7728-v1.txt
        0.9 kB
        Ted Yu
      7. 7728-v2.txt
        2 kB
        Ted Yu
      8. 7728-v3.txt
        2 kB
        Ted Yu
      9. 7728-v4.txt
        2 kB
        Ted Yu

        Activity

        Anoop Sam John added a comment -

        The LogRoller thread is trying to roll over the current log file. It has already captured the updateLock.

        HLog#rollWriter(boolean force)
        synchronized (updateLock) {
          // Clean up current writer.
          Path oldFile = cleanupCurrentWriter(currentFilenum);
          this.writer = nextWriter;
          ....
        }

        As part of cleaning up the current writer, this thread tries to sync the pending writes:

        HLog#cleanupCurrentWriter() {
          ....
            sync();
          }
          this.writer.close();
        }

        At the same time, the logSyncer thread was doing a deferred log sync operation:

        HLog#syncer(long txid) {
          ...
          synchronized (flushLock) {
            ....
            try {
              logSyncerThread.hlogFlush(tempWriter, pending);
            } catch (IOException io) {
              synchronized (this.updateLock) {
                // HBASE-4387, HBASE-5623, retry with updateLock held
                tempWriter = this.writer;
                logSyncerThread.hlogFlush(tempWriter, pending);
              }
            }
          }
        

        This thread is trying to grab the updateLock while holding the flushLock. At the same time the roller thread comes along and, as part of the cleanup sync, tries to grab the flushLock.
        An IOException must have happened in the logSyncer thread (logSyncerThread.hlogFlush). At that point the assumption is that a log rollover has already happened; that is why we retry the write with the updateLock held and fetch the writer again. [The writer on which the IOE happened should have been closed.]

        In the roller thread the writer close happens after the cleanup operation.
        So I guess logSyncerThread.hlogFlush threw the IOE not because of a log roll.
        Without assuming a log roll in the catch block, can we check for tempWriter == this.writer?

        I am not an expert in this area; I am just adding my observations from a quick code study. If I am wrong, please correct me. Do you have any logs from when this happened?

        Anoop Sam John added a comment -

        Without assuming a log roll in the catch block, can we check for tempWriter == this.writer?

        Not correct... Can we tell whether the IOE was caused by a parallel writer close?
        ramkrishna.s.vasudevan added a comment -

        Yes, logs will be needed.
        If the sync is still going on from cleanupCurrentWriter, it means that this.writer is still not null.
        If this.writer is not null, then the IOE should not have happened either.

        If the sync had happened, then syncedTillHere should have changed. So this is a nice thing to analyse and debug.
        The updateLock is also needed while changing the writer.

        Anoop Sam John added a comment -

        Yes Ram, the locks are correctly used as far as I have seen.
        Ideally logSyncerThread.hlogFlush should not throw an IOE, as it is clear from the thread dumb that the roller is not even near the point where it closes and resets the current writer. Still, it seems the IOE happened. That is why it is asking for the updateLock.

        Anoop Sam John added a comment -

        Sorry.. thread dump

        Anoop Sam John added a comment -

        Wang Qiang Any chance for logs?

        ramkrishna.s.vasudevan added a comment -

        I am checking with the latest 0.94 code. Maybe 0.94.2 has some changes, as per the line numbers in the thread dump?

        Anoop Sam John added a comment -

        I can see an IOE wrapping an NPE being thrown when a concurrent writer close happens.
        This is from here:

        SequenceFileLogWriter
        public void append(HLog.Entry entry) throws IOException {
          entry.setCompressionContext(compressionContext);
          try {
            this.writer.append(entry.getKey(), entry.getEdit());
          } catch (NullPointerException npe) {
            // Concurrent close...
            throw new IOException(npe);
          }
        }
        
        Ted Yu added a comment -

        How about this patch?

        It tries to guard against a closed this.writer.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567455/7728-v1.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestJoinedScanners

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4279//console

        This message is automatically generated.

        Lars Hofhansl added a comment -

        How are we getting into HLog.syncer() with the updateLock already held?

        Lars Hofhansl added a comment -

        Oh I see:
        rollWriter -> cleanupCurrentWriter -> sync() -> syncer()

        Lars Hofhansl added a comment -

        How about this? Avoids holding the updateLock for the entire duration of the cleanup.

        100% untested.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567480/7728-suggest.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4281//console

        This message is automatically generated.

        Lars Hofhansl added a comment -

        This was a 0.94 patch. Lemme make a trunk version. Thinking about it more, I am not sure it's correct. The updateLock is supposed to guard against a concurrent roll... Not sure it would still reliably do that.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567486/7728-suggest-0.96.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4283//console

        This message is automatically generated.

        Ted Yu added a comment -

        TestLogRollingNoCluster, the test added for concurrent log rolling and hlogFlush, failed.

        It would be nice if my patch were given a chance.

        Anoop Sam John added a comment - edited

        Ted

        synchronized (this.updateLock) {
          // HBASE-4387, HBASE-5623, retry with updateLock held
          tempWriter = this.writer;
-         logSyncerThread.hlogFlush(tempWriter, pending);
        }

        Here too the syncer thread is trying to get the updateLock while holding the flushLock.
        At the same time the roller thread holds the updateLock and tries to get the flushLock.
        This is what happened in this issue. So will this remove the deadlock?

        A log would greatly help us if there is something. Any chance, Wang Qiang?

        Lars Hofhansl added a comment -

        Of course Ted. That's why I named it ...suggest

        Ted Yu added a comment -

        Patch v2 tries to align the order of locking so that there is no deadlock.

        TestLogRollingNoCluster and TestHLog passed.
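
        For context, the general shape of such a fix (an illustration of consistent lock ordering only, not the actual contents of 7728-v2.txt) is to make every code path that needs both monitors acquire them in the same order that HLog#rollWriter() does, so the wait cycle from the description's jstack cannot form:

        // Illustration only -- NOT the HBase patch. With a single global order
        // (updateLock before flushLock) neither thread can end up holding the
        // monitor the other one is waiting for.
        public class ConsistentLockOrderSketch {
          private final Object updateLock = new Object();
          private final Object flushLock = new Object();

          void roll() {
            synchronized (updateLock) {
              synchronized (flushLock) {
                // clean up the old writer, install the new one
              }
            }
          }

          void syncRetry() {
            // Same order as roll(): updateLock first, then flushLock.
            synchronized (updateLock) {
              synchronized (flushLock) {
                // retry the flush against the current writer
              }
            }
          }
        }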

        Lars Hofhansl added a comment -

        Does it need the part where it sets the writer to null and the null check that follows?

        Anoop Sam John added a comment -

        Does it need the part where it sets the writer to null and the null check that follows?

        Maybe it is better for this to be there, IMHO.
        Ted Yu
        This patch looks like it solves the possible-deadlock issue, I guess.
        Can we log the IOE? Or is that not really needed?

        Lars Hofhansl added a comment -

        Don't we have to check for tempWriter == null during the first attempt to flush the log?

        Anoop Sam John added a comment -

        Don't we have to check for tempWriter == null during the first attempt to flush the log?

        Yes.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567512/7728-v2.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.client.TestAdmin
        org.apache.hadoop.hbase.master.TestZKBasedOpenCloseRegion

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4285//console

        This message is automatically generated.

        Ted Yu added a comment -

        Patch v3 adds an error log.
        There is a null check at the beginning of hlogFlush():

        void hlogFlush(Writer writer, List<Entry> pending) throws IOException {
          if (pending == null) return;
        
        Ted Yu added a comment -

        I ran the two tests locally with the patch and they passed:
        667 mt -Dtest=TestAdmin
        668 mt -Dtest=TestZKBasedOpenCloseRegion

        Anoop Sam John added a comment -

        Ted Yu, you mean this?

        synchronized (this.updateLock) {
          if (this.closed) return;
          tempWriter = this.writer; // guaranteed non-null
        }

        Yes, here we check the this.closed state. That guards well against tempWriter being null during the 1st flush attempt.
        I was wrong in the above comment that we need a null check there.

        But do we need a null check further down, where we do the sync() call on the writer?

        try {
          tempWriter.sync();
        } catch (IOException io) {
          synchronized (this.updateLock) {
            // HBASE-4387, HBASE-5623, retry with updateLock held
            tempWriter = this.writer;
            tempWriter.sync();
          }
        }

        Or, instead of null checks in all these places, can we do this.closed checks?

        Also, in the log can we note that we will retry the flush? Does the log level need to be ERROR, or is WARN fine? Maybe ERROR is also okay.

        Ted Yu added a comment -

        Patch v4 addresses Anoop's comments

        Anoop Sam John added a comment -

        Ted,
        In the last patch you removed the null check for the hlogFlush() retry?
        Now the check is there in one place, in sync(), at retry time.
        We need the null check for the 1st sync as well (the one without the updateLock).

        Ted Yu added a comment - edited

        TestLogRollingNoCluster and the *HLog* tests passed for the 0.94 patch.

        Ted Yu added a comment -

        In both 0.94 and trunk, we have:

        void hlogFlush(Writer writer, List<Entry> pending) throws IOException {
          if (pending == null) return;
        

        So the null check before calling hlogFlush() is not needed.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567517/7728-v3.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4286//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567526/7728-v4.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4287//console

        This message is automatically generated.

        stack added a comment -

        How do we know this fixes the deadlock? Can we manufacture it in a test bed and then see if this patch fixes it?

        Anoop Sam John added a comment -

        Ted Yu
        + this.writer = null;
        This addition in close() can make an NPE possible at the places where we try hlogFlush for the 2nd time and at the two places we call tempWriter.sync(), one without and one with the updateLock.
        I was talking about this null check. In your older patch you had added a null check on tempWriter before repeating hlogFlush().

        Anoop Sam John added a comment -

        stack: I checked with 2 threads and breakpoints, one doing the append and the other doing the roll. I was able to reproduce the deadlock, and the patch solves it. The patch makes sure the locking order is kept the same as in the roller.

        I think what happened in this issue is that the 1st call to hlogFlush() threw an IOE, but no roll had happened and no new writer had been created. The roller thread was still trying to cleanupCurrentWriter. So the IOE happened for some other reason. I don't think a parallel explicit close() call happened either; in close() we can see that it joins the log syncer thread and only then closes the writer.

        Ted Yu added a comment -

        Patch v4 adds a null check for tempWriter.sync().

        Lars Hofhansl added a comment -

        I am still not 100% sold that we need to set the writer to null upon close and add the null check everywhere in syncer().
        The change of the locking order should be enough. If not, please explain it to me.

        ramkrishna.s.vasudevan added a comment -

        Patch looks good. Nice fix.

        Ted Yu added a comment -

        The null check originated from the fact that Writer doesn't provide an isClosed() method:

        public interface Writer {
          void init(FileSystem fs, Path path, Configuration c) throws IOException;

          void close() throws IOException;

          void sync() throws IOException;

          void append(Entry entry) throws IOException;

          long getLength() throws IOException;
        }
        

        I would like to have a way of telling whether Writer is closed.
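
        For illustration only, one hypothetical way to expose that (the isClosed() method and the closed flag below are illustrative additions, not existing HBase API or part of any attached patch) would be for the writer implementation to track a volatile closed flag:

        // Hypothetical sketch only -- isClosed() does not exist on the real Writer
        // interface; this just shows how a writer could expose its closed state.
        public class ClosableWriterSketch {
          private volatile boolean closed = false;

          public void close() {
            // ... release underlying resources ...
            closed = true;
          }

          public boolean isClosed() {
            return closed;
          }

          public void sync() throws java.io.IOException {
            if (closed) {
              throw new java.io.IOException("writer already closed");
            }
            // ... flush to the underlying stream ...
          }
        }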

        Lars Hofhansl added a comment -

        The fact that the writer might be closed concurrently is exactly the reason why the flush and sync are retried with the updateLock held.
        The null-check confuses the issue, IMHO.

        Ted Yu added a comment -

        In the current formulation:

        tempWriter = this.writer;
        logSyncerThread.hlogFlush(tempWriter, pending);

        The null check is the first action in hlogFlush().

        try {
          tempWriter.sync();
        } catch (IOException ex) {
        

        It is expected that NullPointerException would be wrapped in IOException and handled by the catch clause.
        When tempWriter is null, the above assumption no longer works.
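
        As a minimal standalone illustration of that last point (not HBase code): when tempWriter itself is null, the NPE comes from the dereference and is never wrapped, so it bypasses the IOException catch clause entirely.

        // Not HBase code -- just shows why a null tempWriter defeats the
        // "NPE gets wrapped in IOException" assumption described above.
        public class NullWriterSketch {
          interface Writer { void sync() throws java.io.IOException; }

          public static void main(String[] args) {
            Writer tempWriter = null;
            try {
              tempWriter.sync();   // throws a raw NullPointerException here
            } catch (java.io.IOException ex) {
              System.out.println("never reached when tempWriter is null");
            }
          }
        }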

        Lars Hofhansl added a comment -

        That is why we should not change that. What I am saying is that if you access the writer inside the updateLock in syncer(), it will not be null.
        If we set it to null when closing, we might expose ourselves to other NPEs in other parts of the code. As seen from this issue, this code is finicky.

        The main point I am making is that the deadlock (this issue) is fixed by your reordering of the locks (nice change, btw), and that the null stuff is an extraneous change.
        We can add the null set/check to 0.96; for 0.94 I would prefer the minimum change possible (unless you really feel strongly about this).

        Ted Yu added a comment -

        I will trim down the 0.94 patch, attach it here and run related tests.

        Ted Yu added a comment - edited

        Here are the 0.94 tests I ran with the simplified patch:

        Running org.apache.hadoop.hbase.mapreduce.TestHLogRecordReader
        2013-02-01 13:04:25.210 java[78223:1203] Unable to load realm info from SCDynamicStore
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.247 sec
        Running org.apache.hadoop.hbase.master.cleaner.TestLogsCleaner
        2013-02-01 13:04:31.229 java[78233:1203] Unable to load realm info from SCDynamicStore
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.905 sec
        Running org.apache.hadoop.hbase.master.TestDistributedLogSplitting
        2013-02-01 13:04:34.321 java[78235:dd03] Unable to load realm info from SCDynamicStore
        Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 135.127 sec
        Running org.apache.hadoop.hbase.master.TestSplitLogManager
        2013-02-01 13:08:28.744 java[78334:1203] Unable to load realm info from SCDynamicStore
        Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 100.334 sec
        Running org.apache.hadoop.hbase.monitoring.TestMemoryBoundedLogMessageBuffer
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.224 sec
        Running org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
        Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.777 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestHLog
        2013-02-01 13:08:49.760 java[78345:1203] Unable to load realm info from SCDynamicStore
        Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 60.172 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestHLogBench
        2013-02-01 13:09:50.456 java[78397:1203] Unable to load realm info from SCDynamicStore
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.873 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestHLogMethods
        2013-02-01 13:09:51.531 java[78399:1203] Unable to load realm info from SCDynamicStore
        Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.798 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit
        2013-02-01 13:09:52.761 java[78402:1203] Unable to load realm info from SCDynamicStore
        Tests run: 30, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 242.269 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
        2013-02-01 13:13:55.600 java[78424:1203] Unable to load realm info from SCDynamicStore
        Tests run: 30, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 235.546 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestLogRollAbort
        2013-02-01 13:17:51.787 java[78446:1203] Unable to load realm info from SCDynamicStore
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 36.362 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
        2013-02-01 13:18:28.753 java[78473:1203] Unable to load realm info from SCDynamicStore
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 283.942 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster
        2013-02-01 13:23:13.084 java[78536:1203] Unable to load realm info from SCDynamicStore
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.838 sec
        Running org.apache.hadoop.hbase.TestFullLogReconstruction
        2013-02-01 13:23:15.325 java[78538:1203] Unable to load realm info from SCDynamicStore
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 64.557 sec
        Running org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster
        2013-02-01 13:24:32.933 java[78575:1203] Unable to load realm info from SCDynamicStore
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.283 sec
        

        They all passed.

        Ted Yu added a comment -

        Integrated to 0.94 and trunk.

        Thanks for the reviews, Lars, Stack, Anoop and Ram.

        Hide
        Lars Hofhansl added a comment -

        Posthumous +1

        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #816 (See https://builds.apache.org/job/HBase-0.94/816/)
        HBASE-7728 deadlock occurs between hlog roller and hlog syncer (Ted Yu) (Revision 1441637)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
        Hide
        stack added a comment -

        Looks like I just ran into this:

        Java stack information for the threads listed above:
        ===================================================
        "IPC Server handler 49 on 10304":
                at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:1121)
                - waiting to lock <0x0000000583d75420> (a java.lang.Object)
                at org.apache.hadoop.hbase.regionserver.wal.HLog.appendNoSync(HLog.java:1167)
                at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2212)
                at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:1960)
                at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3454)
                at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1400)
        "regionserver10304.logRoller":
                at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1302)
                - waiting to lock <0x0000000583d75438> (a java.lang.Object)
                at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1279)
                at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1432)
                at org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:865)
                at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:646)
                - locked <0x0000000583d75420> (a java.lang.Object)
                at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
                at java.lang.Thread.run(Thread.java:662)
        "IPC Server handler 46 on 10304":
                at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1312)
                - waiting to lock <0x0000000583d75420> (a java.lang.Object)
                - locked <0x0000000583d75438> (a java.lang.Object)
                at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1436)
                at org.apache.hadoop.hbase.regionserver.HRegion.syncOrDefer(HRegion.java:5176)
                at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2232)
                at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:1960)
                at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3454)
                at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
                at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1400)
        
        Found 1 deadlock.
        

        This is a 0.94.3 or so.

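        The two traces above reduce to a classic inverse lock ordering: the roller takes updateLock and then needs flushLock to sync pending edits, while the syncer takes flushLock and then retries under updateLock. As a minimal, standalone sketch of that ordering (the class and method bodies below are illustrative only; just the two lock names mirror HLog's fields):

        public class LockOrderSketch {
          private final Object updateLock = new Object();  // stands in for HLog.updateLock
          private final Object flushLock = new Object();   // stands in for HLog.flushLock

          // Roller path: holds updateLock, then needs flushLock to sync pending edits.
          void rollWriter() {
            synchronized (updateLock) {
              synchronized (flushLock) {
                // cleanupCurrentWriter() -> sync() would happen here
              }
            }
          }

          // Syncer path: holds flushLock, then retries the flush under updateLock.
          void syncer() {
            synchronized (flushLock) {
              synchronized (updateLock) {
                // HBASE-4387 / HBASE-5623 style retry would happen here
              }
            }
          }
        }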
        Hide
        Sergey Shelukhin added a comment -

        I have a meta question about retrying with the update lock held; I noticed this while working on something else earlier.
        If we fail to sync because the writer was being closed (and is replaced by a new writer?), why would we want to sync the new, unrelated writer?

        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #389 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/389/)
        HBASE-7728 deadlock occurs between hlog roller and hlog syncer (Ted Yu) (Revision 1441631)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
        Hide
        Lars Hofhansl added a comment -

        If the flush fails, we'd flush these edits into the next log file.
        For sync'ing you're right. It makes no sense to sync the new writer.

        Hide
        ramkrishna.s.vasudevan added a comment -

        For sync'ing you're right. It makes no sense to sync the new writer.

        Why do you feel so, Lars? Is it because it would get synced anyway in the subsequent sync calls?

        Hide
        Lars Hofhansl added a comment -

        I think the reasoning would go like this: once we get past the flush part, all edits have been flushed to the last log file. If we cannot sync that file, for example because the log was rolled, it does not help to sync another log file (that won't cause the prior file to be sync'ed).

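        A rough sketch of that reasoning, assuming a check on writer identity before the retry (this only illustrates the idea discussed here, not the committed patch; the types and field names are simplified stand-ins for HLog's):

        interface Writer { void sync() throws java.io.IOException; }

        class SyncRetrySketch {
          private final Object flushLock = new Object();
          private final Object updateLock = new Object();
          private volatile Writer writer;  // current WAL writer, swapped by the roller

          SyncRetrySketch(Writer initial) { this.writer = initial; }

          void syncer() throws java.io.IOException {
            Writer tempWriter;
            synchronized (flushLock) {
              tempWriter = this.writer;    // the writer the pending edits were flushed to
            }
            try {
              tempWriter.sync();
            } catch (java.io.IOException io) {
              synchronized (updateLock) {
                if (tempWriter == this.writer) {
                  tempWriter.sync();       // same file: a retry can still help
                } else {
                  throw io;                // log was rolled: syncing the new file would not
                                           // persist edits written to the old one
                }
              }
            }
          }
        }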
        Hide
        Anoop Sam John added a comment -

        So in such a case we need to clearly distinguish whether an exception during flush or sync is due to a log roll; right now the code does not make that distinction.
        In this defect's case an IOE also happened during the flush, but the log roll and the creation of the new writer had not happened by that time. (A roll was in progress, but it had not yet reached the point of creating the new writer.)
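        One conceivable way to make that distinction explicit (purely a hypothetical sketch; none of these names exist in HLog) would be to flag that a roll is in progress and classify flush/sync failures accordingly:

        class RollAwareSketch {
          private volatile boolean rollInProgress;

          void rollWriter() {
            rollInProgress = true;
            try {
              // close the old writer and install the new one
            } finally {
              rollInProgress = false;
            }
          }

          void handleFlushFailure(java.io.IOException io) throws java.io.IOException {
            if (rollInProgress) {
              // likely roll-induced: wait for the new writer and retry once
            } else {
              throw io;  // genuine I/O problem: surface it to the caller
            }
          }
        }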

        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/)
        HBASE-7728 deadlock occurs between hlog roller and hlog syncer (Ted Yu) (Revision 1441637)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
        Hide
        Jean-Marc Spaggiari added a comment -

        Has this been applied in 0.94? It seems that it hasn't; can anyone confirm? I looked at HLog and expected to see a "this.writer = null;" close to line 1010, but it's not there.

        Hide
        Harsh J added a comment -

        Per http://svn.apache.org/viewvc/hbase/tags/0.94.5RC1/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java?view=log and its contents, this is in the 0.94.5 release.
        Hide
        Jean-Marc Spaggiari added a comment -

        Yep, I just realized I had looked at the v2 version of the patch but should have looked at the simplified one.

        Thanks.


          People

          • Assignee: Ted Yu
          • Reporter: Wang Qiang
          • Votes: 0
          • Watchers: 16
