Hadoop Common
  1. Hadoop Common
  2. HADOOP-5859

FindBugs : fix "wait() or sleep() with locks held" warnings in hdfs

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This JIRA fixes the following warnings:

      SWL org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() calls Thread.sleep() with a lock held
      TLW wait() with two locks held in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal()
      TLW wait() with two locks held in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal()
      TLW wait() with two locks held in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(byte[], int, int, byte[])

      1. 5859-41.patch
        63 kB
        Kan Zhang
      2. 5859-40.patch
        63 kB
        Kan Zhang
      3. 5859-38.patch
        63 kB
        Kan Zhang
      4. 5859-36.patch
        63 kB
        Kan Zhang
      5. 5859-35.patch
        63 kB
        Kan Zhang
      6. 5859-33.patch
        63 kB
        Kan Zhang
      7. 5859-26.patch
        63 kB
        Kan Zhang
      8. 5859-22.patch
        62 kB
        Kan Zhang
      9. 5859-21.patch
        62 kB
        Kan Zhang
      10. 5859-8.patch
        3 kB
        Kan Zhang
      11. 5859-5.patch
        7 kB
        Kan Zhang
      12. 5859-4.patch
        2 kB
        Kan Zhang

        Issue Links

          Activity

          Hide
          Kan Zhang added a comment -

          attached 2 versions of the same patch. one with correct indentation, one without, for easier reviewing.

          Show
          Kan Zhang added a comment - attached 2 versions of the same patch. one with correct indentation, one without, for easier reviewing.
          Hide
          Kan Zhang added a comment -

          added a new patch that refactored DFSClient.DFSOutputStream code.
          1) socket write operation in DataStreamer is now done without holding dataQueue lock
          2) dataQueue lock and ackQueue lock are merged into one
          3) removed the above findbugs warnings

          Show
          Kan Zhang added a comment - added a new patch that refactored DFSClient.DFSOutputStream code. 1) socket write operation in DataStreamer is now done without holding dataQueue lock 2) dataQueue lock and ackQueue lock are merged into one 3) removed the above findbugs warnings
          Hide
          dhruba borthakur added a comment -

          > 2) dataQueue lock and ackQueue lock are merged into one

          Is this done to make the code simpler? Or was there some race condition that this fixes?

          Show
          dhruba borthakur added a comment - > 2) dataQueue lock and ackQueue lock are merged into one Is this done to make the code simpler? Or was there some race condition that this fixes?
          Hide
          Kan Zhang added a comment -

          > Is this done to make the code simpler? Or was there some race condition that this fixes?

          I don't see any race condition so far. One potential case is calling ackQueue.size() in writeChunk() without holding ackQueue lock, which is ok assuming int assignments are atomic. The major reasons for refactoring the locks are 1) existing code is bit confusing. Sometimes the dataQueue lock is used to protect accesses to not only dataQueue but also currentPacket, blockStream, blockReplyStream, etc (see closeInternal() for example), which I think is a misuse. 2) I have moved time-consuming operations like RPC call or writing to socket out of blocks sync'ed on the dataQueue. So now the only times one needs to hold dataQueue or ackQueue lock is when manipulating the queues themselves. By merging the 2 locks, code can be simpler.

          Show
          Kan Zhang added a comment - > Is this done to make the code simpler? Or was there some race condition that this fixes? I don't see any race condition so far. One potential case is calling ackQueue.size() in writeChunk() without holding ackQueue lock, which is ok assuming int assignments are atomic. The major reasons for refactoring the locks are 1) existing code is bit confusing. Sometimes the dataQueue lock is used to protect accesses to not only dataQueue but also currentPacket, blockStream, blockReplyStream, etc (see closeInternal() for example), which I think is a misuse. 2) I have moved time-consuming operations like RPC call or writing to socket out of blocks sync'ed on the dataQueue. So now the only times one needs to hold dataQueue or ackQueue lock is when manipulating the queues themselves. By merging the 2 locks, code can be simpler.
          Hide
          Todd Lipcon added a comment -

          which is ok assuming int assignments are atomic

          I haven't had a chance to look through the patch, but in general this is not a safe assumption. Although int assignment itself is atomic, if you don't do access inside a synchronization block, there's no enforcement of a memory barrier between threads, so one thread might not see the other thread's assignment. Also, if someone later decides that size should be a long, long assignment is not atomic on some JVM platforms. Thus the existence of java.util.concurrent.atomic.AtomicInteger.

          Show
          Todd Lipcon added a comment - which is ok assuming int assignments are atomic I haven't had a chance to look through the patch, but in general this is not a safe assumption. Although int assignment itself is atomic, if you don't do access inside a synchronization block, there's no enforcement of a memory barrier between threads, so one thread might not see the other thread's assignment. Also, if someone later decides that size should be a long, long assignment is not atomic on some JVM platforms. Thus the existence of java.util.concurrent.atomic.AtomicInteger.
          Hide
          dhruba borthakur added a comment -

          > so one thread might not see the other thread's assignment.

          Unless the variable is declared as "volatile", right?

          Show
          dhruba borthakur added a comment - > so one thread might not see the other thread's assignment. Unless the variable is declared as "volatile", right?
          Hide
          Todd Lipcon added a comment -

          Unless the variable is declared as "volatile", right?

          Right. volatile longs, though, are still not atomically assigned on 32-bit platforms, so I think it's asking for trouble to use the atomic assignment of volatile ints to your advantage, since someone may co

          Show
          Todd Lipcon added a comment - Unless the variable is declared as "volatile", right? Right. volatile longs, though, are still not atomically assigned on 32-bit platforms, so I think it's asking for trouble to use the atomic assignment of volatile ints to your advantage, since someone may co
          Hide
          Todd Lipcon added a comment -

          er... trackpad malfunction. sorry...
          continuing:
          someone may come along at a later date and replace the int with a long thinking it's a trivial change that couldn't possibly break anything.

          Show
          Todd Lipcon added a comment - er... trackpad malfunction. sorry... continuing: someone may come along at a later date and replace the int with a long thinking it's a trivial change that couldn't possibly break anything.
          Hide
          dhruba borthakur added a comment -

          From the section 17.7 of the Java Language Specs: "Writes and reads of volatile long and double values are always atomic. ". I think all JVMs should ensure that all "volatile longs" are atomic.

          Show
          dhruba borthakur added a comment - From the section 17.7 of the Java Language Specs: "Writes and reads of volatile long and double values are always atomic. ". I think all JVMs should ensure that all "volatile longs" are atomic.
          Hide
          Kan Zhang added a comment - - edited

          > Unless the variable is declared as "volatile", right?

          I just checked, the JDK6 implementation of LinkedList didn't declare the "size" variable volatile. So Todd is right, a second thread may see a stale cached value. In any case, this is a non-issue after my refactoring. In the refactored code, all accesses to dataQueue or ackQueue are sync'ed on dataQueue lock.

          Show
          Kan Zhang added a comment - - edited > Unless the variable is declared as "volatile", right? I just checked, the JDK6 implementation of LinkedList didn't declare the "size" variable volatile. So Todd is right, a second thread may see a stale cached value. In any case, this is a non-issue after my refactoring. In the refactored code, all accesses to dataQueue or ackQueue are sync'ed on dataQueue lock.
          Hide
          Todd Lipcon added a comment -

          Hm, thanks Dhruba - I was recalling incorrectly. The discussion is getting way off-topic to this JIRA, but if people are interested, I highly recommend this video lecture: http://video.google.com/videoplay?docid=8394326369005388010

          Show
          Todd Lipcon added a comment - Hm, thanks Dhruba - I was recalling incorrectly. The discussion is getting way off-topic to this JIRA, but if people are interested, I highly recommend this video lecture: http://video.google.com/videoplay?docid=8394326369005388010
          Hide
          Suresh Srinivas added a comment -

          Comments on the latest patch:

          1. Can you add a comment where dataQueue and ackQueue are declared, that both these data structures are synchronized on dataQueue object
          2. Can you retain the code formatting where you think the existing code is fine. That makes the patch size smaller and applying patches to the changed code easier.
          3. Move static variable to the beginning of the class DFSOutputStream
          4. maxRecoveryErrorCount should become static
          5. Do we need so many volatile variables. Can one thread flag close and the DataStreamer does all the closing of internal variables
          6. As we discussed, in ResponseProcessor.run() add notify when exception is caught
          Show
          Suresh Srinivas added a comment - Comments on the latest patch: Can you add a comment where dataQueue and ackQueue are declared, that both these data structures are synchronized on dataQueue object Can you retain the code formatting where you think the existing code is fine. That makes the patch size smaller and applying patches to the changed code easier. Move static variable to the beginning of the class DFSOutputStream maxRecoveryErrorCount should become static Do we need so many volatile variables. Can one thread flag close and the DataStreamer does all the closing of internal variables As we discussed, in ResponseProcessor.run() add notify when exception is caught
          Hide
          Kan Zhang added a comment -

          > As we discussed, in ResponseProcessor.run() add notify when exception is caught

          As we discussed again, it's better to avoid adding unnecessary notifications, but rather add a timeout to dataQueue.wait(). Other comments are addressed in the new patch. Thanks.

          Show
          Kan Zhang added a comment - > As we discussed, in ResponseProcessor.run() add notify when exception is caught As we discussed again, it's better to avoid adding unnecessary notifications, but rather add a timeout to dataQueue.wait(). Other comments are addressed in the new patch. Thanks.
          Hide
          Kan Zhang added a comment -

          > I just checked, the JDK6 implementation of LinkedList didn't declare the "size" variable volatile. So Todd is right, a second thread may see a stale cached value

          I take it back. Actually, in this particular case the writeChunk() thread will be able to see the fresh value of ackQueue size whenever it's been modified. The reason is that writeChunk() reads ackQueue size while holding the dataQueue lock and waits on the dataQueue lock in a loop when condition is not satisfied. There are 2 threads that may modify ackQueue size. One is DataStreamer, who always holds both dataQueue and ackQueue locks when modifying ackQueue. So any changes to ackQueue by DataStreamer will be visible to writeChunk() thread. The other thread that modifies ackQueue is ResponseProcessor. It may modify ackQueue while holding only the ackQueue lock. However, immediately after ackQueue modification, it grabs the dataQueue lock and calls dataQueue.notifyAll() to wake up the writeChunk() thread. So when the writeChunk() thread grabs the dataQueue lock afterwards, the fresh value of ackQueue size will be visible. Make sense?

          Show
          Kan Zhang added a comment - > I just checked, the JDK6 implementation of LinkedList didn't declare the "size" variable volatile. So Todd is right, a second thread may see a stale cached value I take it back. Actually, in this particular case the writeChunk() thread will be able to see the fresh value of ackQueue size whenever it's been modified. The reason is that writeChunk() reads ackQueue size while holding the dataQueue lock and waits on the dataQueue lock in a loop when condition is not satisfied. There are 2 threads that may modify ackQueue size. One is DataStreamer, who always holds both dataQueue and ackQueue locks when modifying ackQueue. So any changes to ackQueue by DataStreamer will be visible to writeChunk() thread. The other thread that modifies ackQueue is ResponseProcessor. It may modify ackQueue while holding only the ackQueue lock. However, immediately after ackQueue modification, it grabs the dataQueue lock and calls dataQueue.notifyAll() to wake up the writeChunk() thread. So when the writeChunk() thread grabs the dataQueue lock afterwards, the fresh value of ackQueue size will be visible. Make sense?
          Hide
          Suresh Srinivas added a comment -
          1. Remove commented line setting maxPacket to 1000
          2. DataStreamer.run() No need to set response to null, because closeResponse() called prior, sets the response to null
          3. DataStreamer.run() code could be refactor to use closeStream() method instead of duplicating the code to close blockStream. A flag could be passed to indicate indicate end-of-block.
          4. ResponseProcessor.processDataNodeError() could use closeStream method.
          Show
          Suresh Srinivas added a comment - Remove commented line setting maxPacket to 1000 DataStreamer.run() No need to set response to null, because closeResponse() called prior, sets the response to null DataStreamer.run() code could be refactor to use closeStream() method instead of duplicating the code to close blockStream . A flag could be passed to indicate indicate end-of-block. ResponseProcessor.processDataNodeError() could use closeStream method.
          Hide
          Kan Zhang added a comment -

          thanks, Suresh. Attached a new patch that addressed your comments.

          Show
          Kan Zhang added a comment - thanks, Suresh. Attached a new patch that addressed your comments.
          Hide
          Kan Zhang added a comment -

          added a new patch with some refactoring of closeStream().

          Show
          Kan Zhang added a comment - added a new patch with some refactoring of closeStream().
          Hide
          Suresh Srinivas added a comment -

          +1. Looks good.

          Show
          Suresh Srinivas added a comment - +1. Looks good.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409409/5859-35.patch
          against trunk revision 780114.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409409/5859-35.patch against trunk revision 780114. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/429/console This message is automatically generated.
          Hide
          Kan Zhang added a comment -

          submit again to test the newest patch - 5859-36.patch.

          Show
          Kan Zhang added a comment - submit again to test the newest patch - 5859-36.patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409430/5859-36.patch
          against trunk revision 780465.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409430/5859-36.patch against trunk revision 780465. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/442/console This message is automatically generated.
          Hide
          Kan Zhang added a comment -

          After discussing with Raghu, I'm uploading a new patch where a new flag is added so that end-of-block is sent only when all data has been sent and ack'ed from datanode.

          This patch doesn't include any new test since it's only a refactoring and no new feature is added.

          Show
          Kan Zhang added a comment - After discussing with Raghu, I'm uploading a new patch where a new flag is added so that end-of-block is sent only when all data has been sent and ack'ed from datanode. This patch doesn't include any new test since it's only a refactoring and no new feature is added.
          Hide
          Suresh Srinivas added a comment -
          1. DataStreamer.nextBlockOutputStream() unnecessary code setting nodes to null (I see a Dead store to nodes findbugs warnings).
          2. nit: You can get rid of parameter to DataStreamer.closeStream() and use DataStreamer.sendEndOfBlock instead. Simplifies understanding the cases when end of block is sent.
          3. nit: DataStreamer.close() - it is a good idea to move printing the log of sending end of block to DataStreamer.closeStreamer() method.
          4. nit: Move the comments added for DataStreamer.run() to the beginning of DataStreamer class.
          5. nit: Capitalize MaxRecoveryErrorCount
          Show
          Suresh Srinivas added a comment - DataStreamer.nextBlockOutputStream() unnecessary code setting nodes to null (I see a Dead store to nodes findbugs warnings). nit: You can get rid of parameter to DataStreamer.closeStream() and use DataStreamer.sendEndOfBlock instead. Simplifies understanding the cases when end of block is sent. nit: DataStreamer.close() - it is a good idea to move printing the log of sending end of block to DataStreamer.closeStreamer() method. nit: Move the comments added for DataStreamer.run() to the beginning of DataStreamer class. nit: Capitalize MaxRecoveryErrorCount
          Hide
          Kan Zhang added a comment -

          uploading a new patch, which doesn't send end-of-block when closing (end-of-block was sent unnecessarily in the original code). also addressed Suresh's earlier comments.

          Show
          Kan Zhang added a comment - uploading a new patch, which doesn't send end-of-block when closing (end-of-block was sent unnecessarily in the original code). also addressed Suresh's earlier comments.
          Hide
          Kan Zhang added a comment -

          trivial updates.

          Show
          Kan Zhang added a comment - trivial updates.
          Hide
          Suresh Srinivas added a comment -

          +1 with all the tests passing

          Show
          Suresh Srinivas added a comment - +1 with all the tests passing
          Hide
          Kan Zhang added a comment -

          my local hdfs unit tests passed. here is the test-patch result.

               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
               [exec]                         Please justify why no tests are needed for this patch.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          
          Show
          Kan Zhang added a comment - my local hdfs unit tests passed. here is the test-patch result. [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409796/5859-41.patch
          against trunk revision 781683.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409796/5859-41.patch against trunk revision 781683. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/461/console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Kan!

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed this. Thanks, Kan!
          Hide
          Luke Forehand added a comment -

          Is this patch applied to 0.20.1 by chance? We are experiencing this issue with regularity on a 3 node cluster attempting to load 15GB of data.

          Show
          Luke Forehand added a comment - Is this patch applied to 0.20.1 by chance? We are experiencing this issue with regularity on a 3 node cluster attempting to load 15GB of data.
          Hide
          Suresh Srinivas added a comment -

          This patch is not applied to 0.20.1

          Show
          Suresh Srinivas added a comment - This patch is not applied to 0.20.1
          Hide
          Todd Lipcon added a comment -

          Note that this patch doesn't address the root issue in most cases (I've seen confusion from several users on this front).

          Usually the root issue here is that a thread gets stuck in processDatanodeError, particularly the RPC to the recovering DN. I've seen this in a number of situations, usually due to other IPC bugs that cause hung clients in various error scenarios, and sometimes due to bugs in recovery on the DN side.

          So, if you see a stack trace that looks like this issue, it's probably a symptom of some other bug, and this patch would just rejigger the stack traces, but not actually solve the underlying problem.

          Luke: do you have the stack trace from the DataStreamer thread? Is it stuck in an IPC call? If so, do you have the jstack from the recovery datanode?

          Show
          Todd Lipcon added a comment - Note that this patch doesn't address the root issue in most cases (I've seen confusion from several users on this front). Usually the root issue here is that a thread gets stuck in processDatanodeError, particularly the RPC to the recovering DN. I've seen this in a number of situations, usually due to other IPC bugs that cause hung clients in various error scenarios, and sometimes due to bugs in recovery on the DN side. So, if you see a stack trace that looks like this issue, it's probably a symptom of some other bug, and this patch would just rejigger the stack traces, but not actually solve the underlying problem. Luke: do you have the stack trace from the DataStreamer thread? Is it stuck in an IPC call? If so, do you have the jstack from the recovery datanode?

            People

            • Assignee:
              Kan Zhang
              Reporter:
              Kan Zhang
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development