Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      not a blocker for 3.2, moving to 3.3

      Description

      There are two synchronized blocks locking on different objects, and to me they should be guarded by the same object. Here are the parts of the code I'm talking about:

      NIOServerCnxn.readRequest@444
      ...
                synchronized (this) {
                      outstandingRequests++;
                      // check throttling
                      if (zk.getInProcess() > factory.outstandingLimit) {
                          disableRecv();
                          // following lines should not be needed since we are already
                          // reading
                          // } else {
                          // enableRecv();
                      }
                  } 
      
      NIOServerCnxn.sendResponse@740
      ...
               synchronized (this.factory) {
                      outstandingRequests--;
                      // check throttling
                      if (zk.getInProcess() < factory.outstandingLimit
                              || outstandingRequests < 1) {
                          sk.selector().wakeup();
                          enableRecv();
                      }
                  }
      

      I think the second one is correct, and the first synchronized block should be guarded by "this.factory".

      This could be related to issue ZOOKEEPER-57, but I have no concrete indication that this is the case so far.

      1. zk-deadlock.png
        24 kB
        Todd Lipcon
      2. ZOOKEEPER-59.patch
        4 kB
        Flavio Junqueira
      3. ZOOKEEPER-59.patch
        4 kB
        Flavio Junqueira
      4. ZOOKEEPER-59.patch
        5 kB
        Flavio Junqueira

        Activity

        Hide
        Flavio Junqueira added a comment -

        Great catch, Todd! The issue you're pointing out seems fixed to me in trunk (future 3.3.0 release). I was comparing 3.2.2 code and trunk, and in readLength, when processing four letter words, we are now only cloning cnxns inside the synchronized block that locks cnxns, so we don't have a nested synchronized block for factory. I think it would be cool to check it again on our trunk code, though.

        Show
        Flavio Junqueira added a comment - Great catch, Todd! The issue you're pointing out seems fixed to me in trunk (future 3.3.0 release). I was comparing 3.2.2 code and trunk, and in readLength, when processing four letter words, we are now only cloning cnxns inside the synchronized block that locks cnxns, so we don't have a nested synchronized block for factory. I think it would be cool to check it again on our trunk code, though.
        Hide
        Todd Lipcon added a comment -

        Running jcarder on ZK 3.2.2 uncovers the deadlock in this picture. Is this patch likely to fix this problem for future versions, or is it something different?

        Show
        Todd Lipcon added a comment - Running jcarder on ZK 3.2.2 uncovers the deadlock in this picture. Is this patch likely to fix this problem for future versions, or is it something different?
        Hide
        Mahadev konar added a comment -

        I just committed this. thanks flavio.

        Show
        Mahadev konar added a comment - I just committed this. thanks flavio.
        Hide
        Benjamin Reed added a comment -

        +1 looks good

        Show
        Benjamin Reed added a comment - +1 looks good
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12438130/ZOOKEEPER-59.patch
        against trunk revision 919706.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/135/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/135/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12438130/ZOOKEEPER-59.patch against trunk revision 919706. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/135/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/135/console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Updating patch according to Ben's comment.

        Show
        Flavio Junqueira added a comment - Updating patch according to Ben's comment.
        Hide
        Benjamin Reed added a comment -

        ah yes, i see what you mean, but i think i would prefer a outstandingRequests that might be stale to a possible deadlock.

        i agree that this doesn't need a test. this is a latent bug that is hard to be reproduced and a test would be very specific to the implementation of the bug.

        Show
        Benjamin Reed added a comment - ah yes, i see what you mean, but i think i would prefer a outstandingRequests that might be stale to a possible deadlock. i agree that this doesn't need a test. this is a latent bug that is hard to be reproduced and a test would be very specific to the implementation of the bug.
        Hide
        Flavio Junqueira added a comment -

        As I already mentioned above, I don't think it requires a test. Ben, could you please check my previous post to see if you agree?

        Show
        Flavio Junqueira added a comment - As I already mentioned above, I don't think it requires a test. Ben, could you please check my previous post to see if you agree?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12437656/ZOOKEEPER-59.patch
        against trunk revision 919640.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/131/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/131/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/131/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437656/ZOOKEEPER-59.patch against trunk revision 919640. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/131/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/131/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/131/console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Good point, Ben. I haven't done it because of the predicate:

        outstandingRequests < 1
        

        in that block.

        Show
        Flavio Junqueira added a comment - Good point, Ben. I haven't done it because of the predicate: outstandingRequests < 1 in that block.
        Hide
        Benjamin Reed added a comment -

        i like how you removed the nested locking, but it looks like you missed an opportunity in the last block. don't you really want to

        synchronized(this) {outstandingRequests--;}
        

        so that you aren't locking this and this.factory?

        Show
        Benjamin Reed added a comment - i like how you removed the nested locking, but it looks like you missed an opportunity in the last block. don't you really want to synchronized(this) {outstandingRequests--;} so that you aren't locking this and this.factory?
        Hide
        Flavio Junqueira added a comment -

        Retrying Hudson.

        Show
        Flavio Junqueira added a comment - Retrying Hudson.
        Hide
        Flavio Junqueira added a comment -

        This patch fixes some synchronization blocks on NIOServerCnxn, and I believe it doesn't require a test, since it is not coming from the observation of bug during a run. I also think that reproducing an interleaving that can generate a problem would be extremely difficult, so my I suggest we don't have a test for it.

        Regarding the core test that failed, it is unrelated to this issue, and I have opened a separate jira for it ZOOKEEPER-684.

        Show
        Flavio Junqueira added a comment - This patch fixes some synchronization blocks on NIOServerCnxn, and I believe it doesn't require a test, since it is not coming from the observation of bug during a run. I also think that reproducing an interleaving that can generate a problem would be extremely difficult, so my I suggest we don't have a test for it. Regarding the core test that failed, it is unrelated to this issue, and I have opened a separate jira for it ZOOKEEPER-684 .
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12437656/ZOOKEEPER-59.patch
        against trunk revision 915956.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/125/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/125/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/125/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437656/ZOOKEEPER-59.patch against trunk revision 915956. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/125/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/125/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/125/console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        I have updated the patch in the following way:

        • I have changed synchronization blocks to reflect the discussion of this issue. In the trunk code, there is at least one instance increment outstandingRequests that is not synchronized with "this". I have also tried to make sure that all necessary blocks of code that update the SelectionKey are synchronized with factory.
        • I have not included the change of ZooKeeperServer that corresponds to making requestsInProcess an AtomicInteger. This is really a minor change, and not strictly necessary.

        Ben made this comment that the methods of SelectionKey are thread-safe. However, I believe the synchronization blocks are there to make the block of code atomic, and not because of the individual calls. As mentioned above, I have tried to make sure that all blocks are properly synchronized. I'm not exactly sure, though, why some of the blocks related to SelectionKey operations are synchronized. I have the impression that they have been introduced conservatively, but I would appreciate if someone with an opinion could shed some light here.

        Show
        Flavio Junqueira added a comment - I have updated the patch in the following way: I have changed synchronization blocks to reflect the discussion of this issue. In the trunk code, there is at least one instance increment outstandingRequests that is not synchronized with "this". I have also tried to make sure that all necessary blocks of code that update the SelectionKey are synchronized with factory. I have not included the change of ZooKeeperServer that corresponds to making requestsInProcess an AtomicInteger. This is really a minor change, and not strictly necessary. Ben made this comment that the methods of SelectionKey are thread-safe. However, I believe the synchronization blocks are there to make the block of code atomic, and not because of the individual calls. As mentioned above, I have tried to make sure that all blocks are properly synchronized. I'm not exactly sure, though, why some of the blocks related to SelectionKey operations are synchronized. I have the impression that they have been introduced conservatively, but I would appreciate if someone with an opinion could shed some light here.
        Hide
        Mahadev konar added a comment -

        flavio, will you be updating this patch?

        Show
        Mahadev konar added a comment - flavio, will you be updating this patch?
        Hide
        Patrick Hunt added a comment -

        Hi Flavio, what's the status of this patch?

        Show
        Patrick Hunt added a comment - Hi Flavio, what's the status of this patch?
        Hide
        Patrick Hunt added a comment -

        Whatever happened with this patch? Seems to have gotten dropped. re-investigate for 3.2

        Show
        Patrick Hunt added a comment - Whatever happened with this patch? Seems to have gotten dropped. re-investigate for 3.2
        Hide
        Flavio Junqueira added a comment -

        New patch, AtomicInteger problem fixed. It now passes all tests, but AsyncTest. AsyncTest is failing even on trunk in my computer, though.

        Show
        Flavio Junqueira added a comment - New patch, AtomicInteger problem fixed. It now passes all tests, but AsyncTest. AsyncTest is failing even on trunk in my computer, though.
        Hide
        Patrick Hunt added a comment -

        Cancelling this patch due to test failure - the vm itself is crashing. Flavio, do the tests pass for you with this patch applied?

        I'm seeing the following after applying this patch - the mainline/HEAD code is testing fine. This is ubuntu with java 1.6.0_06.

        junit.run:
        [junit] Running org.apache.zookeeper.server.DeserializationPerfTest
        [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 15.073 sec
        [junit] Running org.apache.zookeeper.server.SerializationPerfTest
        [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 12.307 sec
        [junit] Running org.apache.zookeeper.server.ZooKeeperServerTest
        [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.141 sec
        [junit] Running org.apache.zookeeper.test.AsyncTest
        [junit] Tests run: 2, Failures: 2, Errors: 0, Time elapsed: 108.481 sec
        [junit] Test org.apache.zookeeper.test.AsyncTest FAILED
        [junit] Running org.apache.zookeeper.test.ClientTest
        [junit] Running org.apache.zookeeper.test.ClientTest
        [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
        [junit] Test org.apache.zookeeper.test.ClientTest FAILED (crashed)
        [junit] Running org.apache.zookeeper.test.DataTreeTest
        [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.15 sec
        [junit] Running org.apache.zookeeper.test.OOMTest
        [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.022 sec
        [junit] Running org.apache.zookeeper.test.QuorumTest
        [junit] Tests run: 5, Failures: 5, Errors: 0, Time elapsed: 201.057 sec
        [junit] Test org.apache.zookeeper.test.QuorumTest FAILED
        [junit] Running org.apache.zookeeper.test.RecoveryTest
        [junit] Running org.apache.zookeeper.test.RecoveryTest
        [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
        [junit] Test org.apache.zookeeper.test.RecoveryTest FAILED (crashed)
        [junit] Running org.apache.zookeeper.test.SessionTest
        [junit] Running org.apache.zookeeper.test.SessionTest
        [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
        [junit] Test org.apache.zookeeper.test.SessionTest FAILED (crashed)
        [junit] Running org.apache.zookeeper.test.WatcherFuncTest
        [junit] Running org.apache.zookeeper.test.WatcherFuncTest
        [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
        [junit] Test org.apache.zookeeper.test.WatcherFuncTest FAILED (crashed)

        BUILD FAILED
        /home/phunt/dev/Workspaces/gitzk/build.xml:370: Tests failed!

        Show
        Patrick Hunt added a comment - Cancelling this patch due to test failure - the vm itself is crashing. Flavio, do the tests pass for you with this patch applied? I'm seeing the following after applying this patch - the mainline/HEAD code is testing fine. This is ubuntu with java 1.6.0_06. junit.run: [junit] Running org.apache.zookeeper.server.DeserializationPerfTest [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 15.073 sec [junit] Running org.apache.zookeeper.server.SerializationPerfTest [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 12.307 sec [junit] Running org.apache.zookeeper.server.ZooKeeperServerTest [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.141 sec [junit] Running org.apache.zookeeper.test.AsyncTest [junit] Tests run: 2, Failures: 2, Errors: 0, Time elapsed: 108.481 sec [junit] Test org.apache.zookeeper.test.AsyncTest FAILED [junit] Running org.apache.zookeeper.test.ClientTest [junit] Running org.apache.zookeeper.test.ClientTest [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec [junit] Test org.apache.zookeeper.test.ClientTest FAILED (crashed) [junit] Running org.apache.zookeeper.test.DataTreeTest [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.15 sec [junit] Running org.apache.zookeeper.test.OOMTest [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.022 sec [junit] Running org.apache.zookeeper.test.QuorumTest [junit] Tests run: 5, Failures: 5, Errors: 0, Time elapsed: 201.057 sec [junit] Test org.apache.zookeeper.test.QuorumTest FAILED [junit] Running org.apache.zookeeper.test.RecoveryTest [junit] Running org.apache.zookeeper.test.RecoveryTest [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec [junit] Test org.apache.zookeeper.test.RecoveryTest FAILED (crashed) [junit] Running org.apache.zookeeper.test.SessionTest [junit] Running org.apache.zookeeper.test.SessionTest [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec [junit] Test org.apache.zookeeper.test.SessionTest FAILED (crashed) [junit] Running org.apache.zookeeper.test.WatcherFuncTest [junit] Running org.apache.zookeeper.test.WatcherFuncTest [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec [junit] Test org.apache.zookeeper.test.WatcherFuncTest FAILED (crashed) BUILD FAILED /home/phunt/dev/Workspaces/gitzk/build.xml:370: Tests failed!
        Hide
        Flavio Junqueira added a comment -

        Have done what Pat requested...

        Show
        Flavio Junqueira added a comment - Have done what Pat requested...
        Hide
        Flavio Junqueira added a comment -

        Deleting old patches, and adding a new one that addresses all discussed points.

        Show
        Flavio Junqueira added a comment - Deleting old patches, and adding a new one that addresses all discussed points.
        Hide
        Patrick Hunt added a comment -

        Looks like patch _2 replaces _1? Canceling the patch for now. Please resubmit after addressing.

        See http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute, remember that we are replacing (re-attaching with the same name) the patch.

        Show
        Patrick Hunt added a comment - Looks like patch _2 replaces _1? Canceling the patch for now. Please resubmit after addressing. See http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute , remember that we are replacing (re-attaching with the same name) the patch.
        Hide
        Flavio Junqueira added a comment -

        Mahadev, Thanks for fixing the patch. My eclipse still hasn't leaned how to generate diffs appropriately. Your assessment of the patch is right, and the observation on "outstandingRequests" seems correct to me.

        Andrew, I agree with you on getInProcess(). It is probably best to make it synchronized or simply to use an atomic integer for requestsInProcess. Now, on your description of the connection problem, I think this is an important observation because when I observed the problem, the log messages did not show the server receiving anything. Also, if the server receives and we don't disable sending, then either ping requests are not going through the pipeline of request processors (once it gets to FinalRequestProcessor, the server sends a response immediately) or sendResponse is not doing its job.

        Show
        Flavio Junqueira added a comment - Mahadev, Thanks for fixing the patch. My eclipse still hasn't leaned how to generate diffs appropriately. Your assessment of the patch is right, and the observation on "outstandingRequests" seems correct to me. Andrew, I agree with you on getInProcess(). It is probably best to make it synchronized or simply to use an atomic integer for requestsInProcess. Now, on your description of the connection problem, I think this is an important observation because when I observed the problem, the log messages did not show the server receiving anything. Also, if the server receives and we don't disable sending, then either ping requests are not going through the pipeline of request processors (once it gets to FinalRequestProcessor, the server sends a response immediately) or sendResponse is not doing its job.
        Hide
        Andrew Kornev added a comment -

        I agree with Flavio's comment in https://issues.apache.org/jira/browse/ZOOKEEPER-59?focusedCommentId=12611527#action_12611527 : incInProcess() should be called prior to calling first request processor.

        Also, getInProcess() should be a synchronized method. Even better solution would be to make requestsInProcess an atomic integer: there is really no need to synchronize on the ZooKeeper instance to update/access the variable.

        Having said that, I don't think this will fix the the timeout problem. From what I saw in the server logs during my investigation of a similar incident was that the server was still receiving and processing client requests (in other words, the counters we're trying to fix here didn't cause throttling). The problem was that the responses weren't being sent to the client which was causing the client to timeout.

        Show
        Andrew Kornev added a comment - I agree with Flavio's comment in https://issues.apache.org/jira/browse/ZOOKEEPER-59?focusedCommentId=12611527#action_12611527 : incInProcess() should be called prior to calling first request processor. Also, getInProcess() should be a synchronized method. Even better solution would be to make requestsInProcess an atomic integer: there is really no need to synchronize on the ZooKeeper instance to update/access the variable. Having said that, I don't think this will fix the the timeout problem. From what I saw in the server logs during my investigation of a similar incident was that the server was still receiving and processing client requests (in other words, the counters we're trying to fix here didn't cause throttling). The problem was that the responses weren't being sent to the client which was causing the client to timeout.
        Hide
        Mahadev konar added a comment -

        since outstandingrequests does nto control the enable and disable receive then this is just a patch to fix the counters right? This cannot be responsible for connection loss right?

        also – just to understand the patch —

        it makes three changes —

        1) remove synchronized block around disable recv
        2) consistent guard arnd incr/decr outstandingrequests
        3) removed synchronization around sk ops in doIO()

        is that right?

        Show
        Mahadev konar added a comment - since outstandingrequests does nto control the enable and disable receive then this is just a patch to fix the counters right? This cannot be responsible for connection loss right? also – just to understand the patch — it makes three changes — 1) remove synchronized block around disable recv 2) consistent guard arnd incr/decr outstandingrequests 3) removed synchronization around sk ops in doIO() is that right?
        Hide
        Mahadev konar added a comment -

        the patch wasnt filed from trunk... just attaching a patch that is svn diffed from the trunk .

        Show
        Mahadev konar added a comment - the patch wasnt filed from trunk... just attaching a patch that is svn diffed from the trunk .
        Hide
        Flavio Junqueira added a comment -

        It turns out that outstandingRequests is not the variable controlling the throttling, but ZooKeeperServer.requestsInProcess. It seems to me that we should do the increment before calling the first request processor in the following code block:

                  if (validpacket) {
                        firstProcessor.processRequest(si);
                        if (cnxn != null) {
                            incInProcess();
                        }
                    }
        

        This is in ZooKeeperServer@870.

        Show
        Flavio Junqueira added a comment - It turns out that outstandingRequests is not the variable controlling the throttling, but ZooKeeperServer.requestsInProcess. It seems to me that we should do the increment before calling the first request processor in the following code block: if (validpacket) { firstProcessor.processRequest(si); if (cnxn != null) { incInProcess(); } } This is in ZooKeeperServer@870.
        Hide
        Flavio Junqueira added a comment -

        Addresses the concurrency problems discussed.

        Show
        Flavio Junqueira added a comment - Addresses the concurrency problems discussed.
        Hide
        Benjamin Reed added a comment -

        There are actually two things to guard here: the outstandingBuffers and the selector loop.

        In read request, we really just need:

        synchronized (this)

        { outstandingRequests++; }

        In sendResponse we need to synchronize on this for outstandingRequests and on factory for the selector:

        synchronized (this)

        { outstandingRequests--; }

        synchronized (this.factory) {
        // check throttling
        if (zk.getInProcess() < factory.outstandingLimit

        outstandingRequests < 1) { sk.selector().wakeup(); enableRecv(); }

        }

        I don't think the synchronize block is needed in doIO. The SectionKey methods are supposed to be thread safe.

        I think you might have found something! If the outstandingRequests count gets messed up, bad things will happen including the server becoming unresponsive to a client.

        Show
        Benjamin Reed added a comment - There are actually two things to guard here: the outstandingBuffers and the selector loop. In read request, we really just need: synchronized (this) { outstandingRequests++; } In sendResponse we need to synchronize on this for outstandingRequests and on factory for the selector: synchronized (this) { outstandingRequests--; } synchronized (this.factory) { // check throttling if (zk.getInProcess() < factory.outstandingLimit outstandingRequests < 1) { sk.selector().wakeup(); enableRecv(); } } I don't think the synchronize block is needed in doIO. The SectionKey methods are supposed to be thread safe. I think you might have found something! If the outstandingRequests count gets messed up, bad things will happen including the server becoming unresponsive to a client.
        Hide
        Flavio Junqueira added a comment -

        The following block might also have the wrong guard:

        NIOServerCnxn.doIO@379
        synchronized (this) {
                            if (outgoingBuffers.size() == 0) {
                                if (!initialized
                                        && (sk.interestOps() & SelectionKey.OP_READ) == 0) {
                                    throw new IOException("Responded to info probe");
                                }
                                sk.interestOps(sk.interestOps()
                                        & (~SelectionKey.OP_WRITE));
                            } else {
                                sk.interestOps(sk.interestOps()
                                                | SelectionKey.OP_WRITE);
                            }
                        }
        
        Show
        Flavio Junqueira added a comment - The following block might also have the wrong guard: NIOServerCnxn.doIO@379 synchronized (this) { if (outgoingBuffers.size() == 0) { if (!initialized && (sk.interestOps() & SelectionKey.OP_READ) == 0) { throw new IOException("Responded to info probe"); } sk.interestOps(sk.interestOps() & (~SelectionKey.OP_WRITE)); } else { sk.interestOps(sk.interestOps() | SelectionKey.OP_WRITE); } }

          People

          • Assignee:
            Flavio Junqueira
            Reporter:
            Flavio Junqueira
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development