Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-3072

Race condition in throttling

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0, 3.5.1, 3.5.2, 3.5.3, 3.5.4
    • Fix Version/s: 3.5.4, 3.6.0
    • Component/s: server

      Description

      There is a race condition in the server throttling code. It is possible that the disableRecv is called after enableRecv.

      Basically, the I/O work thread does this in processPacket: https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L1102 

                      submitRequest(si);

                  }

              }

              cnxn.incrOutstandingRequests(h);

          }

       

      incrOutstandingRequests() checks for limit breach, and potentially turns on throttling, https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java#L384

       

      submitRequest() will create a logical request and en-queue it so that Processor thread can pick it up. After being de-queued by Processor thread, it does necessary handling, and then calls this https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java#L459 :

       

                  cnxn.sendResponse(hdr, rsp, "response");

       

      and in sendResponse(), it first appends to outgoing buffer, and then checks if un-throttle is needed:  https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java#L708

       

      However, if there is a context switch between submitRequest() and cnxn.incrOutstandingRequests(), so that Processor thread completes cnxn.sendResponse() call before I/O thread switches back, then enableRecv() will happen before disableRecv(), and enableRecv() will fail the CAS ops, while disableRecv() will succeed, resulting in a deadlock: un-throttle is needed for letting in requests, and sendResponse is needed to trigger un-throttle, but sendResponse() requires an incoming message. From that point on, ZK server will no longer select the affected client socket for read, leading to the observed client-side failure in the subject.

      If you would like to reproduce this than setting the globalOutstandingLimit down to 1 makes this reproducible easier as throttling starts with less requests. 

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                botond.hejj Botond Hejj
                Reporter:
                botond.hejj Botond Hejj
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 40m
                  2h 40m