XMLRPC-168

XML-RPC server deadlocks under heavy load

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.2
    • Fix Version/s: 3.1.3
    • Component/s: Source
    • Labels:
      None
    • Environment:
      Solaris

      Description

      When running an XML-RPC server under heavy load, it eventually deadlocks inside the thread pool that manages the 'worker' threads which handle the individual XML-RPC requests - the classes involved are org.apache.xmlrpc.util.ThreadPool and org.apache.xmlrpc.util.ThreadPool$Poolable. jstack on the hung process shows:

      ----------
      Found one Java-level deadlock:
      =============================
      "XML-RPC-13":
      waiting to lock monitor 0x08d10bec (object 0xbb6959c0, a org.apache.xmlrpc.util.ThreadPool),
      which is held by "XML-RPC Weblistener"
      "XML-RPC Weblistener":
      waiting to lock monitor 0x08d1186c (object 0xbd2ed340, a org.apache.xmlrpc.util.ThreadPool$Poolable$1),
      which is held by "XML-RPC-2"
      "XML-RPC-2":
      waiting to lock monitor 0x08d112f4 (object 0xbd2ed570, a org.apache.xmlrpc.util.ThreadPool$Poolable),
      which is held by "XML-RPC Weblistener"

      Java stack information for the threads listed above:
      ===================================================
      "XML-RPC-13":
              at org.apache.xmlrpc.util.ThreadPool.repool(Unknown Source)
              - waiting to lock <0xbb6959c0> (a org.apache.xmlrpc.util.ThreadPool)
              at org.apache.xmlrpc.util.ThreadPool$Poolable$1.run(Unknown Source)
      "XML-RPC Weblistener":
              at org.apache.xmlrpc.util.ThreadPool$Poolable.start(Unknown Source)
              - waiting to lock <0xbd2ed340> (a org.apache.xmlrpc.util.ThreadPool$Poolable$1)
              - locked <0xbd2ed570> (a org.apache.xmlrpc.util.ThreadPool$Poolable)
              at org.apache.xmlrpc.util.ThreadPool.startTask(Unknown Source)
              - locked <0xbb6959c0> (a org.apache.xmlrpc.util.ThreadPool)
              at org.apache.xmlrpc.webserver.WebServer.run(Unknown Source)
              at java.lang.Thread.run(Thread.java:619)
      "XML-RPC-2":
              at org.apache.xmlrpc.util.ThreadPool$Poolable.isShuttingDown(Unknown Source)
              - waiting to lock <0xbd2ed570> (a org.apache.xmlrpc.util.ThreadPool$Poolable)
              at org.apache.xmlrpc.util.ThreadPool$Poolable.access$000(Unknown Source)
              at org.apache.xmlrpc.util.ThreadPool$Poolable$1.run(Unknown Source)
              - locked <0xbd2ed340> (a org.apache.xmlrpc.util.ThreadPool$Poolable$1)

      Found 1 deadlock.
      ----------

      And another slight variant on the above, from a debug build of revision 769436.

      ----------
      Found one Java-level deadlock:
      =============================
      "XML-RPC-6":
      waiting to lock monitor 0x0870d8ec (object 0xbac020f8, a org.apache.xmlrpc.util.ThreadPool),
      which is held by "XML-RPC Weblistener"
      "XML-RPC Weblistener":
      waiting to lock monitor 0x0814de4c (object 0xbad73820, a org.apache.xmlrpc.util.ThreadPool$Poolable$1),
      which is held by "XML-RPC-5"
      "XML-RPC-5":
      waiting to lock monitor 0x0814eacc (object 0xbad73b48, a org.apache.xmlrpc.util.ThreadPool$Poolable),
      which is held by "XML-RPC Weblistener"

      Java stack information for the threads listed above:
      ===================================================
      "XML-RPC-6":
              at org.apache.xmlrpc.util.ThreadPool.repool(ThreadPool.java:136)
              - waiting to lock <0xbac020f8> (a org.apache.xmlrpc.util.ThreadPool)
              at org.apache.xmlrpc.util.ThreadPool$Poolable$1.run(ThreadPool.java:70)
      "XML-RPC Weblistener":
              at org.apache.xmlrpc.util.ThreadPool$Poolable.start(ThreadPool.java:106)
              - waiting to lock <0xbad73820> (a org.apache.xmlrpc.util.ThreadPool$Poolable$1)
              - locked <0xbad73b48> (a org.apache.xmlrpc.util.ThreadPool$Poolable)
              at org.apache.xmlrpc.util.ThreadPool.startTask(ThreadPool.java:168)
              - locked <0xbac020f8> (a org.apache.xmlrpc.util.ThreadPool)
              at org.apache.xmlrpc.webserver.WebServer.run(WebServer.java:338)
              at java.lang.Thread.run(Thread.java:619)
      "XML-RPC-5":
              at org.apache.xmlrpc.util.ThreadPool$Poolable.getTask(ThreadPool.java:99)
              - waiting to lock <0xbad73b48> (a org.apache.xmlrpc.util.ThreadPool$Poolable)
              at org.apache.xmlrpc.util.ThreadPool$Poolable.access$100(ThreadPool.java:47)
              at org.apache.xmlrpc.util.ThreadPool$Poolable$1.run(ThreadPool.java:59)
              - locked <0xbad73820> (a org.apache.xmlrpc.util.ThreadPool$Poolable$1)

      Found 1 deadlock.
      ----------

      1. Client.java
        2 kB
        Jochen Wiedmann
      2. Client.java
        1 kB
        Alan Burlison
      3. Server.java
        4 kB
        Alan Burlison
      4. ThreadPool.java
        7 kB
        Jochen Wiedmann
      5. ThreadPool.java
        8 kB
        Jochen Wiedmann


          Activity

          Alan Burlison added a comment - edited

          Fix confirmed - I've stress tested this with 10 concurrent clients each with 10 threads and couldn't break it. Thanks!

          Alan Burlison added a comment -

          Wow, that was quick - thanks a bunch, much appreciated. I'll pull your fix tomorrow and give it a good hammering. The two variables you identified (Poolable.shuttingDown and Poolable.task) were the ones I'd noted too, and the volatile declarations are a neat way of resolving the lock ordering issue that the various synchronized methods & blocks were causing, so I'm pretty confident you will have nailed it - I'll let you know the results of my testing. Thanks again!

          Jochen Wiedmann added a comment -

          Fixed in r769900. Thanks!

          Jochen Wiedmann added a comment -

          Thanks for the hint to jps and jstack. It showed me that the patched version of the ThreadPool at least did indeed decouple the ThreadPool and the Poolable class. With that knowledge it was easy to deduce a fix: By declaring the variables Poolable.shuttingDown and Poolable.task as volatile I could remove the synchronized keyword from several Poolable methods.
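
          The fix described here can be sketched roughly as follows. This is an illustrative reconstruction, not the actual r769900 patch: the Task interface and method names are assumed, and the surrounding ThreadPool plumbing is omitted.

```java
// Illustrative sketch of the volatile-based fix described above, not the
// actual XMLRPC-168 patch. Declaring the two fields volatile gives safe
// cross-thread visibility without holding the Poolable monitor, so the
// synchronized keyword can be dropped from the simple accessors and the
// lock-ordering cycle between ThreadPool and Poolable disappears.
public class PoolableSketch {
    // Hypothetical task interface standing in for the pool's task type.
    public interface Task { void run(); }

    private volatile boolean shuttingDown;  // was: guarded by synchronized methods
    private volatile Task task;             // was: guarded by synchronized methods

    // No 'synchronized' needed: a volatile write by one thread is visible
    // to subsequent volatile reads by any other thread.
    public void shutdown()          { shuttingDown = true; }
    public boolean isShuttingDown() { return shuttingDown; }
    public void setTask(Task t)     { task = t; }
    public Task getTask()           { return task; }
}
```

          Note that volatile only helps here because each field is read or written in a single step; it does not make compound read-modify-write sequences atomic.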

          With the new version of ThreadPool and a slightly more interesting version of the Client, I was able to perform 1000000 calls to the server from 10 concurrent clients. No client ceased to work.

          Alan Burlison added a comment -

          The traces come from jstack. When the server has deadlocked, use jps to get the pid of its process, then run 'jstack <pid>'.

          In my testing the new code isn't any different. The place where it deadlocks hasn't been touched by your changes, so that's not too surprising. The problem is the ordering of the locks. If you look in the trace above, the weblistener lock ordering is (ThreadPool$Poolable, ThreadPool$Poolable$1) and the worker lock ordering is (ThreadPool$Poolable$1, ThreadPool$Poolable). The $1 object is the anonymous Thread object created inside the Poolable constructor.

          The only way of fixing this is to change things so that both the weblistener and the worker thread acquire those locks in the same order. As the locks are in part taken out by synchronized methods (e.g. isShuttingDown), the only way of doing that is to change the order in which the calls are made, and that basically means a major rewrite.

          The nub of the problem is that the weblistener thread is locking 'down' into the Poolable and the Poolable thread is locking 'up' into the weblistener, and if those two things happen at the same time, you'll get a deadlock. Taking out locks via the use of synchronized methods means that the lock ordering is going to be determined by the call ordering, and if object A locks and then calls synchronized methods on object B whilst object B simultaneously locks and calls synchronized methods on object A, the resultant lock ordering is (A, B) and (B, A) - i.e. deadlock. That's an architectural issue, and I suspect the only way of fixing it is to rewrite most of the code.
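
          The (A, B) / (B, A) hazard and its standard remedy, a single global acquisition order, can be illustrated with a toy example (not Apache XML-RPC code; all names here are made up):

```java
// Toy illustration of the lock-ordering problem described above. If thread 1
// locks A then B while thread 2 locks B then A, they can deadlock. The
// standard remedy is one global acquisition order: every thread here takes
// A before B, so no cycle of waiting threads can form.
public class LockOrderDemo {
    private static final Object A = new Object();  // plays the role of the Poolable
    private static final Object B = new Object();  // plays the role of the inner Thread ($1)
    private static int counter = 0;

    private static void safeIncrement() {
        synchronized (A) {          // every thread takes A first...
            synchronized (B) {      // ...then B, so there is no (B, A) path
                counter++;
            }
        }
    }

    public static int run(int threads, int iterations) {
        counter = 0;
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) safeIncrement();
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter;  // join() makes all increments visible here
    }
}
```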

          That's why I suggested using the java.util.concurrent thread pool - that code will have been tested, and will scale. There's a trade-off between reinventing the wheel and losing JDK 1.4 compatibility as the j.u.concurrent stuff didn't come in until JDK 1.5. However there is a backport of the j.u.concurrent stuff - see http://backport-jsr166.sourceforge.net/

          Jochen Wiedmann added a comment -

          Ok, I was able to reproduce your problem. (The new ThreadPool seems to be better, btw, though it still locks up.) Question: How can I get those neat informational traces you provided above?

          Alan Burlison added a comment -

          Here are the test files I'm using. To test I'm running the server, then running multiple instances of the client (up to 10). I'm running this on a 4 x 3GHz machine, and it deadlocks in a matter of seconds.

          Alan Burlison added a comment -

          Here's the deadlock info from the updated class you attached:

          ----------
          Found one Java-level deadlock:
          =============================
          "XML-RPC-4":
          waiting to lock monitor 0x0814dbdc (object 0xf46b3d50, a org.apache.xmlrpc.util.ThreadPool$Poolable),
          which is held by "XML-RPC Weblistener"
          "XML-RPC Weblistener":
          waiting to lock monitor 0x0814ee38 (object 0xf46b3d68, a org.apache.xmlrpc.util.ThreadPool$Poolable$1),
          which is held by "XML-RPC-4"

          Java stack information for the threads listed above:
          ===================================================
          "XML-RPC-4":
                  at org.apache.xmlrpc.util.ThreadPool$Poolable.isShuttingDown(ThreadPool.java:97)
                  - waiting to lock <0xf46b3d50> (a org.apache.xmlrpc.util.ThreadPool$Poolable)
                  at org.apache.xmlrpc.util.ThreadPool$Poolable.access$000(ThreadPool.java:47)
                  at org.apache.xmlrpc.util.ThreadPool$Poolable$1.run(ThreadPool.java:59)
                  - locked <0xf46b3d68> (a org.apache.xmlrpc.util.ThreadPool$Poolable$1)
          "XML-RPC Weblistener":
                  at org.apache.xmlrpc.util.ThreadPool$Poolable.start(ThreadPool.java:108)
                  - waiting to lock <0xf46b3d68> (a org.apache.xmlrpc.util.ThreadPool$Poolable$1)
                  - locked <0xf46b3d50> (a org.apache.xmlrpc.util.ThreadPool$Poolable)
                  at org.apache.xmlrpc.util.ThreadPool.startTask(ThreadPool.java:176)
                  at org.apache.xmlrpc.webserver.WebServer.run(WebServer.java:338)
                  at java.lang.Thread.run(Thread.java:619)

          Found 1 deadlock.
          ----------

          Jochen Wiedmann added a comment -

          Right. The problem of cross locking should be addressed by my updated version.

          Alan Burlison added a comment -

          I agree the problem is not the class hierarchy, it is that outer class A is locking inner class B, and in turn B is trying to lock A. You could get exactly the same brokenness with non-nested classes.

          Jochen Wiedmann added a comment -

          Please try this updated version. Btw, I do not believe that the concept of inner classes has any relationship with synchronization. Although the non-static inner class has a reference to the outer class, they are nevertheless completely separate objects.
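
          This point is easy to verify: a non-static inner instance merely holds a reference to its enclosing instance, and the two remain distinct objects with distinct monitors. A minimal illustration (class names are made up, not from XML-RPC):

```java
// Minimal check of the point above: a non-static inner class instance is a
// separate object from its enclosing instance, so synchronizing on one does
// not lock the other. The deadlock therefore comes from how the two monitors
// are used, not from the inner-class relationship itself.
public class Outer {
    public class Inner {
        // Returns the enclosing Outer instance via the implicit reference.
        public Outer enclosing() { return Outer.this; }
    }

    public Inner makeInner() { return new Inner(); }
}
```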

          Alan Burlison added a comment -

          New stack trace for revision 769436

          Alan Burlison added a comment -

          I've pulled from the repo so what I have is basically the same as head (rev 769427), except that I've added a few debugging bits and pieces. I can roll back and update the stack traces.

          However I think the cause of the problem is that the implementation is fatally flawed - there's a nonstatic member class (ThreadPool.Poolable) that synchronizes on the enclosing class (ThreadPool), and the enclosing class in turn synchronizes on the enclosed class. No surprise therefore that it deadlocks.

          I suggest we replace the custom ThreadPool with the java.util.concurrent one in JDK 1.5.
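
          A rough sketch of what that replacement might look like. The class and method names here are hypothetical, chosen to mirror the custom ThreadPool's API; a real WebServer integration would need more than this.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of delegating worker management to java.util.concurrent
// instead of the custom org.apache.xmlrpc.util.ThreadPool. The executor owns
// all locking internally, so no deadlock-prone cross-synchronization between
// the listener and the workers is needed.
public class JucPoolSketch {
    private final ExecutorService pool;

    public JucPoolSketch(int maxThreads) {
        // Fixed-size pool, analogous to ThreadPool's bounded worker count.
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    // Counterpart of ThreadPool.startTask(): hand a request off to a worker.
    public void startTask(Runnable task) {
        pool.execute(task);
    }

    // Orderly shutdown: stop accepting tasks, wait for in-flight ones.
    public boolean shutdown(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

          As noted below, staying compatible with JDK 1.4 would require the JSR-166 backport rather than java.util.concurrent itself.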

          Jochen Wiedmann added a comment -

          The line numbers in the stack information don't match my version of the sources. Any idea why that's the case? Any possibility to supply a matching version?

          Alan Burlison added a comment - edited

          Found this issue when testing https://issues.apache.org/jira/browse/XMLRPC-167

            People

            • Assignee:
              Jochen Wiedmann
            • Reporter:
              Alan Burlison
            • Votes:
              0
            • Watchers:
              0
