I've got a multi-threaded web application running on a heavily loaded server. Monitor deadlocks occured in log4j 1.2.13, which seem to be caused by improper use of wait() and notify(). See the thread trace that shows the logging thread and the dispatcher thread waiting on each other on the same BoundedFIFO. "TP-Processor163" daemon prio=1 tid=0x08b564b0 nid=0x6369 in Object.wait() [0x6cee2000..0x6cee3558] at java.lang.Object.wait(Native Method) - waiting on <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO) at java.lang.Object.wait(Object.java:474) at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:104) - locked <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO) at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230) - locked <0x4b41d5f8> (a org.apache.log4j.AsyncAppender) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65) at org.apache.log4j.Category.callAppenders(Category.java:203) - locked <0x4b407ac0> (a org.apache.log4j.spi.RootLogger) at org.apache.log4j.Category.forcedLog(Category.java:388) at org.apache.log4j.Category.info(Category.java:663) [...] "Dispatcher-Thread-17" daemon prio=1 tid=0x69d6ce80 nid=0x4a83 in Object.wait() [0x6b4af000..0x6b4af7d8] at java.lang.Object.wait(Native Method) - waiting on <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO) at java.lang.Object.wait(Object.java:474) at org.apache.log4j.Dispatcher.run(AsyncAppender.java:312) - locked <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO) [...] And lots of other threads blocked by the deadlocked logger, e.g.: "TP-Processor3" daemon prio=1 tid=0x6a628430 nid=0x4ba4 waiting for monitor entry [0x6c957000..0x6c958458] at org.apache.log4j.Category.callAppenders(Category.java:202) - waiting to lock <0x4b407ac0> (a org.apache.log4j.spi.RootLogger) at org.apache.log4j.Category.forcedLog(Category.java:388) at org.apache.log4j.Category.info(Category.java:663) [...] Please study again the producer and consumer pattern regarding monitors. See this and the following pages: http://java.sun.com/docs/books/tutorial/essential/threads/waitAndNotify.html Having a quick look at the AsyncAppender source code I suggest to remove the checks wasEmpty() and wasFull() before sending the notify, so a notify is always sent.
I found a second situation where the buggy check bf.length() == 0 leads to a deadlock: Very often it happens that AsyncAppender.close() hangs like this: "TP-Processor251" daemon prio=1 tid=0x699f7ed8 nid=0x39b7 in Object.wait() [0x70202000..0x70203558] at java.lang.Object.wait(Native Method) - waiting on <0x4b4f6208> (a org.apache.log4j.Dispatcher) at java.lang.Thread.join(Thread.java:1095) - locked <0x4b4f6208> (a org.apache.log4j.Dispatcher) at java.lang.Thread.join(Thread.java:1148) at org.apache.log4j.AsyncAppender.close(AsyncAppender.java:144) and "Dispatcher-Thread-134" daemon prio=1 tid=0x08ebbaf8 nid=0x1bde in Object.wait() [0x6bc9c000..0x6bc9c658] at java.lang.Object.wait(Native Method) - waiting on <0x4d420e68> (a org.apache.log4j.helpers.BoundedFIFO) at java.lang.Object.wait(Object.java:474) at org.apache.log4j.Dispatcher.run(AsyncAppender.java:312) - locked <0x4d420e68> (a org.apache.log4j.helpers.BoundedFIFO) The reason is, that Dispatcher.close() DID NOT notify Dispatcher.run(), so Thread.join() in AsyncAppender.close() waits forever! Please fix that now, since AsyncAppender is totally unusable with that many concurrency bugs.
Created attachment 17468 [details] Hotfix for solving the problem now. The patched version runs perfectly on our test and production servers.
Since there was no reaction :-(, we had to fix the problem for ourselves. I strongly urge everybody who uses the AsyncAppender to apply our patch to avoid deadlock/starvation problems. There might be possibilities to avoid some notify() calls, but it's always better to make no assumptions if a different thread is currently wait()ing or not. Even when you programmed both threads, this technique is more fault tolerant.
The patch appears to be in some binary form. Can you provide a patch created by using diff? Thanks
It's only gzipped, seems that the suffix (.gz) was lost.
I would appreciate if you could check if new AsyncAppender implementation (see bug 38982) resolves the issues observed. See first patch file for instructions on modifying a log4j 1.2 build to use the new AsyncAppender. Any critique or comments would be appreciated, particularly if you would recommend incorporating the new AsyncAppender in a log4j 1.2 bug fix release or developing a more incremental modification.
*** Bug 26224 has been marked as a duplicate of this bug. ***
*** Bug 28006 has been marked as a duplicate of this bug. ***
Still waiting for feedback if rewrite of log4j's AsyncAppender resolved your issue. The SVN for the log4j 1.2 branch now contains the new AsyncAppender, so you no longer have to back port the appender from log4j 1.3. Will eventually close this bug as resolved if there is no feedback.
Curt, we have also experienced this exact problem in our production system. We are running a multi-clustered J2EE application. I would recommend increasing the priority on this Bug to P1 as this deadlock brings down our production system. We are implementing Boris' hotfix and will post the results as soon as they are available. If the hotfix is valid, I would also recommend a new release of log4j immediately to address the issue. We are currently using log4j 1.2.8. Thanks, Zev.
*** Bug 37904 has been marked as a duplicate of this bug. ***
(In reply to comment #10) > [..] If the hotfix is valid, I would also recommend a new > release of log4j immediately to address the issue. We are currently using log4j > 1.2.8. > Thanks, > Zev. Zev, does version 1.2.14 solve the issue for you?
We have modified our development and system test environments to use the 1.2.14 version of log4j and we have not had any problems. We will deploy this new version to production in about 6-8 weeks. If we have issues, i will post them here. however, to date, we have not had any problems with the new version. Zev.
If Boris has no issues with 1.2.14 and Zev doesn't re-reply to this bug within the next 12 weeks, this bug can be closed around February.
No problem reported with 1.2.14 AsyncAppender, closing issue.
(In reply to comment #6) I am seeing this bug in 1.2.14 version of log4j. We are running standalone java process. Here is the new thread dump: "Thread xxxx" prio=1 tid=0x0000002b3f8b4810 nid=0x6b91 in Object.wait() [0x0000000044270000..0x0000000044270c30] at java.lang.Object.wait(Native Method) - waiting on <0x0000002ade1969b0> (a java.util.ArrayList) at java.lang.Object.wait(Object.java:474) at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:193) - locked < 0x0000002ade1969b0> (a java.util.ArrayList) at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230) - locked < 0x0000002ade1dbd30> (a org.apache.log4j.AsyncAppender) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders (AppenderAttachableImpl.java:65) at org.apache.log4j.Category.callAppenders(Category.java:203) - locked < 0x0000002ade21cf90> (a org.apache.log4j.spi.RootLogger) at org.apache.log4j.Category.forcedLog(Category.java:388) at org.apache.log4j.Category.log(Category.java:853) at org.apache.commons.logging.impl.Log4JLogger.debug(Log4JLogger.java:110) at org.apache.axis.encoding.DeserializationContextImpl.endElement (DeserializationContextImpl.java:1034) at org.apache.axis.message.SAX2EventRecorder.replay (SAX2EventRecorder.java:204) at org.apache.axis.message.MessageElement.publishToHandler (MessageElement.java:722) at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:233) at org.apache.axis.message.RPCElement.getParams(RPCElement.java:347) at org.apache.axis.client.Call.invoke(Call.java:2272) at org.apache.axis.client.Call.invoke(Call.java:2171) at org.apache.axis.client.Call.invoke(Call.java:1691) at com.ml.marginhub.fipb.MHBridgeSoapBindingStub.getHairCut(Unknown Source) at java.lang.Thread.run(Thread.java:595)
Comment #16 if it is a bug is a different bug and should be filed as a new bug. Waiting on line 193 is expected when blocking is set to true and the buffer is full. Waiting there doesn't by itself doesn't indicate a bug. Please retry using log4j 1.2.15 (only some slight changes since 1.2.14, but maybe enough) and if the problem still occurs please describe the situation more fully, such as any deadlocks on other threads, the nature of the wrapped appender, any status on the dispatch thread, and diagnostic messages on the console, etc. Please report if the situation is repeatable and if it occurs frequently, occasionally, etc.