Bug 38137

Summary: Monitor deadlock in AsyncAppender
Product: Log4j - Now in Jira Reporter: Boris Folgmann <boris>
Component: AppenderAssignee: log4j-dev <log4j-dev>
Status: RESOLVED FIXED    
Severity: major CC: beckermanz, hallorant, jan.kronquist, kay.abendroth, schulz
Priority: P2    
Version: 1.2   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Attachments: Hotfix for solving the problem now.

Description Boris Folgmann 2006-01-05 15:04:09 UTC
I've got a multi-threaded web application running on a heavily loaded server.
Monitor deadlocks occured in log4j 1.2.13, which seem to be caused by improper
use of wait() and notify(). See the thread trace that shows the logging thread
and the dispatcher thread waiting on each other on the same BoundedFIFO.

"TP-Processor163" daemon prio=1 tid=0x08b564b0 nid=0x6369 in Object.wait()
[0x6cee2000..0x6cee3558]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO)
        at java.lang.Object.wait(Object.java:474)
        at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:104)
        - locked <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
        - locked <0x4b41d5f8> (a org.apache.log4j.AsyncAppender)
        at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
        at org.apache.log4j.Category.callAppenders(Category.java:203)
        - locked <0x4b407ac0> (a org.apache.log4j.spi.RootLogger)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.info(Category.java:663)
[...]

"Dispatcher-Thread-17" daemon prio=1 tid=0x69d6ce80 nid=0x4a83 in Object.wait()
[0x6b4af000..0x6b4af7d8]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO)
        at java.lang.Object.wait(Object.java:474)
        at org.apache.log4j.Dispatcher.run(AsyncAppender.java:312)
        - locked <0x4a854cf8> (a org.apache.log4j.helpers.BoundedFIFO)
[...]

And lots of other threads blocked by the deadlocked logger, e.g.:
"TP-Processor3" daemon prio=1 tid=0x6a628430 nid=0x4ba4 waiting for monitor
entry [0x6c957000..0x6c958458]
        at org.apache.log4j.Category.callAppenders(Category.java:202)
        - waiting to lock <0x4b407ac0> (a org.apache.log4j.spi.RootLogger)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.info(Category.java:663)
[...]

Please study again the producer and consumer pattern regarding monitors. See
this and the following pages:
http://java.sun.com/docs/books/tutorial/essential/threads/waitAndNotify.html

Having a quick look at the AsyncAppender source code I suggest to remove the
checks wasEmpty() and wasFull() before sending the notify, so a notify is always
sent.
Comment 1 Boris Folgmann 2006-01-17 18:18:57 UTC
I found a second situation where the buggy check bf.length() == 0 leads to a
deadlock:

Very often it happens that AsyncAppender.close() hangs like this:
"TP-Processor251" daemon prio=1 tid=0x699f7ed8 nid=0x39b7 in Object.wait()
[0x70202000..0x70203558]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4b4f6208> (a org.apache.log4j.Dispatcher)
        at java.lang.Thread.join(Thread.java:1095)
        - locked <0x4b4f6208> (a org.apache.log4j.Dispatcher)
        at java.lang.Thread.join(Thread.java:1148)
        at org.apache.log4j.AsyncAppender.close(AsyncAppender.java:144)

and

"Dispatcher-Thread-134" daemon prio=1 tid=0x08ebbaf8 nid=0x1bde in Object.wait()
[0x6bc9c000..0x6bc9c658]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4d420e68> (a org.apache.log4j.helpers.BoundedFIFO)
        at java.lang.Object.wait(Object.java:474)
        at org.apache.log4j.Dispatcher.run(AsyncAppender.java:312)
        - locked <0x4d420e68> (a org.apache.log4j.helpers.BoundedFIFO)

The reason is, that Dispatcher.close() DID NOT notify Dispatcher.run(), so
Thread.join() in AsyncAppender.close() waits forever!

Please fix that now, since AsyncAppender is totally unusable with that many
concurrency bugs.

Comment 2 Boris Folgmann 2006-01-20 13:00:05 UTC
Created attachment 17468 [details]
Hotfix for solving the problem now.

The patched version runs perfectly on our test and production servers.
Comment 3 Boris Folgmann 2006-01-20 13:04:35 UTC
Since there was no reaction :-(, we had to fix the problem for ourselves. I
strongly urge everybody who uses the AsyncAppender to apply our patch to avoid
deadlock/starvation problems.
There might be possibilities to avoid some notify() calls, but it's always
better to make no assumptions if a different thread is currently wait()ing or
not. Even when you programmed both threads, this technique is more fault tolerant.
Comment 4 Scott Deboy 2006-01-20 17:53:16 UTC
The patch appears to be in some binary form.  Can you provide a patch created by
using diff?

Thanks
Comment 5 Boris Folgmann 2006-01-21 16:56:28 UTC
It's only gzipped, seems that the suffix (.gz) was lost.
Comment 6 Curt Arnold 2006-03-16 06:07:20 UTC
I would appreciate if you could check if new AsyncAppender implementation (see
bug 38982) resolves the issues observed.  See first patch file for instructions
on modifying a log4j 1.2 build to use the new AsyncAppender.  Any critique or
comments would be appreciated, particularly if you would recommend incorporating
the new AsyncAppender in a log4j 1.2 bug fix release or developing a more
incremental modification.
Comment 7 Curt Arnold 2006-03-16 06:21:23 UTC
*** Bug 26224 has been marked as a duplicate of this bug. ***
Comment 8 Curt Arnold 2006-03-16 06:31:09 UTC
*** Bug 28006 has been marked as a duplicate of this bug. ***
Comment 9 Curt Arnold 2006-06-28 16:17:57 UTC
Still waiting for feedback if rewrite of log4j's AsyncAppender resolved your issue.  The SVN for the log4j 
1.2 branch now contains the new AsyncAppender, so you no longer have to back port the appender from 
log4j 1.3.  Will eventually close this bug as resolved if there is no feedback.
Comment 10 Zev 2006-07-03 14:53:08 UTC
Curt, we have also experienced this exact problem in our production system. We 
are running a multi-clustered J2EE application. I would recommend increasing 
the priority on this Bug to P1 as this deadlock brings down our production 
system. We are implementing Boris' hotfix and will post the results as soon as 
they are available. If the hotfix is valid, I would also recommend a new 
release of log4j immediately to address the issue. We are currently using log4j 
1.2.8.
Thanks,
Zev.
Comment 11 Curt Arnold 2006-08-31 21:12:39 UTC
*** Bug 37904 has been marked as a duplicate of this bug. ***
Comment 12 Kay Abendroth 2006-11-12 02:43:43 UTC
(In reply to comment #10)
> [..] If the hotfix is valid, I would also recommend a new 
> release of log4j immediately to address the issue. We are currently using log4j 
> 1.2.8.
> Thanks,
> Zev.

Zev, does version 1.2.14 solve the issue for you?

Comment 13 Zev 2006-11-13 07:39:40 UTC
We have modified our development and system test environments to use the 1.2.14 
version of log4j and we have not had any problems. We will deploy this new 
version to production in about 6-8 weeks. If we have issues, i will post them 
here. however, to date, we have not had any problems with the new version.
Zev.
Comment 14 Kay Abendroth 2006-11-13 10:07:18 UTC
If Boris has no issues with 1.2.14 and Zev doesn't re-reply to this bug within
the next 12 weeks, this bug can be closed around February.
Comment 15 Curt Arnold 2007-02-22 13:59:30 UTC
No problem reported with 1.2.14 AsyncAppender, closing issue.
Comment 16 Yufeng Tsui 2007-11-29 13:44:17 UTC
(In reply to comment #6)
I am seeing this bug in 1.2.14 version of log4j.  We are running standalone 
java process.  Here is the new thread dump:
"Thread xxxx" prio=1 tid=0x0000002b3f8b4810 nid=0x6b91 in Object.wait() 
[0x0000000044270000..0x0000000044270c30]     
at java.lang.Object.wait(Native Method)
     - waiting on <0x0000002ade1969b0> (a java.util.ArrayList)
     at java.lang.Object.wait(Object.java:474)
     at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:193)
     - locked < 0x0000002ade1969b0> (a java.util.ArrayList)
     at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
     - locked < 0x0000002ade1dbd30> (a org.apache.log4j.AsyncAppender)
     at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders
(AppenderAttachableImpl.java:65)
     at org.apache.log4j.Category.callAppenders(Category.java:203)
     - locked < 0x0000002ade21cf90> (a org.apache.log4j.spi.RootLogger)
     at org.apache.log4j.Category.forcedLog(Category.java:388)
     at org.apache.log4j.Category.log(Category.java:853)
     at org.apache.commons.logging.impl.Log4JLogger.debug(Log4JLogger.java:110)
     at org.apache.axis.encoding.DeserializationContextImpl.endElement
(DeserializationContextImpl.java:1034)
     at org.apache.axis.message.SAX2EventRecorder.replay
(SAX2EventRecorder.java:204)
     at org.apache.axis.message.MessageElement.publishToHandler
(MessageElement.java:722)
     at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:233)
     at org.apache.axis.message.RPCElement.getParams(RPCElement.java:347)
     at org.apache.axis.client.Call.invoke(Call.java:2272)
     at org.apache.axis.client.Call.invoke(Call.java:2171)
     at org.apache.axis.client.Call.invoke(Call.java:1691)
     at com.ml.marginhub.fipb.MHBridgeSoapBindingStub.getHairCut(Unknown 
Source)
     at java.lang.Thread.run(Thread.java:595)
Comment 17 Curt Arnold 2007-11-29 14:27:02 UTC
Comment #16 if it is a bug is a different bug and should be filed as a new bug.

Waiting on line 193 is expected when blocking is set to true and the buffer is full.  Waiting there doesn't by 
itself doesn't indicate a bug.

Please retry using log4j 1.2.15 (only some slight changes since 1.2.14, but maybe enough) and if the 
problem still occurs please describe the situation more fully, such as any deadlocks on other threads, the 
nature of the wrapped appender, any status on the dispatch thread, and diagnostic messages on the 
console, etc.  Please report if the situation is repeatable and if it occurs frequently, occasionally, etc.