The following arrived on the mailing list and is reproduced here mainly to collect the accompanying patches etc.:
I'd like to share some experience, as well as some patches, regarding
the use of log4cxx in a timing-critical application. First, a few words
about our requirements: it is a service which must generate network
packets with a precision on the order of a hundred microseconds, so
predictable code timing is essential. One can argue that log4cxx is not
well suited for such applications, but surprisingly it works pretty
well after some light tuning.
So, what were the issues?
Basically, from the library user's point of view they all looked the
same: all of a sudden, logging done with the LOG4CXX_DEBUG() macro
could take an unexpectedly long time to complete. For example, the same
trace statement which takes several μs 99% of the time would sometimes
take hundreds of microseconds, or even a few milliseconds. Further
investigation traced this down to a few root causes:
1. The async logger (which we were using, of course) has an internal
queue to pass log entries to the background disk-writer thread. This
queue is mutex-protected, which might seem fine until you think a
little more about it. First of all, someone calling LOG4CXX_DEBUG() to
simply put something into the log might not expect to be blocked
waiting for a mutex at all. Second, although measures were taken to
minimize the time the disk-writer thread holds that lock, OS schedulers
often work in such a way that a thread blocked on a mutex gets
de-scheduled. With a normal scheduler quantum, that means the logging
thread can be preempted for milliseconds.
2. There are mutexes protecting the internal state of both loggers and
appenders. This means that two separate threads calling LOG4CXX_DEBUG()
can block each other. Even if they use different loggers, they still
block on the appender! This has the same consequences for execution
timing and performance as described above.
3. The std::stringstream constructor takes some internal locks of its
own. Unfortunately, each MessageBuffer has its own instance of this
class, and the MessageBuffer is created inside the LOG4CXX_DEBUG()
macro. There is an optimization that avoids creating the stringstream
when logging simple strings, but as soon as your log statement involves
a single '<<' operator, it gets created (see the sketch after this
list).
4. Dynamic memory allocations. Unfortunately, there are still quite a
few of them, even though a memory pool is used in some other places.
These hidden calls to new and malloc introduce unpredictable delays.
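To make point 3 concrete, here is a simplified sketch of the
lazy-stream idea behind MessageBuffer (class and member names are
invented for the illustration, and the exact trigger conditions in the
real log4cxx helpers differ): plain C strings stay on a stream-free
fast path, while the first '<<' of any other argument type constructs
the stringstream.

    #include <memory>
    #include <sstream>
    #include <string>

    // Simplified sketch: strings are collected without a stream; the
    // first non-string argument forces construction of the stream,
    // with the constructor-locking cost described above.
    class LazyMessageBuffer
    {
    public:
        // Fast path: no stringstream is ever constructed.
        LazyMessageBuffer& operator<<(const char* msg)
        {
            if (stream)
                *stream << msg;
            else
                text += msg;
            return *this;
        }

        // Slow path: any other type triggers stream construction.
        template <typename T>
        LazyMessageBuffer& operator<<(const T& value)
        {
            if (!stream)
            {
                stream.reset(new std::ostringstream);
                *stream << text;
            }
            *stream << value;
            return *this;
        }

        std::string str() const
        {
            return stream ? stream->str() : text;
        }

    private:
        std::string text;
        std::unique_ptr<std::ostringstream> stream;
    };

In terms of the macro, LOG4CXX_DEBUG(logger, "plain text") stays on the
fast path, while something like LOG4CXX_DEBUG(logger, "value=" << x)
pays the stream construction cost on every enabled call.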
So, what did we do to mitigate these problems?
1. The natural solution for this issue was to use an atomic queue. A
few of them are available; we chose boost::lockfree::queue because it
can serve as a drop-in replacement, allowing us to keep all of the
present logic intact (see the sketch after this list).
2. After looking further into the code, it turned out that two
concurrent calls to LOG4CXX_DEBUG() from different threads are not
harmful, because the internal structures of the logger and appender are
not modified there. What really requires protection is the concurrency
between logging and configuring. Thus, we came to a solution:
read-write locks, where logging calls act as readers and
configuration/exit calls act as writers. With this approach, multiple
threads calling LOG4CXX_DEBUG() became free of any contention.
3. This problem also has a simple solution: make one static
std::stringstream object per thread using the thread_local keyword.
Unfortunately, we found one drawback: thread_local memory is not
released if a thread is neither detached nor joined. As there is some
code which does neither of these, we made the static stringstream an
XML-file configuration option. There could also be an issue with using
multiple MessageBuffer instances from within a single thread, but
LOG4CXX_DEBUG() does not do that.
4. At this time, we didn't do anything to address the dynamic memory
allocations.
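As an illustration of mitigation 1, here is a minimal sketch of the
producer/consumer hand-off through boost::lockfree::queue. The element
type, capacity, and overflow policy are made up for the example and are
not taken from our patch:

    #include <boost/lockfree/queue.hpp>
    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct LogEvent { const char* msg; };

    // Fixed-capacity lock-free queue; push() and pop() never block.
    boost::lockfree::queue<LogEvent*> eventQueue(16384);
    std::atomic<bool> done{false};

    void logCall(const char* msg)
    {
        // The event itself is still allocated with new, which is
        // exactly the remaining problem from point 4 above.
        LogEvent* ev = new LogEvent{msg};
        if (!eventQueue.push(ev))
            delete ev;  // queue full: this sketch simply drops the event
    }

    void writerThread()
    {
        LogEvent* ev;
        for (;;)
        {
            while (eventQueue.pop(ev))
            {
                std::puts(ev->msg);  // stands in for the disk write
                delete ev;
            }
            if (done.load(std::memory_order_acquire) && eventQueue.empty())
                break;
            std::this_thread::yield();  // queue empty: let producers run
        }
    }

    int main()
    {
        std::thread writer(writerThread);
        for (int i = 0; i < 1000; ++i)
            logCall("hello from the logging fast path");
        done.store(true, std::memory_order_release);
        writer.join();
    }

The important property is that logCall() never blocks on a mutex, so a
logging thread can no longer be de-scheduled while waiting for the
queue lock.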
So, if you want to give our patch a try, it is attached. It's based on
log4cxx 0.10.0-12 as found in Debian. There are new SHARED_MUTEX and
LOCK_R/LOCK_W macros defined in mutex.h and synchronized.h for easy
switching between read-write and conventional mutexes for benchmarking.
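The actual macros live in the patch; purely as an illustration of the
idea (log4cxx 0.10.0 predates C++17, so the patch builds on log4cxx's
own mutex wrappers instead), they could look roughly like this when
expressed with std::shared_mutex:

    #include <mutex>
    #include <shared_mutex>

    #define SHARED_MUTEX std::shared_mutex

    // Logging calls act as readers and take the lock in shared mode...
    #define LOCK_R(mtx) std::shared_lock<std::shared_mutex> guard_r_(mtx)

    // ...while configuration/shutdown calls act as writers and take it
    // exclusively, so reconfiguration still excludes all logging.
    #define LOCK_W(mtx) std::unique_lock<std::shared_mutex> guard_w_(mtx)

Redefining SHARED_MUTEX to a plain mutex and both lock macros to an
exclusive lock is the kind of switch the benchmarking comparison
relies on.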
Also, there is a test application which spawns two threads doing
logging with a small sleep between iterations. It prints to stdout
whenever a trace statement took more than 500 microseconds. It might
look familiar to you, because it's based on one of the examples from
StackOverflow.
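For reference, a minimal sketch of what such a test looks like (the
500-microsecond threshold and the small sleep come from the description
above; the iteration count, logger name, and configuration call are
made up):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <log4cxx/basicconfigurator.h>
    #include <log4cxx/logger.h>

    // Each thread logs in a loop and reports every call that takes
    // longer than 500 microseconds.
    void loggingLoop(log4cxx::LoggerPtr logger)
    {
        using namespace std::chrono;
        for (int i = 0; i < 100000; ++i)
        {
            auto start = steady_clock::now();
            LOG4CXX_DEBUG(logger, "iteration " << i);
            auto us = duration_cast<microseconds>(
                steady_clock::now() - start).count();
            if (us > 500)
                std::printf("slow log call: %lld us\n",
                            static_cast<long long>(us));
            std::this_thread::sleep_for(microseconds(100));
        }
    }

    int main()
    {
        // A real run would configure the async appender from a file;
        // BasicConfigurator just keeps the sketch self-contained.
        log4cxx::BasicConfigurator::configure();
        log4cxx::LoggerPtr logger =
            log4cxx::Logger::getLogger("timing.test");
        std::thread t1(loggingLoop, logger);
        std::thread t2(loggingLoop, logger);
        t1.join();
        t2.join();
        return 0;
    }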
In our testing, the modified log4cxx has about 20% better performance
overall but, more importantly for us, it has far fewer cases where a
log statement takes excessive time to complete. The second part only
holds on CPUs with more than 2 cores, where all threads can physically
run in parallel. Unfortunately, we still see rare cases of prolonged
logging calls; we have traced those down to the dynamic memory
allocation issue.
Senior SW Engineer