Log4cxx
  1. Log4cxx
  2. LOGCXX-186

Garbage characters in log files when log requests from multiple threads with hyperthreading enabled

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.10.0
    • Component/s: Appender
    • Labels:
      None
    • Environment:
      Ubuntu, Intel, multi-processor hyperthreading

      Description

      On Jun 29, 2007, at 8:09 AM, pete@mu.org wrote:

      Hi, I've been using the subversion log4cxx-10.0.0, which is currently at revision 551872.

      I'm running on an ubuntu linux box running kernel 2.6.20-16, and g++ 4.1.2

      I've been seeing severe race conditions between threads when logging to a FileAppender.

      Basically, when multiple threads write to the same log file, some of the data is corrupt.

      Here is a little program will show the issue.

      I've been pouring over the docs, and the mailing list archives, but I can't find a clear explanation as to why this shouldn't work.

      Any ideas what I'm doing wrong? Thanks, Pete

      ----------------------------------------------------------------------------
      #include <pthread.h>

      #include <iostream>
      #include <sstream>
      #include <vector>

      #include <log4cxx/logger.h>

      const int NUMTHREADS = 100 ;
      const int NUMWRITES = 100 ;

      log4cxx::LoggerPtr thread_logger = log4cxx::Logger::getLogger("ThreadManager");
      using std::cerr;
      using std::endl;

      void *Run( void *ptr ){

      int thread_number = reinterpret_cast<int>(ptr);

      for (int i=0 ; i< NUMWRITES ; i++)

      { std::ostringstream msg; msg << "Thread[" << *thread_number << "], Write [" << ( i + 1 ) << "]"; LOG4CXX_DEBUG(thread_logger, msg.str()); }

      pthread_exit(NULL);

      }

      int
      main(int argc, char *argv[]) {

      std::vector<pthread_t> threads(NUMTHREADS);

      log4cxx::LoggerPtr rootlogger = log4cxx::Logger::getRootLogger();

      int thread_number = 0;
      int *thread_cnt = 0;
      while (thread_number++ < NUMTHREADS)

      { thread_cnt = new int(thread_number); pthread_create(&threads[thread_number -1], NULL, Run, (void *) thread_cnt); }

      thread_number = NUMTHREADS;

      while (--thread_number >= 0)

      { pthread_join(threads[thread_number], NULL); }

      return 0;

      }
      ----------------------------------------------------------------------------

      Here is my log4cxx.properties:

      log4j.debug=true

      1. Loggers
        log4j.rootLogger=DEBUG, rootlog
        log4j.logger.ThreadManager=DEBUG, ThreadManager
      1. Additivity
        log4j.additivity.ThreadManager=false
      1. rootLogger
        log4j.appender.rootlog.File=root.log
        log4j.appender.rootlog.Append=false
        log4j.appender.rootlog=org.apache.log4j.FileAppender
        log4j.appender.rootlog.layout=org.apache.log4j.PatternLayout
        log4j.appender.rootlog.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
      1. ThreadManager
        log4j.appender.ThreadManager=org.apache.log4j.FileAppender
        log4j.appender.ThreadManager.File=ThreadManager.log
        log4j.appender.ThreadManager.Append=false
        log4j.appender.ThreadManager.layout=org.apache.log4j.PatternLayout
        log4j.appender.ThreadManager.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

      On Jun 29, 2007, at 10:53 PM, pete@mu.org wrote:

      Nevermind. This is a known bug.

      http://issues.apache.org/jira/browse/LOGCXX-129

      I should of just stayed with 97 which was working fine.

      Thanks,

      Pete

      On Jun 29, 2007, at 11:26 PM, Curt Arnold wrote:

      LOGCXX-129 only reported problems with AsyncAppender which was badly flawed in log4j 1.2.13 and earlier and deadlock reports in log4j started getting more frequent. The log4cxx AsyncAppender was pretty much a straight port of the early log4j equivalent and would be expected to have the same problems under stress. The new log4j AsyncAppender should be ported over to log4cxx and that is one of the remaining issues on a push to release.

      However, your test case does not appear to involve AsyncAppender and the anticipated resolution for LOGCXX-129 would have no impact. If you have any stack traces or other information to help localize the problem, they would be appreciated. Otherwise, it will be my top issue to investigate when I shift to log4cxx development tomorrow.

      On Jun 29, 2007, at 11:37 PM, pete@mu.org wrote:

      If you just build the small program that I sent and link it to the subversion log4cxx, then check out the log files it creates. They're full of junk along with valid lines too.

      It does it everytime. I'm working on a threaded project that uses 100's
      of threads, and it was always working great with log4cxx, then I
      upgraded to subversion source, and it started leaving garbage in the
      logs.

      Tonight, I finally decided to go back and reinstall the standard
      log4cxx-9 libs, and link with them. It completely fixed my problem.

      Have you tried to run that program I sent you and link it against the
      newest subversion log4cxx?

      Please try and and let me know what you find in the log files.

      That's the best example you could have of the problem.

      Thanks,

      Pete

      On Jun 30, 2007, at 12:32 AM, Curt Arnold wrote:

      I appreciate the test code, but have not had a chance to attempt to build and run it. I can't rapidly switch projects without endlessly churning so when I'm active on log4j development, I usually only answer questions or give hints until I finish whatever I'm currently working on. I expect to do so tomorrow morning (US Central Time, UTC-5). If you have any other insights in the interim please post them. I do appreciate the specifics on platform and compiler, that helps a huge amount.

      What indicates to you that the problem is in FileAppender and not in PatternLayout? I would think that PatternLayout would be a might more likely source of problems. Have you tried other appenders (telnet might be reasonable) and not had the problem? Have you tried a custom layout (all the stock layouts are preconfigured PatternLayout's)? Does the same issue still appear if you simplify the pattern (maybe down to just %m or %m%n)?

      On Jun 30, 2007, at 8:32 AM, pete@mu.org wrote:
      I'm still very new to log4cxx internals and all the options. I just
      learned enough to get it to do what I needed---back with log4cxx9.

      So, I'm afraid I don't have any insight at all into what might be
      causing the race condition. And I haven't used any other appenders or
      layouts.

      I tried to just put %m%n in the pattern, per your request, and I've attached the results of the root log so you can see what I'm talking about. As you'll see
      there are lots of valid lines, then just junk, then more good lines.
      See attached.

      I've also added my log4cxx.properties. If you want to make any changes
      and define some different appenders or whatever, and mail it back, I can run it for you and send you the results.

      Please let me know if there is anything else I can do.

      Thanks,

      Pete

      On Jul 1, 2007, at 1:34 AM, Curt Arnold wrote:

      I have been unable to reproduce the problem on Ubuntu Linux 6.06-1 and gcc 4.0.3 running under VMWare Fusion on Mac OS/X either in single processor or double processor mode. That is the glorious thing about race conditions, the slightly little change might hide them. Thanks for the output when you were seeing corruption, it at least suggests that the problem isn't in the layout since all the bogus characters appear between perfectly formatted messages. That suggests that the problem is in the FileAppender or probably more likely in the CharsetEncoder or CharsetDecoder.

      I won't be able to work on it until Monday, however it would be helpful to know if setting LOG4CXX_LOCALE_ENCODING_UTF8=1 (which will replace the APR encoders/decoders with straight byte copies) or the equivalent manual hack to src/charsetencoder.cpp and src/charsetdecoder.cpp changes the behavior.

      I've also run the helgrind tool from valgrind 2.2.0 and generated a report of possible race conditions in the executed code, but I haven't had time to review them yet.

      On Jul 1, 2007, at 5:49 AM, pete@mu.org wrote:

      Hmm., that's really odd. I can reproduce it on my two boxes with pent4
      with hyperthreading enabled, but I can't on my weenie single lapetop cpu.

      I'll try this and get back to you.

      LOG4CXX_LOCALE_ENCODING_UTF8=1

      Thanks,

      Pete

      1. root.log
        469 kB
        Curt Arnold
      2. helgrind.pid21835
        91 kB
        Curt Arnold

        Issue Links

          Activity

          Hide
          Curt Arnold added a comment -

          Looks like it is fixed (or at least less frequent).

          Show
          Curt Arnold added a comment - Looks like it is fixed (or at least less frequent).
          Hide
          Curt Arnold added a comment -

          On the example with LoggingEventPtr, I copied a line from logger.cpp but I didn't switch out some variable names. "this" should have been replaced with "thread_logger" and "level1" with "log4cxx::Level::getDebug()". However, probably likely needs a:

          thread_logger->isDebugEnabled();

          or

          LOG4CXX_DEBUG(thread_logger, "Hello");

          in the main code to trigger default initialization.

          The idea there was to determine if there was to isolate just the appending code to see if the problem was in the appender or in the logging event creation and dispatching code.

          My understanding is that:

          Test program using older SVN on hyperthreaded processor --> corrupted data
          Test program with preliminary log request on main thread using older SVN on hyp... -> corrupted data
          Test program using current SVN -> corrupted data
          Test program with preliminary log request using current SVN -> good data

          That would suggest a couple of potential problems:

          1. default initialization is performed more than once. I can review, but I think I did properly synchronize that so it shouldn't happen.
          2. default initialization doesn't block all logging requests and logging while configuring isn't safe.

          While it would be great to be safe to bulletproof while logging and configuring at the same time, I don't think that log4j achieved that and log4cxx inherited the same design. So if it is the second, I think we may have to live with it.

          Show
          Curt Arnold added a comment - On the example with LoggingEventPtr, I copied a line from logger.cpp but I didn't switch out some variable names. "this" should have been replaced with "thread_logger" and "level1" with "log4cxx::Level::getDebug()". However, probably likely needs a: thread_logger->isDebugEnabled(); or LOG4CXX_DEBUG(thread_logger, "Hello"); in the main code to trigger default initialization. The idea there was to determine if there was to isolate just the appending code to see if the problem was in the appender or in the logging event creation and dispatching code. My understanding is that: Test program using older SVN on hyperthreaded processor --> corrupted data Test program with preliminary log request on main thread using older SVN on hyp... -> corrupted data Test program using current SVN -> corrupted data Test program with preliminary log request using current SVN -> good data That would suggest a couple of potential problems: 1. default initialization is performed more than once. I can review, but I think I did properly synchronize that so it shouldn't happen. 2. default initialization doesn't block all logging requests and logging while configuring isn't safe. While it would be great to be safe to bulletproof while logging and configuring at the same time, I don't think that log4j achieved that and log4cxx inherited the same design. So if it is the second, I think we may have to live with it.
          Hide
          Pete G. added a comment - - edited

          I tried readding a log to rootlogger in main per your request:

          int
          main(int argc, char *argv[]) {
          ...
          log4cxx::LoggerPtr rootlogger = log4cxx::Logger::getRootLogger();
          --> LOG4CXX_DEBUG(rootlogger, "starting");

          ...
          int thread_number = 0;

          That actually worked, and it didn't before.

          In reference to your other post about changing the function to:

          std::ostringstream os;
          os << "Thread[" << *thread_number << "], Write [-]";
          std::string msg(os.str());

          for (int i=0 ; i< NUMWRITES ; i++)

          { LOG4CXX_DEBUG(thread_logger, msg); }

          That, did not fix the issue, but it still works as long as the rootlogger is logging in main too.

          Then I tried the second version you asked for, but I couldn't get the event logger to compile:

          #include <log4cxx/spi/loggingevent.h>
          #include <log4cxx/appender.h>
          ...
          std::ostringstream os;
          os << "Thread[" << *thread_number << "], Write [-]";
          std::string msg(os.str());
          log4cxx::spi::LoggingEventPtr event(new log4cxx::spi::LoggingEvent(this, level1, msg, log4cxx::spi::LocationInfo::NA));
          log4cxx::helpers::Pool p;

          for (int i=0 ; i< NUMWRITES ; i++)

          { thread_logger->callAppenders(event, p); }

          ERROR:

          main.cpp:34: error: invalid use of 'this' in non-member function
          main.cpp:34: error: 'level1' was not declared in this scope

          I'm not quite sure what
          log4cxx::spi::LoggingEventPtr does and how it works. Can you spot the error?

          Any way, it looks like it might be an initialization issue that you must of fixed in part, because I did try doing a log from main in my orignal, and it didn't help at the time.

          Please send me anything else you would like me to try.

          You're definitely close!

          Thanks,
          Pete

          Show
          Pete G. added a comment - - edited I tried readding a log to rootlogger in main per your request: int main(int argc, char *argv[]) { ... log4cxx::LoggerPtr rootlogger = log4cxx::Logger::getRootLogger(); --> LOG4CXX_DEBUG(rootlogger, "starting"); ... int thread_number = 0; That actually worked, and it didn't before. In reference to your other post about changing the function to: std::ostringstream os; os << "Thread [" << *thread_number << "] , Write [-] "; std::string msg(os.str()); for (int i=0 ; i< NUMWRITES ; i++) { LOG4CXX_DEBUG(thread_logger, msg); } That, did not fix the issue, but it still works as long as the rootlogger is logging in main too. Then I tried the second version you asked for, but I couldn't get the event logger to compile: #include <log4cxx/spi/loggingevent.h> #include <log4cxx/appender.h> ... std::ostringstream os; os << "Thread [" << *thread_number << "] , Write [-] "; std::string msg(os.str()); log4cxx::spi::LoggingEventPtr event(new log4cxx::spi::LoggingEvent(this, level1, msg, log4cxx::spi::LocationInfo::NA)); log4cxx::helpers::Pool p; for (int i=0 ; i< NUMWRITES ; i++) { thread_logger->callAppenders(event, p); } ERROR: main.cpp:34: error: invalid use of 'this' in non-member function main.cpp:34: error: 'level1' was not declared in this scope I'm not quite sure what log4cxx::spi::LoggingEventPtr does and how it works. Can you spot the error? Any way, it looks like it might be an initialization issue that you must of fixed in part, because I did try doing a log from main in my orignal, and it didn't help at the time. Please send me anything else you would like me to try. You're definitely close! Thanks, Pete
          Hide
          Curt Arnold added a comment -

          I've committed rev 552987 against this bug with a shopping list of synchronization fixes, however none of them appear to be likely related to your observed problem as they are not in the forcedLog code path, but mostly in the configuration or shutdown sequences. There will be a subsequent commit that will add some Windows specific code to ObjectPtrBase::exchange.

          One thing very atypical of your example is that it could result in many nearly simulatenous requests for default configuration (that is reading the log4cxx.properties). Typically an app would have made at least one logging call before launching 100 threads at a time. log4cxx should be able to handle it, but it would be a useful experiment to add one LOG4CXX_DEBUG call to the main program before launching all the threads.

          Please let me the results of the experiments from this and the previous comment. I'm at a dead end of what I can do without being able to reproduce the problem.

          Show
          Curt Arnold added a comment - I've committed rev 552987 against this bug with a shopping list of synchronization fixes, however none of them appear to be likely related to your observed problem as they are not in the forcedLog code path, but mostly in the configuration or shutdown sequences. There will be a subsequent commit that will add some Windows specific code to ObjectPtrBase::exchange. One thing very atypical of your example is that it could result in many nearly simulatenous requests for default configuration (that is reading the log4cxx.properties). Typically an app would have made at least one logging call before launching 100 threads at a time. log4cxx should be able to handle it, but it would be a useful experiment to add one LOG4CXX_DEBUG call to the main program before launching all the threads. Please let me the results of the experiments from this and the previous comment. I'm at a dead end of what I can do without being able to reproduce the problem.
          Hide
          Curt Arnold added a comment -

          I've spent most of the day looking over helgrind dumps. I've seen a couple of things that needed fixing (like the LOGCXX-187 fix), but none that seem to be in the pathway for LOGCXX_DEBUG statement. helgrind checks for writes to memory on different threads where there is not a common locked mutex between the threads. Unfortunately that gives false positives when a variable is set in a constructor (typically without synchronization since the object is not exposed) and a later synchronized write.

          The only thing that I was reported in forceLog code path was some potential race conditions in apr_pool.c. However, it looked a lot like the unsynchronized initialization that was resulting in the other false positives.

          The helgrind run doesn't need to actually reproduce the problem (which I have yet to be able to do), it just needs to walk over the offending code and detect that there is the opportunity for a race condition. So the lower number of threads doesn't make test useless. I would need to get rid of the false positives before increasing the number of threads.

          If you are still willing to experiment and can still reproduce the problem, try changing the core of Run to something like:

          std::ostringstream os;
          os << "Thread[" << *thread_number << "], Write [-]";
          std::string msg(os.str());

          for (int i=0 ; i< NUMWRITES ; i++)

          { LOG4CXX_DEBUG(thread_logger, msg); }

          If it still fails, do something like:

          #include <log4cxx/spi/loggingevent.h>
          #include <log4cxx/appender.h>
          ...
          std::ostringstream os;
          os << "Thread[" << *thread_number << "], Write [-]";
          std::string msg(os.str());
          log4cxx::spi::LoggingEventPtr event(new log4cxx::spi::LoggingEvent(this, level1, msg, log4cxx::spi::LocationInfo::NA));
          log4cxx::helpers::Pool p;

          for (int i=0 ; i< NUMWRITES ; i++)

          { thread_logger->callAppenders(event, p); }

          If that succeeds, move the event constructor and/or pool constructor into the loop.

          Show
          Curt Arnold added a comment - I've spent most of the day looking over helgrind dumps. I've seen a couple of things that needed fixing (like the LOGCXX-187 fix), but none that seem to be in the pathway for LOGCXX_DEBUG statement. helgrind checks for writes to memory on different threads where there is not a common locked mutex between the threads. Unfortunately that gives false positives when a variable is set in a constructor (typically without synchronization since the object is not exposed) and a later synchronized write. The only thing that I was reported in forceLog code path was some potential race conditions in apr_pool.c. However, it looked a lot like the unsynchronized initialization that was resulting in the other false positives. The helgrind run doesn't need to actually reproduce the problem (which I have yet to be able to do), it just needs to walk over the offending code and detect that there is the opportunity for a race condition. So the lower number of threads doesn't make test useless. I would need to get rid of the false positives before increasing the number of threads. If you are still willing to experiment and can still reproduce the problem, try changing the core of Run to something like: std::ostringstream os; os << "Thread [" << *thread_number << "] , Write [-] "; std::string msg(os.str()); for (int i=0 ; i< NUMWRITES ; i++) { LOG4CXX_DEBUG(thread_logger, msg); } If it still fails, do something like: #include <log4cxx/spi/loggingevent.h> #include <log4cxx/appender.h> ... std::ostringstream os; os << "Thread [" << *thread_number << "] , Write [-] "; std::string msg(os.str()); log4cxx::spi::LoggingEventPtr event(new log4cxx::spi::LoggingEvent(this, level1, msg, log4cxx::spi::LocationInfo::NA)); log4cxx::helpers::Pool p; for (int i=0 ; i< NUMWRITES ; i++) { thread_logger->callAppenders(event, p); } If that succeeds, move the event constructor and/or pool constructor into the loop.
          Hide
          Pete G. added a comment -

          I did try it with:

          LOG4CXX_LOCALE_ENCODING_UTF8=1

          The behavior was the same.

          By the way, you'll usually need more than 5 threads reproduce the problem. Even 20 threads only leaves garbage about half the time.

          An easy method to see the results under unix is to run the program like this. If there is garbage in the either file, the type will be "data".

          > logcheckthreadtest ; file *log

          root.log: data
          ThreadManager.log: data

          If there is no garbage, you should see something like this, at least under linux:

          root.log: ASCII text
          ThreadManager.log: ASCII text

          So, if you want to see the issue in helgrind, please be sure to confirm that it did produce the problem---especially with only 5 threads.

          You might of thought of that, but if not, I hope that helps.

          I've had situations where I need to run the program 10 times, before reproducing it even with 100 threads doing 100 writes.

          Thanks,

          Pete

          Show
          Pete G. added a comment - I did try it with: LOG4CXX_LOCALE_ENCODING_UTF8=1 The behavior was the same. By the way, you'll usually need more than 5 threads reproduce the problem. Even 20 threads only leaves garbage about half the time. An easy method to see the results under unix is to run the program like this. If there is garbage in the either file, the type will be "data". > logcheckthreadtest ; file *log root.log: data ThreadManager.log: data If there is no garbage, you should see something like this, at least under linux: root.log: ASCII text ThreadManager.log: ASCII text So, if you want to see the issue in helgrind, please be sure to confirm that it did produce the problem---especially with only 5 threads. You might of thought of that, but if not, I hope that helps. I've had situations where I need to run the program 10 times, before reproducing it even with 100 threads doing 100 writes. Thanks, Pete
          Hide
          Curt Arnold added a comment -

          Helgrind output (thread count in example was dropped to 5)

          Show
          Curt Arnold added a comment - Helgrind output (thread count in example was dropped to 5)
          Hide
          Curt Arnold added a comment -

          Example of corrupted output

          Show
          Curt Arnold added a comment - Example of corrupted output

            People

            • Assignee:
              Curt Arnold
              Reporter:
              Curt Arnold
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development