A graceful restart SIGTERMs the 'rotatelogs' child, but a long-lived httpd process may still be serving a connection via KeepAlive. The httpd child continues to write logdata into the pipe, but there's no process reading from the pipe anymore. Eventually the pipe fills up, and the httpd child just hangs. This bug was discovered when doing a large Subversion commit with apache configured to use piped logging to 'rotatelogs'. It's a pretty common setup, and it's likely to burn future Apache (and Subversion) users. Joe Orton says, "To fix this properly, I suppose piped loggers should not get SIGTERMedduring a graceful restart, they should read till EOF then exit(0): then when the last child attached to the piped logger for a particular generation quits, the pipe is closed and the piped logger terminates gracefully too, without losing log messages." Here's the complete mail thread: http://www.mail-archive.com/dev%40httpd.apache.org/msg19247.html
I've been pouring over Google search results trying to figure out the cause of an extremely similar bug which seems to affect the 1.3.x branch of the Apache HTTP server also. Has there been any movement on this issue. Is it likely 1.3.x would also have this problem?
Yes, this bug also affects 1.3.
Created attachment 15021 [details] Kill hung child httpd procs This little script (requires Proc::ProcessTable) is a hack to find child processes other than the root one that have been around since before the last rotatelogs process. Just in case it's useful to anyone else... can run from cron.
> processes other than the root one that have been around since before the last > rotatelogs process. Sorry... s/last/first/
FWIW, I'm playing with a patch to prevent the hang. Hopefully I can commit to 2.1 this weekend. Changing rotatelogs to snarf up all entries from these gracefully dying children would be nice, but first order of business is to prevent the hang occurring, since it can occur with any piped logger, and it shouldn't depend on how reliable the piped logger is. I can't reproduce the hang when I take care to close the read side of the logger pipe in the MPM children (i.e., the processes which serve requests and which would write to the logger pipe).
I just posted a patch and further comments to dev@httpd. [PATCH] PR 26467 child process can hang when piped logger goes away
A fix to avoid the hang has been committed to 2.1-dev and hopefully will be in 2.0.next. I stuck a patch for 1.3.x at http://people.apache.org/~trawick/pr26467_13.txt This 1.3 patch appears to Do The Right Thing, but YYMV. I haven't tested it in detail.
I have patched the version of Apache 1.3 I am running on two web servers that use piped loggers and I can already see that the symptoms of this problem are gone. There are no hanging PIDs being leftover at all anymore. I had a script that went through and killed the ones that were clearly hanging previously, and I've modified it to only tell me if any would be killed for hanging. None have come up yet, which would certainly never happen in the past even for the short time I've been running with the patch. This is certainly not proof that the 1.3 patch is perfect or that it doesn't cause any other problems, but it definitely does the right thing as best I can tell. What's the chances something like this could land in 1.3.34?
Regarding the 1.3 patch: Thanks for the feedback. I need to make sure it does the right thing on non-Unix, then I'll add it to 1.3 STATUS file, asking for approval to commit for 1.3.next. On Unix, just looking at the processes with lsof it is easy to see that it is doing the right thing and there should be no worries.
The 2.x fix has now been merged into the stable 2.0.x branch for the upcoming 2.0.55 release.
The hang is now fixed for 2.0.55, so marking this fixed. Avoiding the problem completely (i.e. not forcibly killing piped loggers at restart) is a slightly different issue.
Can someone clarify this bug for me? We'd been having a hang about every 2-14 days that started once we turned on the DisableWin32AcceptEx option. We noticed that the hang occurred after the child process restarted for whatever reason. Usually this was not triggered by a graceful restart, but occasionally it was. Meaning it would show up in the logs, but a person didn't usually choose to do a restart, something else caused it. We figured the build up of rotatelogs had something to do with the problem, as there were a ton of rotatelogs.exe sitting in there. I noticed this bug mentioned in the 2.0.55 release and tried the upgrade. We've been pretty stable since. It hung once during a graceful restart after 29 days, and has been up for 26 days since. I'm confident this bug was causing the hang, but still notice a tremendous number of rotatelogs.exe. After a fresh reboot, there are about 6 rotatelogs.exe is the process list. After 26 days on Parent Server Generation 14, we have 40+ rotatelogs.exe sitting in the process list. When we hang, there's no way to recover without rebooting Windows. At least not that I'm aware of. Apache seems to hang, tying up the socket(s), and stopping and starting apache fails. My concern is the number of accumulating rotatelogs, and the fact that we still hung after a month under similar circumstances during a graceful restart. Of course once a month is better than every couple of days. I just wanted to share what we have been experiencing with this bug, and it's apparent patch, and see if anyone could clarify what we are experiencing, if it is/was related to this bug, and why we still see so many rotatelogs.
*** Bug 40041 has been marked as a duplicate of this bug. ***