Bug 22484 - semaphore problem takes httpd down
Summary: semaphore problem takes httpd down
Status: REOPENED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_prefork
Version: 2.0.47
Hardware: HP HP-UX
Importance: P3 major
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FAQ
Duplicates: 22516 25418
Depends on:
Blocks:
 
Reported: 2003-08-16 19:42 UTC by vtmue
Modified: 2006-01-06 19:44 UTC



Description vtmue 2003-08-16 19:42:45 UTC
Hello,

I am running into the problem that is discussed here:

http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0xf91e36e69499d611abdb0090277a778c,00.html

But the proposed fix (raising the semaphore-related kernel parameters) does not
help. We run HP-UX 11.11 at a fairly recent patch level. A full trace of Apache
2.0.47 up to its "suicide" is available (provided httpd is alive...) at:

http://vorsprung-durch-denken.de/apache-trace.txt

And the main error log holds:

tons of: [emerg] (22)Invalid argument: couldn't grab the accept mutex
some: [emerg] (36)Identifier removed: couldn't grab the accept mutex
few: [emerg] (28)No space left on device: couldn't grab the accept mutex

The symptom does not appear to be related to the number of children. I watched
the parent die with 46 and another time with 17 child processes. The load of the
machine is around 0.2 all the time.

To put it straight: I'm stuck and hoping some good soul out there can help!

Any hint is appreciated - TIA
vt
Comment 1 vtmue 2003-08-17 12:54:35 UTC
We worked around the problem by temporarily setting AcceptMutex to fcntl.
Comment 2 Jeff Trawick 2003-08-19 11:20:53 UTC
*** Bug 22516 has been marked as a duplicate of this bug. ***
Comment 3 Jeff Trawick 2003-08-19 11:36:28 UTC
BTW, the trace provided is just of the parent process, so it doesn't show the
semaphore errors encountered in the children.

I don't think this is an Apache or APR problem.  (The APR codebase has the code
that uses SysV semaphores.)  While there have been at least a few people
encountering this on HP-UX, semaphore problems that could not be resolved by
system tuning haven't been reported elsewhere, and presumably many other HP-UX
users are running Apache successfully.  Maybe there is further tuning necessary
on your system, maybe you have a bad level of some kernel code, maybe I don't
know what I'm talking about :)

If you want to pursue this further with us, we need a trace that shows
Apache+APR doing something invalid with the semaphores.  If you have OS support
from HP, you might describe to them what tuning you performed already and see if
they have additional recommendations.
Comment 4 vtmue 2003-08-19 12:51:12 UTC
Jeff,

I read about other HP-UX and Solaris users who appear to face the very same
symptom. HP supplies a compiled binary, so I suppose many users will stick with
that.

My trace tool is able to follow forks, but there are a couple of showstoppers on
my side: the affected server is in production and we are in the process of
downgrading back to 1.3. In addition, there are about 170 vhosts configured, and
httpd has approximately 50-60 concurrently active children during the day.

One of our first thoughts here was that one of the vhosts may generate an error
that causes the parent to shut down but we could not confirm this when searching
the logs. And I have to admit we haven't got the time to track this down any
further right now.

We are about to set up an 11i system with current patch level during the next
week. We can possibly set up 2.0.47 there and see if httperf can reproduce the
problem.

Cheers,
vt
Comment 5 Jeff Trawick 2003-08-22 11:56:53 UTC
>I read about other HP-UX and Solaris users who appear to face the very same
>symptom. HP supplies a compiled binary so many users will stick with this I
>suppose.

If there is some fix for this in the HP-supplied binary but not in Apache or
APR, we'd love to hear about it :)  I hope that isn't the situation.

>One of our first thoughts here was that one of the vhosts may generate an error
>that causes the parent to shut down but we could not confirm this when
>searching the logs. And I have to admit we haven't got the time to trace down
>this any further right now.

In the case that a child returned a fatal error which forced a shutdown, there
should be a message in error_log written by the parent by this code:

            ap_log_error(APLOG_MARK, APLOG_ALERT,
                         0, ap_server_conf,
                         "Child %" APR_PID_T_FMT
                         " returned a Fatal error..." APR_EOL_STR
                         "Apache is exiting!",
                         pid->pid);

In all likelihood the fatal error was simply the first unexpected ENOSPC from
attempting to acquire the mutex, then that child returned a fatal error, then
the semaphore got cleaned up, then remaining children that hadn't already died
due to shutdown started getting EINVAL on their semaphore operations.
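
The cascade described above can be reproduced in miniature. What follows is a minimal sketch, not httpd or APR code: demo_semop_after_removal and union demo_semun are hypothetical names, and it assumes a system where SysV semaphores are available to the calling user. It creates a private semaphore, removes it, and records the errno of the next acquire attempt. A process that calls semop() on a set that is already gone should see EINVAL, while one that was blocked inside semop() at the moment of removal gets EIDRM, which would be consistent with the mix of "Invalid argument" and "Identifier removed" messages in the log.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Some platforms leave it to the caller to define the semctl() argument
 * union; a private name avoids clashing with systems that do define it. */
union demo_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Hypothetical helper (not httpd/APR code).
 * Creates a private semaphore, acquires and releases it once, removes it,
 * then tries to acquire it again. On return *err_out holds the errno of
 * that last attempt (0 if it unexpectedly succeeded).
 * Returns 0 if the experiment ran, -1 if setup or removal failed.
 */
int demo_semop_after_removal(int *err_out)
{
    struct sembuf acquire, release;
    union demo_semun arg;
    int semid;

    assert(err_out != NULL);
    *err_out = 0;

    acquire.sem_num = 0; acquire.sem_op = -1; acquire.sem_flg = SEM_UNDO;
    release.sem_num = 0; release.sem_op = 1;  release.sem_flg = SEM_UNDO;

    semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;

    arg.val = 1;  /* initial value 1: the mutex starts unlocked */
    if (semctl(semid, 0, SETVAL, arg) < 0
        || semop(semid, &acquire, 1) < 0
        || semop(semid, &release, 1) < 0) {
        semctl(semid, 0, IPC_RMID, arg);
        return -1;
    }

    /* Simulate the semaphore being cleaned up during shutdown. */
    if (semctl(semid, 0, IPC_RMID, arg) < 0)
        return -1;

    /* A child that is still running now tries to grab the mutex. */
    if (semop(semid, &acquire, 1) < 0)
        *err_out = errno;
    return 0;
}
```

Passing the recorded value to strerror() shows the message such a late child would log after "couldn't grab the accept mutex".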
Comment 6 vtmue 2003-08-22 17:29:39 UTC
Hi Jeff,

Ok, I have to admit we have those:
[Sat Aug 16 16:08:16 2003] [notice] Apache/2.0.47 configured -- resuming normal operations
[Sat Aug 16 16:38:09 2003] [emerg] (28)No space left on device: couldn't grab the accept mutex
[Sat Aug 16 16:38:09 2003] [alert] Child 16480 returned a Fatal error...
Apache is exiting!
[Sat Aug 16 16:38:10 2003] [emerg] (36)Identifier removed: couldn't grab the accept mutex
[...]

Unfortunately a colleague deleted the clients' logs of that day so... :(

As for HP: from what I see in their release notes, they fixed a bug related to
semaphores/mod_ssl/dbm in 2.0.43, so that seems to be something different.
Besides, I take it for granted that they'll report problems once they find/fix
them.

At this time, I'm a bit clueless because I see no way to track this down. Can
you give me a hint where I can read about what could cause a child to produce a
"Fatal error"? (I googled but didn't find anything useful.) I'm willing to
investigate, but I can't trace 170 vhosts one after the other - many of them
using PHP.

Thanks, vt
Comment 7 Jeff Trawick 2003-10-10 18:18:28 UTC
This first error message from your last error log submission is the entire story:

[Sat Aug 16 16:38:09 2003] [emerg] (28)No space left on device: couldn't grab
the accept mutex

The kernel failed the semaphore acquire.  If you can't fix it with OS tuning,
then avoid it with "AcceptMutex fcntl" or some other mutex type.
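
For comparison, the fcntl alternative can be pictured as a blocking advisory write lock over a lock file that every child has open. The following is a minimal sketch under that assumption; demo_fcntl_lock, demo_fcntl_unlock and demo_set_lock are hypothetical names, and APR's actual implementation may differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Apply a whole-file lock operation, retrying if a signal interrupts it.
 * Returns 0 on success or the errno of the failed fcntl() call. */
static int demo_set_lock(int fd, short type)
{
    struct flock fl;
    int rc;

    assert(fd >= 0);
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;        /* F_WRLCK to acquire, F_UNLCK to release */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* length 0 covers the file to its end */

    do {
        rc = fcntl(fd, F_SETLKW, &fl);   /* F_SETLKW blocks until granted */
    } while (rc < 0 && errno == EINTR);

    return rc < 0 ? errno : 0;
}

/* Block until this process holds the exclusive lock. */
int demo_fcntl_lock(int fd)
{
    return demo_set_lock(fd, F_WRLCK);
}

/* Release the lock so another process can take it. */
int demo_fcntl_unlock(int fd)
{
    return demo_set_lock(fd, F_UNLCK);
}
```

A lock of this kind does not touch the SysV semaphore tables at all, and the kernel drops it automatically when the holding process exits, which is one plausible reason it sidesteps the failure reported here.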
Comment 8 Jeff Trawick 2003-12-10 18:24:41 UTC
*** Bug 25418 has been marked as a duplicate of this bug. ***
Comment 9 Jeff Trawick 2003-12-10 18:29:58 UTC
Not a problem in httpd or APR as far as anyone can tell...   If OS tuning can't
resolve the problems, then use the AcceptMutex directive to try a different
mutex mechanism.

The not-uncommon occurrence of mutex problems that defy easy resolution or
even explanation is why there is an AcceptMutex directive to start with :)
Comment 10 Phil White 2006-01-07 04:44:10 UTC
This bug is cropping up in a Gentoo build of 2.0.54 (revision 31, if anyone cares).

I've tried removing user limits on apache without success.  Is there a definitive
solution for Linux which doesn't involve switching the AcceptMutex directive?