Created attachment 26286 [details] Build configuration command

I have symptoms like those described in bug 42829. However, since that bug was marked RESOLVED FIXED on 2009-02-13, I am starting a new bug. I am sorry if this is the wrong thing to do.

I am running Apache 2.2.17 (prefork) compiled from source using the Sun Studio 12.1 C compiler on a Solaris 10 x86/64 system at kernel patch level 142910-17. I will attach the configuration options used for the build, and the cut-down httpd.conf I have used to reproduce the problem.

The problem is that, almost every time I do a graceful (USR1) restart, one or more child processes remain stuck indefinitely in the 'Gracefully finishing' state (represented by a 'G' in the status display).

My configuration contains more than one Listen directive. In the simplified example, I listen on an IPv4 address and an IPv6 address. I cannot reproduce the problem if I have only one Listen directive. It is this detail which leads me to suspect that my problem is related to bug 42829.

I have managed to reproduce the problem with an httpd running under truss, so I have a system call trace covering a graceful restart which left two stuck processes. I will attach this, and pstack backtraces of the two processes which were stuck. (Truss alters the timing, and reduces the chance of stuck processes.)
Created attachment 26287 [details] Cut-down configuration for the test
Created attachment 26288 [details] System call trace
Created attachment 26289 [details] Backtrace of one stuck process ...
Created attachment 26290 [details] ... and the second one
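For readers without access to the attachments, the trigger condition described in the report (more than one Listen directive, one on an IPv4 address and one on an IPv6 address) can be sketched as below. This is only an illustration, not the attached configuration: the addresses are documentation-range placeholders and port 80 is an assumption.

```apache
# Hypothetical addresses (documentation ranges) and port; the reporter's
# real values are in attachment 26287 and are not reproduced here.
Listen 192.0.2.10:80
Listen [2001:db8::10]:80
```

According to the report, the problem could not be reproduced with only one of these Listen lines present.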
Charles, can you hit the problem with "AcceptMutex sysvsem"?

I wonder if the stall of these two children in pthread_mutex_lock is caused by the pthread mutex getting cleaned up in the parent when pconf is destroyed while there are still users of the mutex. I don't know what happens when the parent munmaps the storage for the mutex or if that could be system dependent.

The following is just a quick hack to try to see if killing the pthread mutex in the parent during this graceful restart scenario is what causes the children to hang. (It never deletes the old mutex.) Charles, perhaps you could try to recreate with this patch and the default AcceptMutex?

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c (revision 1034057)
+++ server/mpm/prefork/prefork.c (working copy)
@@ -940,7 +940,7 @@
                  ap_my_pid);
 
     rv = apr_proc_mutex_create(&accept_mutex, ap_lock_fname,
-                               ap_accept_lock_mech, _pconf);
+                               ap_accept_lock_mech, s->process->pool);
     if (rv != APR_SUCCESS) {
         ap_log_error(APLOG_MARK, APLOG_EMERG, rv, s,
                      "Couldn't create accept lock (%s) (%d)",

It isn't a permanent solution because it leaks pthread mutexes across graceful restart, but it may be helpful for the investigation.
"AcceptMutex sysvsem" chases the problem away. The patch also stops the process drain. This is all good news. Thank you.
We also had the same problem on a SunOS/sparc host, and we have replaced "AcceptMutex pthread" (which is 100% fine on Linux) with "AcceptMutex posixsem".

Our investigations have shown that the current code assumes that the 'apr_proc_mutex_lock(accept_mutex)' call /may/ return when the process is notified of a cancellation. Unfortunately, pthread_mutex_lock (with "AcceptMutex pthread") is not a cancellation point, so the call will not return with a status code other than APR_SUCCESS. The process will instead keep waiting until it is lucky enough to acquire the mutex, and only then exit thanks to 'if(listener_may_exit)'.

On SunOS, it seems that the old processes are forgotten by the scheduler, so they stay visible "forever", until a genuine 'apachectl stop' is issued ('apr_proc_mutex_lock()' returns when the mutex is destroyed).

We have used the 'posixsem' type as a quick workaround (a semaphore is a cancellation point), but have not yet measured whether it costs any performance. We have not tried the 'sysvsem' type because 'posixsem' seemed to work.
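The two workarounds reported so far each amount to a one-line change in httpd.conf. The sketch below assumes that both mechanisms are available in the local APR build, which depends on the platform; only one AcceptMutex directive should be active at a time.

```apache
# Reported in this thread to avoid the stuck 'G' children on SunOS/sparc:
AcceptMutex posixsem

# Alternative confirmed by the original reporter on Solaris 10 x86/64:
# AcceptMutex sysvsem
```

Neither setting addresses the underlying behaviour of the pthread mutex; both simply avoid using it for the accept lock.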
I just thought I would add something I noticed. I have Sparc Solaris 10, and years ago, with Apache 2.2.13 compiled with Sun cc 12.1 and the preinstalled SSL from /usr/sfw, I noticed graceful restarts leaving G's in server-status that never went away. I installed 2.2.14 and the problem went away. If you do not enable SSL, there is no problem.

This week I set up a clean, patched Solaris 10 with httpd 2.2.18, compiled using gcc and openssl 0.9.8R in /usr/local/ssl. I saw that the problem had returned: G in server-status. I installed 2.2.19 the same way and the problem remained. I went back to 2.2.14, configured and compiled the same way as 2.2.18 and 2.2.19, and the problem is NOT there.

I'm just an admin, but I thought this might help someone figure out why.

Thanks
The only two changes to prefork between 2.2.14 and 2.2.18 are

http://svn.apache.org/viewvc?view=revision&revision=1069428
http://svn.apache.org/viewvc?view=revision&revision=1021621

Someone could try if reverting one of these two fixes the issue. If not, it may be a change in apr. You could try running 2.2.19 with the version of apr shipped with 2.2.14, then.
(In reply to comment #9)
> The only two changes to prefork between 2.2.14 and 2.2.18 are
>
> http://svn.apache.org/viewvc?view=revision&revision=1069428
> http://svn.apache.org/viewvc?view=revision&revision=1021621
>
> Someone could try if reverting one of these two fixes the issue. If not, it may
> be a change in apr. You could try running 2.2.19 with the version of apr
> shipped with 2.2.14, then.

I just installed another httpd 2.2.19 with openssl 1.0, and when I do a graceful restart I get the G's in server-status. If I do not load ssl.conf, there is no problem. This is on Solaris 10, and the compiler doesn't matter. I have a Linux server running 2.2.15 with openssl 1.0e and there is no problem there, nor with my 2.2.14.

Is there any fix or patch for this that I can try, please? Someone suggested in this report using the apr from 2.2.14, but I am not sure what he means by that. What problems could doing that cause?

Thanks
Hi, Has anyone found a fix for this, the Graceful being stuck in G in server-status please? I really want to start upgrading my apache and we only do graceful restarts. thanks
(In reply to comment #11)
> Hi,
>
> Has anyone found a fix for this, the Graceful being stuck in G in server-status
> please? I really want to start upgrading my apache and we only do graceful
> restarts.
>
> thanks

There is a workaround which I have found satisfactory. I have the following lines in my configuration file:

# The default setting of AcceptMutex is dependent on both
# platform and version. For Solaris on 2.0 it was 'fcntl'.
# In versions 2.2.16 and 17 the default is 'pthread', which
# promises better performance. However, it doesn't work if there
# is more than one Listen directive. See CJJ's bug at
# https://issues.apache.org/bugzilla/show_bug.cgi?id=50261.
# The following directive makes no change for 2.0 and
# circumvents the bug for 2.2.
AcceptMutex fcntl

# LockFile is needed - /var/run is a good place. Docs say
# 'The PID of the main server process is automatically
# appended to the filename', so we can use the same name for
# all instances. Note, the lock files are invisible when
# the server is running - presumably unlinked.
LockFile /var/run/httpd.lock

I have not noticed any performance problems caused by this.

I would like to suggest that, if the bug is not to be fixed, the default for AcceptMutex for Solaris should be changed back to 'fcntl'.
Hi,

Thank you for the workaround. I did find that it was happening on later versions too, but I will try it and see how it goes. With your fix, is there a way to tell whether a connection is being closed gracefully or shut abruptly? Can you shed some light on what happens in a graceful stop?

Thank you very much.
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.