Bug 50261 - graceful restart with multiple listeners using prefork MPM can result in hung processes
Status: RESOLVED LATER
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_prefork
Version: 2.2.17
Hardware: PC Solaris
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: MassUpdate
Depends on:
Blocks:
 
Reported: 2010-11-12 10:08 UTC by Charles Jardine
Modified: 2018-11-07 21:08 UTC



Attachments
Build configuration command (145 bytes, text/plain)
2010-11-12 10:08 UTC, Charles Jardine
Cut down configuration for the test (847 bytes, text/plain)
2010-11-12 10:09 UTC, Charles Jardine
System call trace (51.67 KB, text/plain)
2010-11-12 10:10 UTC, Charles Jardine
Backtrace of one stuck process ... (717 bytes, text/plain)
2010-11-12 10:11 UTC, Charles Jardine
... and the second one (717 bytes, text/plain)
2010-11-12 10:11 UTC, Charles Jardine

Description Charles Jardine 2010-11-12 10:08:01 UTC
Created attachment 26286 [details]
Build configuration command

I have symptoms like those described on bug 42829. However, since that bug was
marked RESOLVED FIXED on 2009-02-13 I am starting a new bug. I am sorry if this
is the wrong thing to do.

I am running Apache 2.2.17 (prefork) compiled from source using the Sun
Studio 12.1 C compiler on a Solaris 10 x86/64 system at kernel patch
level 142910-17. I will attach the configuration options used for the
build, and the cut-down httpd.conf I have used to reproduce the problem.

The problem is that, almost every time I do a graceful (USR1) restart, one
or more child processes remain stuck indefinitely in the 'Gracefully
finishing' state (represented by a 'G' in the status display). My
configuration contains more than one Listen directive. In the simplified
example, I listen on an IPv4 address and an IPv6 address.

I cannot reproduce the problem if I have only one Listen directive. It is
this detail which leads me to suspect that my problem is related to bug
42829.

I have managed to reproduce the problem with an httpd running under truss,
so I have a system call trace covering a graceful restart which left two
stuck processes. I will attach this, and pstack backtraces of the two
processes which were stuck. (Truss alters the timing, and reduces the
chance of stuck processes.)
Comment 1 Charles Jardine 2010-11-12 10:09:49 UTC
Created attachment 26287 [details]
Cut down configuration for the test
Comment 2 Charles Jardine 2010-11-12 10:10:31 UTC
Created attachment 26288 [details]
System call trace
Comment 3 Charles Jardine 2010-11-12 10:11:09 UTC
Created attachment 26289 [details]
Backtrace of one stuck process ...
Comment 4 Charles Jardine 2010-11-12 10:11:55 UTC
Created attachment 26290 [details]
... and the second one
Comment 5 Jeff Trawick 2010-11-12 11:13:20 UTC
Charles, can you hit the problem with "AcceptMutex sysvsem" ?

I wonder if the stall of these two children in pthread_mutex_lock is caused by the pthread mutex getting cleaned up in the parent when pconf is destroyed while there are still users of the mutex.  I don't know what happens when the parent munmaps the storage for the mutex or if that could be system dependent.

The following is just a quick hack to try to see if killing the pthread mutex in the parent during this graceful restart scenario is what causes the children to hang.  (It never deletes the old mutex.)

Charles, perhaps you could try to recreate with this patch and the default AcceptMutex?

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c	(revision 1034057)
+++ server/mpm/prefork/prefork.c	(working copy)
@@ -940,7 +940,7 @@
                                  ap_my_pid);
 
     rv = apr_proc_mutex_create(&accept_mutex, ap_lock_fname,
-                               ap_accept_lock_mech, _pconf);
+                               ap_accept_lock_mech, s->process->pool);
     if (rv != APR_SUCCESS) {
         ap_log_error(APLOG_MARK, APLOG_EMERG, rv, s,
                      "Couldn't create accept lock (%s) (%d)",

It isn't a permanent solution because it leaks pthread mutexes across graceful restart, but it may be helpful for the investigation.
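
For readers unfamiliar with the mechanism under discussion, the following is a minimal sketch of a pthread mutex shared between processes through a shared memory mapping. It is not APR's implementation and does not reproduce the hang; it only shows the kind of setup in which a parent that destroys the mutex or unmaps its storage too early would pull it out from under children still using it. The names shared_state, run_shared_mutex_demo and MAX_CHILDREN are hypothetical, and MAP_ANONYMOUS is assumed to be available on the platform.

```c
#define _DEFAULT_SOURCE           /* assumption: glibc; exposes MAP_ANONYMOUS */
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 16           /* hypothetical cap for this sketch */

/* Hypothetical layout: the mutex and the data it protects live together
 * in one shared mapping that every forked child inherits. */
struct shared_state {
    pthread_mutex_t lock;
    int counter;
};

/* Forks nchildren processes; each takes the shared mutex once and bumps
 * the counter. Returns the final counter value, or -1 on setup failure. */
int run_shared_mutex_demo(int nchildren)
{
    pid_t pids[MAX_CHILDREN];
    int forked = 0;

    assert(nchildren >= 0 && nchildren <= MAX_CHILDREN);

    struct shared_state *st = mmap(NULL, sizeof(*st), PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;

    /* Mark the mutex as usable from more than one process. */
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(st, sizeof(*st));
        return -1;
    }
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&st->lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(st, sizeof(*st));
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    st->counter = 0;

    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                /* stop forking; reap what we started */
        if (pid == 0) {
            pthread_mutex_lock(&st->lock);
            st->counter++;
            pthread_mutex_unlock(&st->lock);
            _exit(0);
        }
        pids[forked++] = pid;
    }

    /* Wait for every child before touching the mutex's lifetime. */
    for (int i = 0; i < forked; i++)
        waitpid(pids[i], NULL, 0);

    int result = st->counter;

    /* Safe only because no child can still be using the mutex here. */
    pthread_mutex_destroy(&st->lock);
    munmap(st, sizeof(*st));
    return result;
}
```

In this sketch the parent reaps all children before calling pthread_mutex_destroy and munmap. The hypothesis above concerns the opposite ordering, where the parent cleans up while old children may still be blocked on the mutex.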
Comment 6 Charles Jardine 2010-11-12 12:01:31 UTC
"AcceptMutex sysvsem" chases the problem away.

The patch also stops the process drain.

This is all good news. Thank you.
Comment 7 Eric Garreau 2010-11-15 04:50:50 UTC
We also had the same problem on a SunOS/sparc host, and we have replaced "AcceptMutex pthread" (which is 100% fine on Linux) with "AcceptMutex posixsem".

Our investigations have shown that the current code assumes that the 'apr_proc_mutex_lock(accept_mutex)' call /may/ exit when the process is notified of a cancellation.

Unfortunately, pthread_mutex_lock (when "AcceptMutex pthread") is not a cancellation point, so the call will not exit with a status code other than APR_SUCCESS, ... and the process will actually continue to wait until the next time it is lucky enough to capture the mutex, and then exit thanks to 'if(listener_may_exit)'.

On SunOS, it seems that the old processes are forgotten by the scheduler, so they stay visible "forever", until a genuine 'apachectl stop' is called ('apr_proc_mutex_lock()' exits when the mutex is destroyed).

We have used the 'posixsem' type as a quick workaround (a semaphore is a cancellation point), but have not measured the performance impact yet. We have not tried the 'sysvsem' type because 'posixsem' seemed to work.
Comment 8 chris 2011-08-01 19:24:12 UTC
I just thought I would add something I noticed.

I have Sparc Solaris 10 and I had noticed the graceful restart resulting in G's in server-status never going away with apache 2.2.13 years ago, compiled with SUN cc 12.1 and preinstalled ssl from /usr/sfw.  I installed 2.2.14 and the problem went away.  If you do not implement ssl there is no problem.

This week I set up a clean, patched Solaris 10, with httpd 2.2.18, compiled using gcc and openssl 0.9.8R in /usr/local/ssl.  I saw that the problem returned, G in server-status.  I installed 2.2.19 the same way and the problem remained.  

I returned to 2.2.14, configured and compiled the same way as 18 and 19, and the problem is NOT there.  

I'm just an admin but I thought this might help someone figure out why.

Thanks
Comment 9 Stefan Fritsch 2011-08-06 11:58:42 UTC
The only two changes to prefork between 2.2.14 and 2.2.18 are

http://svn.apache.org/viewvc?view=revision&revision=1069428
http://svn.apache.org/viewvc?view=revision&revision=1021621

Someone could try if reverting one of these two fixes the issue. If not, it may be a change in apr. You could try running 2.2.19 with the version of apr shipped with 2.2.14, then.
Comment 10 chris 2011-11-02 20:29:37 UTC
(In reply to comment #9)
> The only two changes to prefork between 2.2.14 and 2.2.18 are
> 
> http://svn.apache.org/viewvc?view=revision&revision=1069428
> http://svn.apache.org/viewvc?view=revision&revision=1021621
> 
> Someone could try if reverting one of these two fixes the issue. If not, it may
> be a change in apr. You could try running 2.2.19 with the version of apr
> shipped with 2.2.14, then.

I just installed another httpd 2.2.19/openssl 1.0 and when I do a graceful restart I get the G's in server-status.  If I do not load the ssl.conf, there is no problem.  This is on Solaris 10 and the compiler doesn't matter.  I have a linux server running 2.2.15 and openssl 1.0e and there is no problem, as well as my 2.2.14.  

Has there been any fix or patch for this that I can try please?

Someone suggested in this post to use the apr from 2.2.14 but I am not sure what he means by that, and what could be the problems with doing that please?

Thanks
Comment 11 chris 2012-03-27 21:43:33 UTC
Hi,

Has anyone found a fix for this, the Graceful being stuck in G in server-status please?  I really want to start upgrading my apache and we only do graceful restarts.

thanks
Comment 12 Charles Jardine 2012-03-28 07:49:03 UTC
(In reply to comment #11)
> Hi,
> 
> Has anyone found a fix for this, the Graceful being stuck in G in server-status
> please?  I really want to start upgrading my apache and we only do graceful
> restarts.
> 
> thanks

There is a workaround which I have found satisfactory. I have
the following lines in my configuration file:


  # The default setting of AcceptMutex is dependent on both
  # platform and version. For Solaris on 2.0 it was 'fcntl'.
  # In versions 2.2.16 and 17 the default is 'pthread', which
  # promises better performance. However, it doesn't work if there
  # is more than one Listen directive. See CJJ's bug at
  # https://issues.apache.org/bugzilla/show_bug.cgi?id=50261.
  # The following directive makes no change for 2.0 and
  # circumvents the bug for 2.2.

  AcceptMutex fcntl

  # LockFile is needed - /var/run is a good place. Docs say
  # 'The PID of the main server process is automatically
  # appended to the filename', so we can use the same name for
  # all instances. Note, the lock files are invisible when
  # the server is running - presumably unlinked.

  LockFile /var/run/httpd.lock

I have not noticed any performance problems caused by this. I would
like to suggest that, if the bug is not to be fixed, the default for
AcceptMutex for Solaris should be changed back to 'fcntl'.
Comment 13 chris 2012-04-16 20:12:50 UTC
Hi,

Thank you for the workaround.  I did find that it was happening on later versions too, but I will try it and see how it goes.  Is there a way to tell if the connection is being closed gracefully or abruptly shut with your fix?  Can you shed some light on what happens in a graceful stop?  

Thank you very, very much.
Comment 14 William A. Rowe Jr. 2018-11-07 21:08:43 UTC
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.