Intermittently after a graceful restart (apachectl graceful) extra Apache processes are left running on the server. A check to server-status shows a number of them to be in state 'G' (gracefully finishing). They remain in that state indefinitely, and will not go away until a stop-and-start or a restart is issued. The 'top' command shows one of these 'G' processes is stuck in kqread, while the rest are in lockf. (I believe the processes in lockf are blocking on the accept mutex, which is held by the process in kqread and therefore is never cleared.) A stack trace of the process in kqread state shows the process calling kevent() from the apr_pollset_poll() function in srclib/apr/poll/unix/kqueue.c, which is called from the main loop in child_main() of server/mpm/prefork/profork.c. When the process is allowed to continue, it never returns from kevent(). Frequently the error log contains the following message when the problem has occurred: [error] (9)Bad file descriptor: apr_socket_accept: (client socket) The behavior can be replicated in the Apache 2.2.4 release version on FreeBSD 6.2 stable. The build configuration is as follows: ./configure \ --with-apr=/usr/local/apr-httpd/ \ --with-apr-util=/usr/local/apr-util-httpd/ \ --prefix=/usr/local/httpd-2.2.4 \ --with-mpm=prefork \ --enable-mods=info The problem only occurs when at least one extra listener is configured in httpd.conf: Listen 80 Listen 8080 When replicating the problem it is useful to enable server-status by uncommenting the Include line for httpd-info.conf and adding an 'Allow from' line in the file conf/extra/httpd-info.conf. After Apache has been restarted a series of 'apachectl graceful' commands will usually trigger the problem. It speeds things up to run several of them on one command line (separated by semi-colons), but care is required not to run into another Apache bug with similar steps to duplicate (ID 39311). Usually the hung 'G' process will show up within 2-3 minutes by continually repeating in this way. Here is my take on what may happening, along with a patch that works for me: There appears to be a race condition between Apache prefork and kevent() that depends upon the timing of the receipt of SIGUSR1 in the child process. When SIGUSR1 is received, the current signal handler for prefork closes all listeners immediately. I believe if the close occurs before the process gets into kevent() to poll the sockets, it can cause kevent() to hang. My solution was to wait until after the !die_now loop to close listeners. The signal handler merely sets the die_now flag, nothing more. Here is my proposed patch: --- prefork.c.orig Fri Jul 6 11:12:53 2007 +++ prefork.c Fri Jul 6 11:18:24 2007 @@ -330,8 +330,6 @@ static void stop_listening(int sig) { - ap_close_listeners(); - /* For a graceful stop, we want the child to exit when done */ die_now = 1; } @@ -657,6 +655,7 @@ die_now = 1; } } + ap_close_listeners(); clean_child_exit(0); }
Created attachment 20470 [details] Proposed changes to the SIGUSR1 handler and prefork code
I have the exactly same problem on jailed Apache 2.2.4 running on FreeBSD 6.2.
i have same issue on my test/preproduction machine with apache 2.2.4 (with mod_ssl, mod_php) on top of gentoo 2007.0 and slackware 12.0 linux distribution ... tested with the package from distribution and with my own package builded from vanilla apache (the arhives from httpd.apache.org)
Same problem on Apache 2.2.4 (Debian/Lenny, amd x86_64, kernel 2.6.18)
I'm also seeing this on Debian testing on x86 . Looking at the error logs the problem has appeared before 2.2.4. I see the message: [Fri Nov 24 09:25:09 2006] [notice] Graceful restart requested, doing restart [Fri Nov 24 09:25:09 2006] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) [Fri Nov 24 09:25:23 2006] [notice] Apache configured -- resuming normal operations At this time the apache version on testing was 2.2.3 (so was on this box) Recently, after the upgrade to 2.2.4 and the fix to CVE-2007-3304 I'm also seeing this: [Thu Aug 23 13:00:14 2007] [notice] Graceful restart requested, doing restart [Thu Aug 23 13:00:14 2007] [notice] (10)No child processes: cannot send signal 10 to pid 19020 (non-child or already dead) [Thu Aug 23 13:00:14 2007] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) [Thu Aug 23 13:00:17 2007] [notice] Digest: generating secret for digest authentication ... [Thu Aug 23 13:00:21 2007] [notice] Digest: done [Thu Aug 23 13:00:21 2007] [notice] Apache/2.2.4 (Debian) mod_ssl/2.2.4 OpenSSL/0.9.8e DAV/2 configured -- resuming normal operations I've noticed that after such unsuccessfull gracefull restart the number of the remaining alive children stands almost constant(about 32-33, I've got MinSpareServers set to 30). I've started 200 parallel requests for a big file, but the number of the active children remained about 30, although I've got MaxClients set to 256.
Confirmed on Mandriva Linux Cooker as well with 2.2.6. The proposed patch seems to fix this.
the same problem on sles 10 sp1 (using 2.2.3). what are the implications of this, could any problems arise if this is not patched ?
(In reply to comment #7) > what are the implications of this, could any problems arise if this is not > patched ? It caused us problems in cases where a cron entry was saving off log files and then issuing a graceful restart every night. More and more processes ended up hung until we reached our MaxClients limit, resulting in denial of service.
We are running sles 10 and also getting the following: [Wed Sep 26 03:00:02 2007] [notice] Graceful restart requested, doing restart [Wed Sep 26 03:00:02 2007] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) [Wed Sep 26 03:00:02 2007] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) Syntax error on line 101 of /etc/apache2/server-tuning.conf: Invalid command 'BrowserMatch', perhaps misspelled or defined by a module not included in the server configuration [Wed Sep 26 03:10:02 2007] [warn] pid file /var/run/httpd2.pid overwritten -- Unclean shutdown of previous Apache run? [Wed Sep 26 03:10:02 2007] [notice] Apache/2.2.3 (Linux/SUSE) configured -- resuming normal operations It doesn't happen every night, very sporadically. For the mean time, I have a script that checks to make sure apache is running, and if not, restart it.
The patch from the bug submitter does not close the listening sockets until the current request has been finished, therefore breaking graceful-stop behaviour. Depending on OS and used apr_pollset_poll variant, it seems that both apr_pollset_poll() and lr->accept_func() can block if the sockets have been closed in the meantime. This can be mitigated by checking for die_now before calling these two functions (though a short race condition still remains). Maybe it would also be possible to change apr to not block if all listening sockets in the pollset are closed.
Created attachment 21002 [details] new patch for graceful-stop / graceful reload problem
it seems this bug may be related to similar issues for the event mpm? PR 43359: http://issues.apache.org/bugzilla/show_bug.cgi?id=43359 Has a more complicated patch...
(In reply to comment #12) > it seems this bug may be related to similar issues for the event mpm? > PR 43359: > http://issues.apache.org/bugzilla/show_bug.cgi?id=43359 > > Has a more complicated patch... I think that's a different bug. Here, calling apr_pollset_poll and/or accept when the listening sockets have already been closed, causing a block. In PR43359 not continuing to process the request when the listening sockets have been closed. I don't see how the patch from PR43359 would help in prefork, as we have to close the sockets in the signal handler (at least when we are in ap_process_connection). Otherwise the sockets will only be closed after the current request is finished, which would break graceful stop. Or am I missing something?
Created attachment 21137 [details] patch for graceful-stop / graceful reload problem without race condition This patch fixes the remaining race condition in the previous patch
> More and more processes ended up hung until we reached > our MaxClients limit, resulting in denial of service. Denial of service is also resulting from the memory occupied by the hung processes. This has been severe enough to cause to cause errors such as "/libexec/ld-elf.so.1: /lib/libc.so.6: mmap of entire address space failed: Cannot allocate memory." when attempting to connect via ssh.
So, am I correct in the assumption that the root cause of this bug is a sequence as follows: 1. child in child_main enters the while (!die_now) loop 2. child is SIGUSR1'ed 3. listening socket, fd N is close()d 4. child calls accept or epoll_wait/whatever *on closed fd N* and calling accept/epoll_wait/whatever on a closed fd results in undefined behaviour and hence the process may hang.
Presuming the assumption above is correct; Stefan's patch doesn't strictly close the race since the signal could still occur e.g. between checking die_now and the invocation of the accept function. To strictly close this race, I think ap_close_listeners() is going to have to dup something on to the listening fds which will guarantee that the accept/etc function call has defined behaviour. In the single-listener case this could be a simple hack like dup2(open("/dev/null", blah), listener). For the multi-listener case, it means invalidating the pollset fd, which is method-specific and will require a new APR interface.
Joe's description of the problem in comment #16 is accurate. I think I closed the race completely, because if the signal occurs between the check for die_now and accept(), the signal handler will only set die_now and not call ap_close_listeners() (because active_connection == 0). In this case the ap_close_listeners() is done in the while loop just before ap_process_connection(). But fixing it in apr, so that apr_pollset_poll, etc. have a defined behaviour would probably be better. Maybe
Ah, sorry, I missed that, thanks Stefan. The dup2 is actually simple in implementation, I'll attach a patch. I don't think it is necessary to worry about invalidating the pollsets -- the pollsets will contain references *to the listening fds*. So I think that simply dup2ing onto those fds should be sufficient. I'm not sure which approach is better/safer.
Created attachment 21406 [details] close listeners using dup2 to avoid undefined behaviour This closes the listeners by dup2'ing a dummy socket onto them. This should restore the guarantee of defined behaviour by a call to accept/apr_pollset_poll/ which references this fd, as opposed to closing it outright.
Using dup2 like this does have the advantage that it makes stop_listening() (more) async-signal-safe; calling apr_socket_close(), as the current code does, is particularly async-signal-unsafe since it modifies pool state.
(In reply to comment #20) > Created an attachment (id=21406) [edit] > close listeners using dup2 to avoid undefined behaviour > > This closes the listeners by dup2'ing a dummy socket onto them. This should > restore the guarantee of defined behaviour by a call to > accept/apr_pollset_poll/ which references this fd, as opposed to closing it > outright. Hm, and apr_pollset_poll called on the dummy_listener with APR_POLLIN as event for which we are looking for returns immediately?
Taking this up dev@httpd. http://marc.info/?l=apache-httpd-dev&m=120069757111682&w=2
I don't know if this is related or no. I have a similar problem with apache 2.2.4-3ubuntu0.1 of latest ubuntu gutsy. Apache prefork uses a "internal dummy connection" to tell childs they should die. When apache emits that request to a child (too many idle workers, MaxRequestsPerChild reached...), the children correctly dies (is not longer a proc in the system), but the apache main process still thinks it's "Gracefully finishing". Server status page shows processes on this state in a number increasing until Maxclients is reached and apache stops answering connections, so it's a very buig problem for me. It can be forced on my system easily issuing a /etc/init.d/apache2 reload. With that I get a lot of connections in the 'G' state (finished but the main apache still thinks they are finishing). It's on a xen virtual machine, maybe it's a problem related to this.
(In reply to comment #24) > I don't know if this is related or no. I have a similar problem with apache > 2.2.4-3ubuntu0.1 of latest ubuntu gutsy. Yes this is the same problem. See https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/174805 and possibly add a comment there to get the fix backported to gutsy.
I can confirm this bug for Apache 2.2.6 prefork on Solaris 10. Here is a trace of an unsuccessful SIGUSR1 call: Received signal #16, SIGUSR1, in fcntl() [caught] siginfo: SIGUSR1 pid=2185 uid=2002 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) Err#4 EINTR lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] close(7) = 0 close(5) = 0 close(3) = 0 setcontext(0xFFFFFFFF7FFFE210) fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) (sleeping...) fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) = 0 port_associate(22, 4, 0xFFFFFFFFFFFFFFFF, 0x00000001, 0x10039FE68) Err#81 EBADFD port_getn(22, 0x10039FD40, 3, 1, 0x00000000) (sleeping...) Received signal #16, SIGUSR1, in port_getn() [caught] siginfo: SIGUSR1 pid=2185 uid=2002 port_getn(22, 0x10039FD40, 3, 1, 0x00000000) Err#4 EINTR lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] close(-1) Err#9 EBADF close(-1) Err#9 EBADF close(-1) Err#9 EBADF setcontext(0xFFFFFFFF7FFFE360) write(13, " [ F r i A u g 2 2 ".., 104) = 104 close(22) = 0 close(11) = 0 close(10) = 0 _exit(1) error-log: [Fri Aug 22 15:08:04 2008] [error] (70007)The timeout specified has expired: apr_pollset_poll: (listen) Even after sending a 2nd USR1-signal the child doesn't exit: 3.6337 read(0, 0xFFFFFFFF7FFFEB78, 512) = 0 3.6338 close(0) = 0 3.6339 read(10, 0xFFFFFFFF7FFFED7F, 1) Err#11 EAGAIN 4.4083 Received signal #16, SIGUSR1, in fcntl() [caught] 4.4086 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) Err#4 EINTR 4.4087 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 4.4088 close(7) = 0 4.4089 close(5) = 0 4.4089 close(3) = 0 4.4090 setcontext(0xFFFFFFFF7FFFE2F0) fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) (sleeping...) 38.5953 Received signal #16, SIGUSR1, in fcntl() [caught] 38.5955 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) Err#4 EINTR 38.5956 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 38.5957 close(-1) Err#9 EBADF 38.5957 close(-1) Err#9 EBADF 38.5958 close(-1) Err#9 EBADF 38.5969 setcontext(0xFFFFFFFF7FFFE2F0) fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) (sleeping...) 56.4129 Received signal #16, SIGUSR1, in fcntl() [caught] 56.4130 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) Err#4 EINTR 56.4131 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 56.4132 close(-1) Err#9 EBADF 56.4133 close(-1) Err#9 EBADF 56.4133 close(-1) Err#9 EBADF 56.4134 setcontext(0xFFFFFFFF7FFFE2F0) 57.3615 Received signal #16, SIGUSR1, in fcntl() [caught] 57.3617 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) Err#4 EINTR 57.3618 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 57.3618 close(-1) Err#9 EBADF 57.3619 close(-1) Err#9 EBADF 57.3619 close(-1) Err#9 EBADF 57.3620 setcontext(0xFFFFFFFF7FFFE2F0) 57.8297 Received signal #16, SIGUSR1, in fcntl() [caught] 57.8298 fcntl(20, F_SETLKW, 0xFFFFFFFF7D925510) Err#4 EINTR 57.8299 lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] 57.8301 close(-1) Err#9 EBADF 57.8301 close(-1) Err#9 EBADF 57.8302 close(-1) Err#9 EBADF 57.8302 setcontext(0xFFFFFFFF7FFFE2F0) This is what a successful USR1 handling looks like: Received signal #16, SIGUSR1, in pollsys() [caught] pollsys(0xFFFFFFFF7FFFE928, 1, 0xFFFFFFFF7FFFE860, 0x00000000) Err#4 EINTR lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF] close(7) = 0 close(5) = 0 close(3) = 0 setcontext(0xFFFFFFFF7FFFDF50) pollsys(0xFFFFFFFF7FFFE928, 1, 0xFFFFFFFF7FFFE860, 0x00000000) = 1 read(0, 0xFFFFFFFF7FFFEAB8, 512) = 0 close(0) = 0 read(10, 0xFFFFFFFF7FFFECBF, 1) Err#11 EAGAIN close(22) = 0 close(11) = 0 close(10) = 0 mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFFFFFFFF76F00000 munmap(0xFFFFFFFF76F00000, 8192) = 0 _exit(0)
seems this bug still applies in 2.2.9 for graceful stop, can someone confirm?
Seems that this problem only appears if I have more than one socked opened by listen directive...
(In reply to comment #27) > seems this bug still applies in 2.2.9 for graceful stop, can someone confirm? > Yes, I can confirm. We currently have the same problem on 2.2.9 on Solaris 10 in production. I've also tested Joe's patch and that did not solve the issue for me. I'm not going to give Stephan's patch a try.
Joe's didn't work for me either, but Stefans' seemed to work. If nothing else it made the problem much less apparent.
(In reply to comment #30) > Joe's didn't work for me either, but Stefans' seemed to work. If nothing else > it made the problem much less apparent. Yes, I can confirm this. We have deployed Stefan Fritsch's patch in our test environment, and the issue seems to be gone now. I will deploy to production as soon as possible, but it would be a good idea to incorporate this patch into the official source.
Stefan Fritsch's latest patch has now been running in our production environment for quite some time, and all seems to be well. The issue is solved, and we can do graceful restarts again. I vote for this patch to be checked in the official source.
I've just upgraded our test-environment to Apache 2.2.10, and included Stefan's patch that used to work fine for me in version 2.2.9. However, even with this patch the issue is now still present in 2.2.10. Can anyone have a look, or will I be forced to start using a different threading model?
(In reply to comment #33) > I've just upgraded our test-environment to Apache 2.2.10, and included Stefan's > patch that used to work fine for me in version 2.2.9. However, even with this > patch the issue is now still present in 2.2.10. Nothing has changed in the relevant code in apache 2.2.10. But there was a change in apr that might have something to do with it. Can you get a stacktrace of a hanging process (should be possible with pstack)?
Created attachment 22822 [details] combination of the two patches above (21137 and 21406) You could also try if applying both patches helps.
For the record: I don't see any problems under Linux using 2.2.10 with the 2007-11-16 patch.
Comment on attachment 21406 [details] close listeners using dup2 to avoid undefined behaviour My patch won't help solve the problem, so there is no use anybody trying it; as explained in the thread on dev@httpd, the listening fds are removed from the epoll set on closure by the kernel, dup2'ing onto the fd is equivalent to closure, so the behaviour is equivalent.
(In reply to comment #37) > (From update of attachment 21406 [details]) > My patch won't help solve the problem, so there is no use anybody trying it; as > explained in the thread on dev@httpd, the listening fds are removed from the > epoll set on closure by the kernel, dup2'ing onto the fd is equivalent to > closure, so the behaviour is equivalent. This is true for epoll, but IIRC solaris 10 uses port_create and Cedric uses Solaris 10. I haven't checked if the semantics are the same as for epoll.
On a Solaris 10 machine with 2.2.10 with the 2007-11-16 patch, I didn't see any problems.
A proposed fix was committed to trunk (http://svn.apache.org/viewvc?rev=732451&view=rev). Is anyone able to try it out? I've tested it successfully with graceful stop/restart on OpenSolaris 2008.11.
(In reply to comment #40) > A proposed fix was committed to trunk > (http://svn.apache.org/viewvc?rev=732451&view=rev). Is anyone able to try it > out? I've tested it successfully with graceful stop/restart on OpenSolaris > 2008.11. > Seems to work under Linux, too.
CC myself on FreeBSD related bugs
now what should I apply to 2.2.11? it rejectsthe one from trunk
This is the patch which was applied to the 2.2.x branch: http://svn.apache.org/viewvc/httpd/httpd/branches/2.2.x/server/mpm/prefork/prefork.c?r1=739608&r2=739607&pathrev=739608
WIll this be in 2.2.12 ? Is there an ETA on when 2.2.12 will arrive? thanks!
>WIll this be in 2.2.12 ? yes >Is there an ETA on when 2.2.12 will arrive? no ETA available at present
for information, same problem here ( www-servers/apache-2.2.11-r2 gentoo hardened )
FYI, also an issue on SLES 11 64 bit. graceful-stop leaves threads hanging. the sockets appear to have been closed. /usr/sbin/httpd2 -V Server version: Apache/2.2.10 (Linux/SUSE) Server built: Feb 23 2009 21:08:14 Server's Module Magic Number: 20051115:18 Server loaded: APR 1.3.3, APR-Util 1.3.4 Compiled using: APR 1.3.3, APR-Util 1.3.4 Architecture: 64-bit Server MPM: Prefork threaded: no forked: yes (variable process count) Server compiled with.... -D APACHE_MPM_DIR="server/mpm/prefork" -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_SYSVSEM_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=128 -D HTTPD_ROOT="/srv/www" -D SUEXEC_BIN="/usr/sbin/suexec2" -D DEFAULT_PIDLOG="/var/run/httpd2.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_LOCKFILE="/var/run/accept.lock" -D DEFAULT_ERRORLOG="/var/log/apache2/err
Jon, I expect that graceful-stop issues aren't related to this bug. Plz open a new bug and post backtraces of some hanging threads.
Hi all, I think this is not related with normal use of Apache and formally coded modules. I have 3 servers running 3 different Slackware servers. Each is forced by cron, for a graceful restart at certain times of the day. This morning, on one of my servers I noticed [Sun Nov 28 06:10:01 2010] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) problem. As I searched, I found this page. I read all the comments and info provided here. Then, I checked my other two servers. But the other two are not having the same problem. I just wanted to drop a note for you, about the difference between erroneous server and the others. So that it might be helpful in finding your problem. The one which started producing the mentioned error is my development server. Where I am developing a module on handling requests. In the module I am intensively playig with pools and tables. I think the error started on my server after I coded, late in last night, some lines with apr_table_set, apr_table_unset, apr_table_merge etc. I will concentrate on this point for locating the problem. If you're having this error, try to disable your modules one by one and give a graceful restart. Probably you have an offending module which causes this. Following are the details of my servers. First one is my development server where I am having the error. Please notice that the first and the second are exactly the same, but they are running on different boxes (even on different countries) Server version: Apache/2.2.15 (Unix) Server built: Mar 8 2010 13:36:25 Server's Module Magic Number: 20051115:24 Server loaded: APR 1.3.9, APR-Util 1.3.9 Compiled using: APR 1.3.9, APR-Util 1.3.9 Architecture: 64-bit Server MPM: Prefork threaded: no forked: yes (variable process count) Server compiled with.... -D APACHE_MPM_DIR="server/mpm/prefork" -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_SYSVSEM_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=128 -D HTTPD_ROOT="/usr" -D SUEXEC_BIN="/usr/bin/suexec" -D DEFAULT_PIDLOG="/var/run/httpd/httpd.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_LOCKFILE="/var/run/httpd/accept.lock" -D DEFAULT_ERRORLOG="logs/error_log" -D AP_TYPES_CONFIG_FILE="/etc/httpd/mime.types" -D SERVER_CONFIG_FILE="/etc/httpd/httpd.conf" Following servers having no error: Server version: Apache/2.2.15 (Unix) Server built: Mar 8 2010 13:36:25 Server's Module Magic Number: 20051115:24 Server loaded: APR 1.3.9, APR-Util 1.3.9 Compiled using: APR 1.3.9, APR-Util 1.3.9 Architecture: 64-bit Server MPM: Prefork threaded: no forked: yes (variable process count) Server compiled with.... -D APACHE_MPM_DIR="server/mpm/prefork" -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_SYSVSEM_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=128 -D HTTPD_ROOT="/usr" -D SUEXEC_BIN="/usr/bin/suexec" -D DEFAULT_PIDLOG="/var/run/httpd/httpd.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_LOCKFILE="/var/run/httpd/accept.lock" -D DEFAULT_ERRORLOG="logs/error_log" -D AP_TYPES_CONFIG_FILE="/etc/httpd/mime.types" -D SERVER_CONFIG_FILE="/etc/httpd/httpd.conf" Server version: Apache/2.2.13 (Unix) Server built: Aug 10 2009 14:15:43 Server's Module Magic Number: 20051115:23 Server loaded: APR 1.3.8, APR-Util 1.3.9 Compiled using: APR 1.3.8, APR-Util 1.3.9 Architecture: 32-bit Server MPM: Prefork threaded: no forked: yes (variable process count) Server compiled with.... -D APACHE_MPM_DIR="server/mpm/prefork" -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_SYSVSEM_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=128 -D HTTPD_ROOT="/usr" -D SUEXEC_BIN="/usr/bin/suexec" -D DEFAULT_PIDLOG="/var/run/httpd/httpd.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_LOCKFILE="/var/run/httpd/accept.lock" -D DEFAULT_ERRORLOG="logs/error_log" -D AP_TYPES_CONFIG_FILE="/etc/httpd/mime.types" -D SERVER_CONFIG_FILE="/etc/httpd/httpd.conf" Good luck, Yavuz Darendelioglu (aka bioyavuz)
# /usr/sbin/httpd -v Server version: Apache/2.2.17 (Unix) Server built: Oct 19 2010 16:27:47 Still having exactly same problem (triggering graceful restart from command line one by one causes segmentation fault in approximately 3 minutes). I am using Plesk together with Zend Server, and it seems that they use this scenario (multiple graceful restarts) for log rotation or something and my web server is crashing 1-2 times a week. Looks like no patch saved in 2.2.17 or it is not working now.
My platform is CentOS 5.5.
(In reply to comment #51) > Still having exactly same problem (triggering graceful restart from command > line one by one causes segmentation fault in approximately 3 minutes). This is not the same problem. This bug is about hanging processes, not crashes. Please open a new bug report for your segfaults and provide backtraces. See the section "Debugging intermittent crashes" at http://httpd.apache.org/dev/debugging.html
I am still encountering this problem on Gentoo using Apache 2.2.17 with the itk mpm. Was it really resolved?
Sorry I should have reinforced the problem more precisely - this is what is being written to /var/log/apache2/error_log when I run /etc/init.d/apache2 reload: [Fri Jun 17 04:59:23 2011] [notice] Graceful restart requested, doing restart [Fri Jun 17 04:59:23 2011] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) [Fri Jun 17 04:59:25 2011] [notice] Apache/2.2.17 (Unix) mod_ssl/2.2.17 OpenSSL/1.0.0d configured -- resuming normal operations [Fri Jun 17 05:01:20 2011] [notice] Graceful restart requested, doing restart [Fri Jun 17 05:01:20 2011] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) [Fri Jun 17 05:01:22 2011] [notice] Apache/2.2.17 (Unix) mod_ssl/2.2.17 OpenSSL/1.0.0d configured -- resuming normal operations The error does not appear when I do a full Apache restart. I believe the error began as soon as I started using the MPM.
itk-mpm user: This bug is tracking something specific to the prefork MPM. Report elsewhere if you're using the itk-mpm.
(In reply to comment #54) > I am still encountering this problem on Gentoo using Apache 2.2.17 with the itk > mpm. Was it really resolved? If you use mpm-itk 2.2.17-01 or newer, then it should be resolved there, too. In any case, I am really confident that this bug is fixed. If anyone encounters a similar issue with 2.2.19, he should open a new PR.
I just thought I would add something I noticed. I have Sparc Solaris 10 and I had noticed the graceful restart resulting in G's in server-status never going away with apache 2.2.13 years ago, compiled with SUN cc 12.1 and preinstalled ssl from /usr/sfw. I installed 2.2.14 and the problem went away. If you do not implement ssl there is no problem. This week I set up a clean, patched Solaris 10, with httpd 2.2.18, compiled using gcc and openssl 0.9.8R in /usr/local/ssl. I saw that the problem returned, G in server-status. I installed 2.2.19 the same way and the problem remained. I returned to 2.2.14, configured and compiled the same way as 18 and 19, and the problem is NOT there. I"m just an admin and using prefork but I thought this might help someone figure out why. Thanks
(In reply to comment #58) > I just thought I would add something I noticed. > > I have Sparc Solaris 10 and I had noticed the graceful restart resulting in G's > in server-status never going away with apache 2.2.13 years ago, compiled with > SUN cc 12.1 and preinstalled ssl from /usr/sfw. I installed 2.2.14 and the > problem went away. If you do not implement ssl there is no problem. > In my opinion SSL is not the Problem, this also happens with multiple listeners (80, 81 e.g. loadbalancer terminated SSL), also stated somewhere above. But I still confirm this, possibly only on Solaris 10 (x86 or sparc does not matter). I will have a closer look on 2.2.14 to mayby stick with that version. Besides, FreeBSD and all tested Linux flavors are doing fine. > > I"m just an admin and using prefork but I thought this might help someone > figure out why. > Me too, so I'm not much of a help. Thanks Oliver
(In reply to comment #59) > I will have a closer look on 2.2.14 to mayby stick with that version. Besides, > FreeBSD and all tested Linux flavors are doing fine. PR 42829 is fixed. You are experiencing a different issue which is tracked in PR 50261, which also has a work-around. Please only add comments to that new PR.
Is the patch for fixing the apache-status showing G the attachment "patch for graceful-stop / graceful reload problem without race condition"? If so, why wasn't it implemented in the subsequent releases, except 14? I experienced the problem in 18 and 19. Does it work for Solaris please, because all that is mentioned is linux? Thank you for any information.
(In reply to comment #61)
Everytime I do a graceful-stop, I see the following in the error log: [Fri Dec 02 19:02:22 2011] [notice] caught SIGWINCH, shutting down gracefully [Fri Dec 02 19:02:22 2011] [error] (9)Bad file descriptor: apr_socket_accept: (client socket) I'm using apache 2.2.21; prefork MPM on linux. After reading through this thread (especially Joe's comment - https://issues.apache.org/bugzilla/show_bug.cgi?id=42829#c16). I figured out that it is a race condition where SIGUSR1 is called after the child has done "else if (die_now) {" and before the child has done "status = lr->accept_func(&csd, lr, ptrans);". The following patch fixed the problem for me: @@ -638,6 +638,14 @@ * ### than POLLIN */ lr = pdesc[last_poll_idx++].client_data; + + if (die_now) { + /* In graceful stop/restart; drop the mutex + * and terminate the child. */ + SAFE_ACCEPT(accept_mutex_off()); + clean_child_exit(0); + } + goto got_fd; } }
Hi, Can paul@silvitch.com specify which source file he included the patch into please? Thank you very very much. I have found this doesn't occur on my linux servers running Centos and apache 2.2.15 but solaris has the problem on that same version. It doesn't have the problem on 2.2.14 on Solaris. Also, it only happens when SSL is compiled in on solaris. Thanks for any help.