54852 – graceful restart takes very long time sometimes

Bug 54852 - graceful restart takes very long time sometimes

Summary: graceful restart takes very long time sometimes

Status:	RESOLVED FIXED

Alias:	None

Product:	Apache httpd-2
Classification:	Unclassified
Component:	mpm_prefork (show other bugs)
Version:	2.4.4
Hardware:	PC Linux

Importance:	P2 major with 6 votes (vote)
Target Milestone:	---
Assignee:	Apache HTTPD Bugs Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-04-16 08:52 UTC by Arkadiusz Miskiewicz
Modified:	2016-06-16 08:57 UTC (History)
CC List:	1 user (show)

Attachments
Only check idle (653 bytes, patch) 2013-11-15 15:30 UTC, Jim Jagielski	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Arkadiusz Miskiewicz 2013-04-16 08:52:07 UTC

Sometimes graceful restart can take up to few minutes here (time between "Graceful restart requested, doing restart" and "resuming normal operations") when using prefork MPM.

I've tracked this to such scenario:


1) main process on gracefull restart calls 

/* kill off the idle ones */
ap_mpm_pod_killpg(pod, retained->max_daemons_limit);

and that does dummy_connection() for each pod. This works fast
(1-2s) in most cases where childrens exist but if children
are gone then it can take ages. Why children are gone and why it takes ages? Below.

2) children were busy and notice that new generation is starting and
thus exit themself without any need to dummy_connection to "ping" them:

        else if (retained->my_generation !=
                 ap_scoreboard_image->global->running_generation) { /* 
restart? */
            /* yeah, this could be non-graceful restart, in which case the
             * parent will kill us soon enough, but why bother checking?
             */
            die_now = 1;
        }

3) Now such scenario can happen? Well in my case main process works
slowly enough that it doesn't finish doing ap_mpm_pod_killpg before
all children exit like in 2). Basically in middle of ap_mpm_pod_killpg()
loop all children already exited. The rest of loop takes then long time.

4) so we are ending up in situation where main process is in middle
of ap_mpm_pod_killpg while all children already exited in 2). Main
process is continuing to do dummy_connection and these connect()s
+ dummy data send, polling are taking very long time (like 1-2s for each
dummy_connection * num of these connections). This is the primary reason
why graceful is painfully slow here (1-5 minutes) sometimes.

5) why connect() and socket sending in dummy_connect succeed if there
are no children? If it failed then ap_mpm_pod_killpg loop would break
and things would end quite fast.

Since all children exited there is only main process left that still
holds listening socket and connection go to this socket. But main process
doesn't call accept() and these connections are not processed. dummy_donnection
sending doesn't fail due to the way tcp works. So loop in ap_mpm_pod_killpg
still get processed but each step takes 1-2s * 64 or more StartServers setting.

Tested the listening vs accept theory by doing external telnet connection
to the same IP/port dummy_connection is doing - telnet connect() succeeded
but telnet data is not processed since main process is not doing accept() etc yet.

6) possible solution - detect that children already exited and stop doing pointless/time consuming dummy_connection queries.

void ap_mpm_pod_killpg(ap_pod_t *pod, int num)
{
    int i;
    int max_daemons;
    apr_status_t rv = APR_SUCCESS;

    sleep(5); // THIS MAKES REPRODUCING THIS PROBLEM VERY EASY THING
              // SINCE REPRODUCING HEAVILY RELIES ON TIMMING

    /* we don't write anything to the pod here...  we assume
     * that the would-be reader of the pod has another way to
     * see that it is time to die once we wake it up
     *
     * writing lots of things to the pod at once is very
     * problematic... we can fill the kernel pipe buffer and
     * be blocked until somebody consumes some bytes or
     * we hit a timeout...  if we hit a timeout we can't just
     * keep trying because maybe we'll never successfully
     * write again...  but then maybe we'll leave would-be
     * readers stranded (a number of them could be tied up for
     * a while serving time-consuming requests)
     */
  
    for (i = 0; i < num && rv == APR_SUCCESS; ++i) {
        process_score *ps = ap_get_scoreboard_process(i);
        pid_t pid = ps->pid;

        if (pid == 0) {
            continue; /* not every scoreboard entry is in use */
        }

        rv = dummy_connection(pod);
    }

Something like this - so basically do dummy_connection() as long as we have childrens living. Unfortunately not sure if this approach is always correct (I'm not sure if scoreboard is updated correctly when children exit in such case).



7) how to reproduce?

Set StartServers to big number (64 or more), so ap_mpm_pod_killpg will have this number passed in "num" parameter and thus will have bigger number of loop iterations to do.

Main process needs to be slown down while doing ap_mpm_pod_killpg(), so that children have a chance to exit in 2) before ap_mpm_pod_killpg finishes. In my production setup this simply happens. Since it's timing dependant it is not always easy to reproduce.

Anyway how we can help it be reproduced:
- add sleep(5); at beginning of mpm_pod_killpg. This simulates a case when main process does its things in slower way than children exiting

So:
- add sleep or slow down main process in some other way (stracing main process in my case was also being enough to slown it down)
- ab -n 100000 -c 100 http://apache.ip
- before ab finishes initiate graceful restart

Relevant thread on mailing list
http://mail-archives.apache.org/mod_mbox/httpd-dev/201304.mbox/browser
(unfortunately not a popular subjects among devs, so creating this bug report,
so it won't get lost)

Comment 1 jan 2013-05-30 18:10:04 UTC

I can confirm this bug.

I’m running apache 2.2.x (24 now) with about 250 connections open all the time. Reloading apache configuration (apachectl graceful) is very slow and takes 1-4 minutes. During that time new connections are established but webpages are not served. Monitoring programs with a timeout will detect this as apache crash and try to restart it which makes situation even more complicated. I’ve seen this issue for years now.

I’ve applied Arkadiusz Miskiewicz patch for 2.2.24 (http://comments.gmane.org/gmane.comp.apache.devel/50249) and it makes quite a difference. Graceful restarts are now executed in 4-12 seconds. I personaly think it is still to long. Webpages on webserver should not be inaccessible due to graceful restart even for 2 seconds. There great is need for better way of reloading configuration and “graceful” was always quick and very dirty walkaround.

Comment 2 Jim Jagielski 2013-11-15 14:55:34 UTC

Certainly, dummy_connection() setting a 3 *second* timeout is bogus...

Will look...

Comment 3 Jeff Trawick 2013-11-15 15:07:32 UTC

At least for worker, "ap_mpm_pod_killpg(pod, ap_daemons_limit, TRUE);" should probably use the usually-smaller ap_max_daemons_limit.  (That's from looking at 2.2 worker.c recently; I didn't check other MPMs/branches.)  This could help some.

It is important for users to check the time between these two messages so that it is clear it is the same problem.

SIGUSR1 received.  Doing graceful restart 
Apache (Unix) configured -- resuming normal operations

Comment 4 Jim Jagielski 2013-11-15 15:30:02 UTC

Created attachment 31046 [details]
Only check idle

Comment 5 Jim Jagielski 2013-11-15 18:39:37 UTC

Patch in trunk

Comment 6 odosk 2016-06-16 08:57:45 UTC

To avoid any confusion this is first fixed in 2.4.10.