Bug 52567 - Worker recovery state does not properly persist if no traffic is received
Status: RESOLVED FIXED
Alias: None
Product: Tomcat Connectors
Classification: Unclassified
Component: Common
Version: 1.2.32
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-31 20:21 UTC by Aaron Ogburn
Modified: 2012-01-31 21:51 UTC
Description Aaron Ogburn 2012-01-31 20:21:23 UTC
I've noticed an issue with the worker recovery state.  If a worker receives no traffic after it goes into recovery mode, it flips back into full error mode on the next worker maintenance call.  This can be a problem in low-traffic mod_jk load balancing configurations with multiple httpd servers and no session replication/failover.

If traffic is unlucky enough to hit the worker only when it has flipped back into error mode, the worker never gets a chance to recover.  Checking the relevant code, I see the cause of this behavior in recover_workers:

        else if (w->s->error_time > 0 &&
                 (int)difftime(now, w->s->error_time) >= p->error_escalation_time) {
            if (JK_IS_DEBUG_LEVEL(l))
                jk_log(l, JK_LOG_DEBUG,
                       "worker %s escalating local error to global error",
                       w->name);
            w->s->state = JK_LB_STATE_ERROR;
        }

A worker in recovery mode still has error_time set, and the difftime against it is greater than or equal to error_escalation_time, so it falls into the "escalating local error to global error" block and moves back to full error mode. This issue can probably be worked around with other config options or through administrative action in jkstatus, but it is inconsistent with the expected/intended behavior and looks like an easy fix.  It seems this could be corrected with an additional check that the worker state is not JK_LB_STATE_RECOVER, for example:

        else if (w->s->error_time > 0 &&
                 (int)difftime(now, w->s->error_time) >= p->error_escalation_time) {
            if (w->s->state != JK_LB_STATE_RECOVER) {
                if (JK_IS_DEBUG_LEVEL(l))
                    jk_log(l, JK_LOG_DEBUG,
                           "worker %s escalating local error to global error",
                           w->name);
                w->s->state = JK_LB_STATE_ERROR;
            }
        }
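
The flip-back described above can be reproduced in isolation with a minimal sketch. The types and the function escalate_if_stale below are hypothetical, simplified stand-ins invented for illustration; they are not the real mod_jk definitions, and only the condition itself is taken from the quoted code.

```c
#include <time.h>

/* Hypothetical stand-ins for the mod_jk worker state; NOT the real types. */
typedef enum { STATE_OK, STATE_RECOVER, STATE_ERROR } lb_state_t;

typedef struct {
    lb_state_t state;
    time_t     error_time;   /* time of the first error, 0 if none */
} sub_worker_t;

/* Models the escalation branch for a worker that is not in full error
 * state. With guard_recover == 0 it behaves like the quoted code; with
 * guard_recover != 0 a worker in recovery state is left untouched, as in
 * the proposed change. */
static void escalate_if_stale(sub_worker_t *w, time_t now,
                              int error_escalation_time, int guard_recover)
{
    if (w->error_time > 0 &&
        (int)difftime(now, w->error_time) >= error_escalation_time) {
        if (guard_recover && w->state == STATE_RECOVER)
            return;              /* keep waiting for a probing request */
        w->state = STATE_ERROR;  /* escalate local error to global error */
    }
}
```

Calling escalate_if_stale on a worker in STATE_RECOVER whose error_time is older than the escalation time moves it to STATE_ERROR when the guard is off and leaves it in STATE_RECOVER when the guard is on.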
Comment 1 Rainer Jung 2012-01-31 21:51:36 UTC
Thanks for analyzing and reporting this.

I used a slightly different patch, moving the condition into the surrounding if check. That way a worker in recovery state is correctly counted in non_error and does not trigger an additional forced recovery.
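
One possible reading of that variant is sketched below. This is not the code from r1238823: the types, the function maintain, and the simplified branch bodies are assumptions made for illustration. The point is only that a recovering worker fails the escalation condition and falls through to the branch that increments the non_error counter.

```c
#include <time.h>

/* Hypothetical stand-ins for the mod_jk worker state; NOT the real types. */
typedef enum { STATE_OK, STATE_RECOVER, STATE_ERROR } lb_state_t;

typedef struct {
    lb_state_t state;
    time_t     error_time;   /* time of the first error, 0 if none */
} sub_worker_t;

/* Simplified maintenance pass. Returns the number of workers that are
 * counted as usable (the non_error counter mentioned in the comment). */
static int maintain(sub_worker_t *workers, int n, time_t now,
                    int recover_wait_time, int error_escalation_time)
{
    int non_error = 0;
    for (int i = 0; i < n; i++) {
        sub_worker_t *w = &workers[i];
        if (w->state == STATE_ERROR) {
            /* After the wait time, let the worker be probed again. */
            if ((int)difftime(now, w->error_time) > recover_wait_time) {
                w->state = STATE_RECOVER;
                non_error++;
            }
        }
        else if (w->state != STATE_RECOVER &&   /* condition moved here */
                 w->error_time > 0 &&
                 (int)difftime(now, w->error_time) >= error_escalation_time) {
            w->state = STATE_ERROR;
        }
        else {
            /* Recovering workers now land here and are counted. */
            non_error++;
        }
    }
    return non_error;
}
```

With the check nested inside the block instead, a recovering worker would still enter the escalation branch and be skipped by the counter, which is the situation the comment describes as triggering an additional forced recovery.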

Fixed in r1238823, will be part of version 1.2.33.

Regards,

Rainer