Bug 48939 - Add ability to force workers into PROXY_WORKER_IN_ERROR when configured statuses are found
Summary: Add ability to force workers into PROXY_WORKER_IN_ERROR when configured statuses are found
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_balancer
Version: 2.2.15
Hardware: Sun All
Importance: P2 enhancement with 1 vote
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk, NeedsReleaseNote, PatchAvailable
Duplicates: 47207
Depends on:
Blocks:
 
Reported: 2010-03-18 18:24 UTC by Daniel Ruggeri
Modified: 2010-11-24 23:32 UTC
CC List: 2 users



Attachments
Code modifications to support initial proposal (3.09 KB, patch) - 2010-03-18 18:32 UTC, Daniel Ruggeri
Update (3.09 KB, application/octet-stream) - 2010-03-18 21:02 UTC, Daniel Ruggeri
Final proposed patch (3.19 KB, patch) - 2010-03-19 18:37 UTC, Daniel Ruggeri
Final patch for 2.2 branch (4.00 KB, patch) - 2010-07-20 10:43 UTC, Daniel Ruggeri
Fixes the fix to be "failonstatus" (4.00 KB, patch) - 2010-08-21 17:25 UTC, Daniel Ruggeri

Description Daniel Ruggeri 2010-03-18 18:24:22 UTC
Greetings Apache Devs,
   A bug (47207) was filed in the middle of last year with the expectation that HTTPD should mark servers in error state when a 500 status is returned. While I disagree that this should be universal behavior, we have found ourselves in situations where we need the ability to mark members as PROXY_WORKER_IN_ERROR automatically when certain status codes are returned. This varies heavily by application and backend, but it is a feature worth adding. I feel this should be a separate bug report since it adds a server administrator configuration parameter rather than applying the change proposed in 47207.

Problem examples:
Apache HTTPD server as reverse proxy to two WebSphere application servers
Some applications take a significant amount of time to initialize. Until they are done initializing, WebSphere will rightfully return a 503. Depending on how long the application takes to initialize (30 minutes in the extreme case of in-memory databases), this can inadvertently leave a member in service during a period when it cannot service requests.

Apache HTTPD server as reverse proxy to another HTTPD reverse proxy
Sometimes DMZ segments are broken up such that only reverse proxies with specific ProxyPass rules are allowed to traverse firewalls. Additionally, the data carried by those proxies is sometimes too sensitive to send in cleartext, so SSL may be needed. In some situations (expired server certificate, expired client certificate, misconfiguration), the target (second) HTTPD reverse proxy will throw a 502 that bubbles up to the first reverse proxy. Marking that proxy as unusable would be beneficial.

Apache HTTPD server as reverse proxy to any WebSphere application server
In WebSphere, it is possible to have an application deployed but not running. When an application is deployed but not started, its context root is not bound in the web container, and a request for that context root returns a 404. While it would be insane to mark a member out of service on every 404, this illustrates another use case.

Apache HTTPD server to any backend
Some folks are brazen enough to say their application has handled all possible error conditions, so a 500 returned to the user means the application server must be at fault and should be taken out of service. Again, this is madness, but there are other use cases (testing scripts) where taking instances out of service because of a 500 may be desirable.

The proposed solution:
Add a configuration directive for balancers called "ErrorOnStatus", with usage like:
ErrorOnStatus=501,502,503,504
Construct an apr_array_header_t of the configured codes and check it during the currently unused proxy_balancer_post_request function, as sketched below.
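
To make the proposal concrete, here is a rough sketch of the two pieces involved. This is only an illustration of the approach, not the attached patch; the errstatuses field name and the surrounding parameter-parsing context are assumptions.

/* Illustrative sketch only, not the attached patch. Assumes the
 * mod_proxy_balancer.c context (mod_proxy.h, apr_strings.h) and a
 * hypothetical apr_array_header_t *errstatuses member on proxy_balancer. */

/* 1. Parse the directive value, e.g. ErrorOnStatus=501,502,503,504,
 *    inside the balancer parameter handling (p, key, val in scope). */
else if (!strcasecmp(key, "erroronstatus")) {
    char *codes = apr_pstrdup(p, val);
    char *tok, *ctx;
    balancer->errstatuses = apr_array_make(p, 4, sizeof(int));
    for (tok = apr_strtok(codes, ", ", &ctx); tok;
         tok = apr_strtok(NULL, ", ", &ctx)) {
        *(int *)apr_array_push(balancer->errstatuses) = atoi(tok);
    }
}

/* 2. In proxy_balancer_post_request(), force the worker that served the
 *    request into error state when the response status matches a code. */
if (balancer->errstatuses) {
    int i;
    for (i = 0; i < balancer->errstatuses->nelts; i++) {
        int code = ((int *)balancer->errstatuses->elts)[i];
        if (r->status == code) {
            ap_log_error(APLOG_MARK, APLOG_ERR, 0, r->server,
                         "%s: Forcing worker (%s) into error state "
                         "due to status code %d match",
                         balancer->name, worker->name, code);
            worker->s->status |= PROXY_WORKER_IN_ERROR;
            worker->s->error_time = apr_time_now();
            break;
        }
    }
}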

My thoughts:
I chose apr_array_header_t because it does not impose a particular datatype on its elements. My preference would have been apr_hash_t, since lookups would presumably be faster, but I am concerned that it expects character-array datatypes for the key and value.
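
For comparison, a hash-based lookup keyed directly on the integer status code is sketched below; this is purely illustrative and not part of the patch. apr_hash_set and apr_hash_get take an explicit key length, so the key does not have to be a character string.

/* Illustrative only: an apr_hash_t keyed on the int status code. */
apr_hash_t *errstatuses = apr_hash_make(p);
int code = 503;
apr_hash_set(errstatuses, apr_pmemdup(p, &code, sizeof(code)), sizeof(code), "on");

/* Later, when checking a response: */
if (apr_hash_get(errstatuses, &r->status, sizeof(r->status))) {
    /* The status matched a configured code; mark the worker in error. */
}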

I am tidying up the patch now and will attach it soon.
Comment 1 Daniel Ruggeri 2010-03-18 18:32:24 UTC
Created attachment 25148 [details]
Code modifications to support initial proposal
Comment 2 Daniel Ruggeri 2010-03-18 21:02:12 UTC
Created attachment 25150 [details]
Update

I noticed a mistake in the log message: it was logging the name of the balancer instead of the worker.
Comment 3 Daniel Ruggeri 2010-03-19 18:37:29 UTC
Created attachment 25153 [details]
Final proposed patch

Update to set error_time in the proxy_worker_stat now that testing is complete.
Comment 4 Daniel Ruggeri 2010-03-19 18:43:51 UTC
The final patch has been added, and the functionality has been tested as follows:

<Proxy balancer://App_cluster>
   BalancerMember http://127.0.0.1:8001 route=1
   BalancerMember http://127.0.0.1:8002 route=2
   ProxySet lbmethod=byrequests stickysession=App_STICKY nofailover=Off erroronstatus=500,502
</Proxy>

127.0.0.1 (both ports) is answered by an Apache instance with a valid CGI script and a buggy CGI script (which causes a 500).

Continuous hits to the valid script:
   [1]http://127.0.0.1:8001 1                1      0   Ok     20      6.9K 3.1K
   [2]http://127.0.0.1:8002 2                1      0   Ok     18      6.2K 3.7K

One hit to the script that generates the 500 (500 returned to the browser):
   [1]http://127.0.0.1:8001 1                1      0   Err    21      7.3K 3.7K
   [2]http://127.0.0.1:8002 2                1      0   Ok     18      6.2K 3.7K

Several hits to the valid script before the 60-second retry time:
   [1]http://127.0.0.1:8001 1                1      0   Ok     34      12K  3.9K
   [2]http://127.0.0.1:8002 2                1      0   Err    20      6.9K 4.8K

Two hits after the retry time expired:
   [1]http://127.0.0.1:8001 1                1      0   Ok     35      12K  3.9K
   [2]http://127.0.0.1:8002 2                1      0   Ok     21      7.3K 4.8K


At the same time, I ran a test case with several hits to the buggy CGI script; as expected, the force_recovery function forced the traffic through.
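
For context, force_recovery kicks in when every member of the balancer is in error state: rather than failing all requests, the error flags are cleared so traffic is still attempted. A simplified paraphrase of that idea (not the exact mod_proxy_balancer source) looks like:

/* Simplified paraphrase of the force_recovery idea, not the actual source. */
int i, any_usable = 0;
proxy_worker *w = (proxy_worker *)balancer->workers->elts;
for (i = 0; i < balancer->workers->nelts; i++) {
    if (!(w[i].s->status & PROXY_WORKER_IN_ERROR)) {
        any_usable = 1;
        break;
    }
}
if (!any_usable) {
    /* Every member is errored: clear the flag so requests still flow. */
    for (i = 0; i < balancer->workers->nelts; i++) {
        w[i].s->status &= ~PROXY_WORKER_IN_ERROR;
    }
}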
Comment 5 Nick Kew 2010-04-01 22:50:48 UTC
Thanks for the patch. Committed to trunk in r930125.
Comment 6 Nick Kew 2010-04-01 22:52:46 UTC
*** Bug 47207 has been marked as a duplicate of this bug. ***
Comment 7 Daniel Ruggeri 2010-07-20 10:43:38 UTC
Created attachment 25788 [details]
Final patch for 2.2 branch

This is the final patch, including documentation, for the 2.2 branch.
Comment 8 Daniel Ruggeri 2010-08-21 17:25:38 UTC
Created attachment 25923 [details]
Fixes the fix to be "failonstatus"
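
With this rename, the balancer parameter is spelled failonstatus rather than erroronstatus. Assuming the same comma-separated syntax, the test configuration from comment 4 would become:

<Proxy balancer://App_cluster>
   BalancerMember http://127.0.0.1:8001 route=1
   BalancerMember http://127.0.0.1:8002 route=2
   ProxySet lbmethod=byrequests stickysession=App_STICKY nofailover=Off failonstatus=500,502
</Proxy>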
Comment 9 Daniel Ruggeri 2010-11-24 23:32:49 UTC
Added in 2.2.17.