Bug 29709 - Error in pool management on multiprocessor envoirment
Summary: Error in pool management on multiprocessor envoirment
Status: CLOSED INVALID
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: All (show other bugs)
Version: 2.0.47
Hardware: PC All
: P3 critical with 3 votes (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-06-21 09:11 UTC by Gabriel Kalkuhl
Modified: 2004-11-29 16:26 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Gabriel Kalkuhl 2004-06-21 09:11:29 UTC
Hello !
It seems there is a bug in the memory management of apache withhin a 
multiprocessor envoirment.
First I will introduce my envoirment:
I use apache 2.0.47 in default configuration (build with the workspace shipped 
by apache.org).
The bug was reproduced on Windows NT4 with dual Processor, W2K with dual 
Processor and W2K with Hyperthreading processor.
For reproducing the bug I modified the mod_example as followed:
In x_handler() I insert a Sleep(1000); for simulating a long operation during 
request like this:
--------------------------------------
ap_rprintf(r, "  Apache HTTP Server version: \"%s\"\n", ap_get_server_version
());
ap_rputs("  <BR>\n", r);

/* Simulating a long operation */
Sleep(1000);

ap_rprintf(r, "  Server built: \"%s\"\n", ap_get_server_built());
--------------------------------------
Then I made heavy request to normal URL’s and to mod_example simultaneous.

After a while apache will fail at random in two different states. The first 
state is a access violation. The Log comes as follows

[Thu Jun 17 16:45:47 2004] [notice] Parent: child process exited with status 
3221225477 -- Restarting.
[Thu Jun 17 16:45:50 2004] [notice] Parent: Created child process 3464
[Thu Jun 17 16:45:51 2004] [debug] mpm_winnt.c(505): Parent: Sent the 
scoreboard to the child
[Thu Jun 17 16:45:53 2004] [notice] Child 3464: Child process is running
[Thu Jun 17 16:45:53 2004] [info] Parent: Duplicating socket 404 and sending it 
to child process 3464
[Thu Jun 17 16:45:53 2004] [debug] mpm_winnt.c(426): Child 3464: Retrieved our 
scoreboard from the parent.
[Thu Jun 17 16:45:53 2004] [debug] mpm_winnt.c(623): Parent: Sent 1 listeners 
to child 3464
[Thu Jun 17 16:45:53 2004] [debug] mpm_winnt.c(582): Child 3464: retrieved 1 
listeners from parent
[Thu Jun 17 16:45:53 2004] [notice] Child 3464: Acquired the start mutex.
[Thu Jun 17 16:45:54 2004] [notice] Child 3464: Starting 25 worker threads.

Where status is 0xC0000005 which means Access Violation.

The access violation happens in apr_pool_walk_tree() while accessing
child = pool->child;

At the same time another thread tries to free the pool memory.

I have no stack backtrace for this at the moment but I can try to make one, if 
there is a intrest for it.

The other failure is that apache stops responding and stays with a processor 
load of about 50%.
A break into code shows, that the pool is damaged.
One thread stays in allocator_free() and tries to free pool memory but the
next pointer of the actual node points to itself so there is a recursion where
apache never gets out.
This happens at this point:

do {
        next = node->next;
        index = node->index;

Where the node has the following content:

next = 0x007dadd8
node->index = 1
node->next = 0x007dadd8
node->ref = 0x007dadd8
node->free_index = 3452816845

The Thread backtrace looks like this:

allocator_free(apr_allocator_t * 0x00773dd8, apr_memnode_t * 0x007dadd8) line 
362 + 6 bytes
apr_pool_destroy(apr_pool_t * 0x007d8db8) line 797 + 13 bytes
trace_add(server_rec * 0x0077c290, request_rec * 0x00000000, x_cfg * 
0x007b7c18, const char * 0x10014908 `string') line 408 + 15 bytes
x_insert_filter(request_rec * 0x007d2d48) line 997 + 23 bytes
ap_run_insert_filter(request_rec * 0x007d2d48) line 121 + 31 bytes
ap_invoke_handler(request_rec * 0x6ff09466) line 374
ap_process_http_connection(conn_rec * 0x6ff03f8f) line 293 + 6 bytes
ap_run_process_connection(conn_rec * 0x007f51c8) line 85 + 31 bytes
ap_process_connection(conn_rec * 0x007f51c8, void * 0x007f5100) line 211 + 6 
bytes
worker_main(long 2013300156) line 731
MSVCRT! 780085bc()

Another Thread stays in this stack state:

NTDLL! 77894091()
NTDLL! 778922f8()
allocator_alloc(apr_allocator_t * 0x00773dd8, unsigned int 8192) line 242
apr_pool_create_ex(apr_pool_t * * 0x007d51dc, apr_pool_t * 0x007d4d48, int (int)
* 0x00000000, apr_allocator_t * 0x00773dd8) line 829 + 14 bytes
core_output_filter(ap_filter_t * 0x6ff182c1, apr_bucket_brigade * 0x007d5198) 
line 4108
ap_pass_brigade(ap_filter_t * 0x007d5198, apr_bucket_brigade * 0x007ed5a0) line 
550 + 7 bytes
ap_http_header_filter(ap_filter_t * 0x6ff182c1, apr_bucket_brigade * 
0x007ab1f0) line 1695
ap_pass_brigade(ap_filter_t * 0x007ab1f0, apr_bucket_brigade * 0x007ed408) line 
550 + 7 bytes
ap_content_length_filter(ap_filter_t * 0x6ff182c1, apr_bucket_brigade * 
0x007ab1d8) line 1252 + 20 bytes
ap_pass_brigade(ap_filter_t * 0x007ab1d8, apr_bucket_brigade * 0x007ed408) line 
550 + 7 bytes
ap_byterange_filter(ap_filter_t * 0x6ff182c1, apr_bucket_brigade * 0x007ab1c0) 
line 3036 + 5 bytes
ap_pass_brigade(ap_filter_t * 0x007ab1c0, apr_bucket_brigade * 0x007ed408) line 
550 + 7 bytes
ap_old_write_filter(ap_filter_t * 0x007ed3f0, apr_bucket_brigade * 0x007ed528) 
line 1321 + 10 bytes
end_output_stream(request_rec * 0x007aa550) line 1039 + 29 bytes
ap_finalize_request_protocol(request_rec * 0x007aa550) line 1061 + 6 bytes
ap_send_error_response(request_rec * 0x6ff0d26f, int 404) line 2423 + 6 bytes
ap_die(int 1878053609, request_rec * 0x00000000) line 232 + 11 bytes
ap_process_request(request_rec * 0x007aa550) line 311
ap_process_http_connection(conn_rec * 0x6ff03f8f) line 293 + 6 bytes
ap_run_process_connection(conn_rec * 0x007d4e48) line 85 + 31 bytes
ap_process_connection(conn_rec * 0x007d4e48, void * 0x007d4d80) line 211 + 6 
bytes
worker_main(long 2013300156) line 731
MSV

and another Thread looks like this:

allocator_alloc(apr_allocator_t * 0x00773dd8, unsigned int 8192) line 242
apr_pool_create_ex(apr_pool_t * * 0x0156ff28, apr_pool_t * 0x007d6d80, int (int)
* 0x00000000, apr_allocator_t * 0x00773dd8) line 829 + 14 bytes
ap_read_request(conn_rec * 0x6ff09431) line 848
ap_process_http_connection(conn_rec * 0x6ff03f8f) line 286 + 6 bytes
ap_run_process_connection(conn_rec * 0x007d6e80) line 85 + 31 bytes
ap_process_connection(conn_rec * 0x007d6e80, void * 0x007d6db8) line 211 + 6 
bytes
worker_main(long 2013300156) line 731
MSVCRT! 780085bc()

all other threads stays in winnt_get_connection()

So in my oppinion, there is a race condition where one thread tries to free the 
pool memory and another thread tries to access this memory at the same time.
It seems that this behavior can only be reproduced on a multiprocessor or 
hyperthreading machine. I don’t know if it affects to other OS then Windows.

Greetings
Gabriel Kalkuhl
Comment 1 David Lane 2004-10-25 23:19:02 UTC
Hello, 
I am also having the same issue.
If i change the ThreadsPerChild then i am able to vary how long the server 
will run before restarting.
While it restarts it fails to serve any pages and returns error in browser.

System
-------------------
Dell Dual P3 1gz
Win2k with all updates as of 10/25/2004
Apache 2.0.52
OpenSSL 0.9.7d
PHP (tried many version and all have same issue from 4.3.2 - 5.0.2 and 5.1.x 
CVS)

Any advice or help in focusing this issue would be very helpfull. 

Thank You,
David



[Sun Oct 24 11:52:37 2004] [notice] Parent: child process exited with status 
1 -- Restarting.
[Sun Oct 24 11:52:40 2004] [notice] Parent: Created child process 4060
[Sun Oct 24 11:52:40 2004] [notice] Disabled use of AcceptEx() WinSock2 API
[Sun Oct 24 11:52:43 2004] [notice] Child 4060: Child process is running
[Sun Oct 24 11:52:43 2004] [notice] Child 4060: Acquired the start mutex.
[Sun Oct 24 11:52:43 2004] [notice] Child 4060: Starting 1920 worker threads.
[Sun Oct 24 11:52:44 2004] [notice] Child 4060: Listening on port 443.
[Sun Oct 24 11:52:44 2004] [notice] Child 4060: Listening on port 80.
[Sun Oct 24 11:52:44 2004] [notice] Child 4060: Listening on port 80.
[Sun Oct 24 11:58:17 2004] [notice] Parent: child process exited with status 
1 -- Restarting.
[Sun Oct 24 11:58:20 2004] [notice] Parent: Created child process 10080
[Sun Oct 24 11:58:20 2004] [notice] Disabled use of AcceptEx() WinSock2 API
[Sun Oct 24 11:58:23 2004] [notice] Child 10080: Child process is running
[Sun Oct 24 11:58:23 2004] [notice] Child 10080: Acquired the start mutex.
[Sun Oct 24 11:58:23 2004] [notice] Child 10080: Starting 1920 worker threads.
[Sun Oct 24 11:58:23 2004] [notice] Child 10080: Listening on port 443.
[Sun Oct 24 11:58:23 2004] [notice] Child 10080: Listening on port 80.
[Sun Oct 24 11:58:23 2004] [notice] Child 10080: Listening on port 80.

Comment 2 Gabriel Kalkuhl 2004-10-26 10:02:04 UTC
Hello David!

I think many of the listed PHP errors for Apache 2 depends on this Bug.
I wonder why Apache team members ignore this bug!
However for fixing it, you have to change the memory management to protect the 
pool memory.
I've not the time to do this and it's a bit dangerous because of race 
conditions and dead lock states.
This bug should be fixed by someone who created this pool management.

The Bug fix for me was to completely switch back to Apache 1.

Sorry, not really helpful
Gabriel
Comment 3 Joe Orton 2004-10-26 14:07:11 UTC
apr_pool_destroy(apr_pool_t * 0x007d8db8) line 797 + 13 bytes
trace_add(server_rec * 0x0077c290, request_rec * 0x00000000, x_cfg *...)

"Then I made heavy request to normal URL’s and to mod_example simultaneous."

it looks very much like mod_example is not thread-safe; so this may just be a
good demonstration of that.  Do you have an example of a crash using any module
other than mod_example?

Comment 4 David Lane 2004-10-26 14:56:26 UTC
I agree with switching back to apache 1. 
I'm currently in the process of compiling the latest stable version of apache 
1 with ssl after getting a few emails from some colleagues of mine with the 
same issue.

It is unfortunate that this issue is so ingrained into the apache2 version. It 
may just be my imagination, but I believe that the apache team is focusing on 
apache2. If this is true then for people like us to upgrade this is a 
mandatory fix, I just can’t have customers getting errors every 5.5 minutes.

Thnx for the fast response and I would love to know if you or the Apache team 
ever decides this is really an issue they would like to fix.

- David
Comment 5 Joe Orton 2004-10-26 15:01:57 UTC
All that the "Child process exited with status 3221225477 -- Restarting."
messages means is that a child process crashed.  This may be because of a bug in
a module, or it may be because of a bug in httpd -- it's impossible to tell
without a backtrace.  If you can either:

a) get a simple reproduction case for the problem, or
b) get a backtrace from a crash

there is a chance of looking at this further.
Comment 6 Jeff Trawick 2004-10-26 15:13:02 UTC
I concur with Joe's comment that mod_example is not thread-safe.  (Any threaded
Apache with mod_example should be expected to crash.)  As he says, we need a
recipe to reproduce, or other crash data, from a configuration without mod_example .
Comment 7 Gabriel Kalkuhl 2004-10-26 15:14:16 UTC
Hello Joe,

yes, the error appeared to me first by using mod_cplusplus with EasySoap++.
I implemented a SOAP server with this modules.
My module worked fine for some time. On the first dual processor machine we 
installed this software it crashed.
After some reviews in the cplusplus module and EasySoap++ I found that the 
error appers in the apache pool management.
For better demonstration I implemented the delay in mod_example. It will cras 
as soon as you make simultanous requests to normal http service and mod_example.

By the way, I saw some other bug reports from poeple using PHP with the same 
behavior (see bug 24095).

Are there some restrictions on making 'long' operations during request handling?

Gabriel
Comment 8 Gabriel Kalkuhl 2004-10-26 20:39:18 UTC
Sorry, comments where made at the same time.
Sorry Jeff, but I'm a bit confused about hearing that mod_example isn't thread 
safe.
Mod_example doesn't do anything like piping some text to the output.
The only modification I've done is to insert a sleep between the output lines 
so the module behaves like doing a slow database acces (which originaly was my 
intention). Mod_example doesn't crash if I remove the sleep.
If this is not thread safe, then all modules which generate some text output 
aren't thread safe.

Gabriel
Comment 9 Joe Orton 2004-10-26 20:46:35 UTC
Look at mod_example's trace_add function and how it stores pointers to pools in
global variables - it is not a well-behaved example module at all!  The first
thing to do must be to take mod_example completely out of the picture, and try
and trigger whatever problem it is you're seeing.

Yes, PHP users report crashes too, sometimes these are PHP bugs and sometimes not.  
Comment 10 Paul Querna 2004-11-04 03:36:33 UTC
Bug #32051 is a tracking bug for mod_example not being thread safe.

This bug is invalid because the issue comes from a user basing their module on
mod_example.
Comment 11 Gabriel Kalkuhl 2004-11-04 07:41:33 UTC
Your right!

I removed all the trace_add stuff in mod_example and it works fine.
No more crashes anymore!

By the way, I'm a bit dissapointed about mod_example.
Normaly example means, that someone wíll show how to do something.
In this case mod_example should be titled: "Here we show you, how to write bad 
code and how you shouldn't write modules".

Now I can go on searching on mod_cplusplus, which is only third party for 
apache.

Thanks for spending time on my mistakes!

Gabriel
Comment 12 Nick Kew 2004-11-04 10:08:34 UTC
I've implemented the one-minute fix you suggested, by adding a warning to
mod_example

http://cvs.apache.org/viewcvs.cgi/httpd-2.0/modules/experimental/mod_example.c