Remark: This *may* be a duplicate of various reported bugs having to do with crashing worker processes under Win2K/XP with log entry "Parent: child process exited with status 3221225477 -- Restarting." Description: Sometimes the winnt/child.c worker thread/process crashes in memory/unix/apr_pools.c, line 323 (next = node->next;) when accessing the server via SSL using mod_ssl with OpenSSL 0.9.7d. The worker is then restarted by the main process (with the error log entry above). The error location in detail (all infos obtained using MS VC++ 6.0 debugger): static APR_INLINE void allocator_free(apr_allocator_t *allocator, apr_memnode_t *node) { /* *** (code deleted) *** */ /* Walk the list of submitted nodes and free them one by one, * shoving them in the right 'size' buckets as we go. */ do { next = node->next; <<==== CRASHES *HERE* index = node->index; The crash happens because 'node' (the second parameter) equals NULL. allocator_free() is called by apr_pool_clear() (same file, line 713). Obviously, the pool specified as the function parameter has pool->active->next == NULL. The complete stack trace is as follows: allocator_free(apr_allocator_t * 0x0087ff88, apr_memnode_t * 0x00000000) line 323 + 3 bytes apr_pool_clear(apr_pool_t * 0x00ce8b80) line 713 + 18 bytes core_output_filter(ap_filter_t * 0x00cc0d28, apr_bucket_brigade * 0x00cbdc68) line 4166 ap_pass_brigade(ap_filter_t * 0x00cc0d28, apr_bucket_brigade * 0x00cbdc68) line 511 + 16 bytes bio_filter_out_flush(bio_st * 0x00cbbbc0) line 155 + 25 bytes bio_filter_out_write(bio_st * 0x00cbbbc0, const char * 0x00cd1e30, int 23) line 220 + 9 bytes BIO_write(bio_st * 0x00cbbbc0, const void * 0x00cd1e30, int 23) line 201 + 20 bytes ssl3_write_pending(ssl_st * 0x00cbb6c8, int 21, const unsigned char * 0x00cbb9c4, unsigned int 2) line 742 + 50 bytes do_ssl3_write(ssl_st * 0x00cbb6c8, int 21, const unsigned char * 0x00cbb9c4, unsigned int 2, int 0) line 713 + 21 bytes ssl3_dispatch_alert(ssl_st * 0x00cbb6c8) line 1282 + 27 bytes ssl3_send_alert(ssl_st * 0x00cbb6c8, int 1, int 0) line 1271 + 9 bytes ssl3_shutdown(ssl_st * 0x00cbb6c8) line 1650 + 13 bytes SSL_shutdown(ssl_st * 0x00cbb6c8) line 895 + 13 bytes SSL_smart_shutdown(ssl_st * 0x00cbb6c8) line 188 + 9 bytes ssl_filter_io_shutdown(ssl_filter_ctx_t * 0x00caf8e8, conn_rec * 0x00caf568, int 0) line 955 + 9 bytes ssl_io_filter_cleanup(void * 0x00caf8e8) line 996 + 15 bytes run_cleanups(cleanup_t * * 0x00caf470) line 1951 + 13 bytes apr_pool_clear(apr_pool_t * 0x00caf460) line 693 + 12 bytes mpm_recycle_completion_context(CompContext * 0x00ca70a8) line 81 winnt_get_connection(CompContext * 0x00ca70a8) line 637 worker_main(long 499) line 692 + 9 bytes This error happens from "inside" mod_ssl, which corresponds to the fact that it does *not* occur when accessing the server through normal HTTP. Now my question is: Is is possible or "legal" that pools exist whose 'active->next' is set to NULL? If this is the case, the allocator_free() function would have to check for this. If simply tried the following: CHanged the beginning of the loop in allocator_free() to: /* Walk the list of submitted nodes and free them one by one, * shoving them in the right 'size' buckets as we go. */ do { /* *** NEW *** */ if (node == NULL) break; next = node->next; index = node->index; However, after this change, the server now crashes in apr_buckets_alloc.c, line 91 because its 'list' parameter equals NULL. Here is the full stack trace: apr_bucket_alloc(unsigned int 40, apr_bucket_alloc_t * 0x00000000) line 91 + 3 bytes apr_bucket_flush_create(apr_bucket_alloc_t * 0x00000000) line 38 + 11 bytes check_pipeline_flush(request_rec * 0x00d32260) line 211 + 13 bytes ap_process_request(request_rec * 0x00d32260) line 269 + 9 bytes ap_process_http_connection(conn_rec * 0x00ceed20) line 250 + 9 bytes ap_run_process_connection(conn_rec * 0x00ceed20) line 42 + 78 bytes ap_process_connection(conn_rec * 0x00ceed20, void * 0x00ceec50) line 177 worker_main(long 497) line 718 Apparently, I was wrong with my first assumption :-( However, I still believe that it must have something to do with how mod_ssl handles memory. But I really don't understand why this kind of error is only observed under Windows. Hmmm.
Just found out that active->next ist set to NULL in the line just before the allocator_free() call at the end of apr_pool_clear(): ----------------------------- snip ----------------------------- if (active->next == active) /* apr_pools.c, line 709 */ return; *active->ref = NULL; allocator_free(pool->allocator, active->next); ----------------------------- snip ----------------------------- The '*active->ref = NULL;' statement sets active->next to NULL. It was *NOT* NULL before that statement. Obviously, 'active->ref' pointed to the *own* 'next' field.
New Info: I am reproducing the error using MS IE 6.0 on a SSL page containing a number of images that are loaded via SSL, too. Nearly every time I refresh the page, the error occurs. However, when I remove the images from the page, so only the main HTML is loaded and no second connection is used by IE, the error does *not* occur anymore. I first thought it has something to do with the multithreading, but how I think that the whole thing is cause by IE aborting connections because I can reproduce the error even w/o images just by hitting ESC before the page loads - the connection is aborted, mod_ssl is shutting down the socket (SSL_Shutdown, ssl3_send_alert...) and this causes the error. P.S. I forgot: Previously, the page was generated by a Tomcat. However, the error appears also when directly putting the page in Apache's htdocs, so it has nothing to do with Tomcat. However, the error appears less often, I have to hit F5 (reload) in IE several times before it appears.
Possible workaround found: Make SSL shutdown a quiet one. Changed ssl_util_ssl.c, added 'ssl->quiet_shutdown = 1;' before OpenSSL's SSL_shutdown() is called: int SSL_smart_shutdown(SSL *ssl) { int i; int rc; /* * Repeat the calls, because SSL_shutdown internally dispatches through a * little state machine. Usually only one or two interation should be * needed, so we restrict the total number of restrictions in order to * avoid process hangs in case the client played bad with the socket * connection and OpenSSL cannot recognize it. */ rc = 0; ssl->quiet_shutdown = 1; for (i = 0; i < 4 /* max 2x pending + 2x data = 4 */; i++) { if ((rc = SSL_shutdown(ssl))) break; } return rc; } This removes the error completely, however I am not 100% sure about security issues (when does the client not get an abort notification, and when is that dangerous?)
I am sorry - it's me again ;-) But this problem doesn't want to let me go... In apr_pool_clear(), the following is done: 1. Destroy the subpools - *first* 2. Run cleanups 3. Free subprocesses Could it be that the cleanups do something wrong when they allocate memory (and the SSL shutdown *does*!!) and the subpools are already destroyed? When I change this to: 1. Run cleanups 2. Free subprocesses 3. Destroy the subpools - *last* Everything works fine (w/o my previous quiet_shutdown workaround!). Question: Is this ok, or does it maybe produce memory leaks? I admit that I still don't fully understand the pool concept :-) BTW: The same questions apply to the apr_pool_destroy() function which does the same three steps for the sub-pools.
Good analysis :) The problem is a misuse of pool cleanups, see patch on the other bug. *** This bug has been marked as a duplicate of 27945 ***
Thank you! I will search the database more thoroughly @ the next bug ;-)