29690 – allocator_free() crashes because of NULL-Pointer inside SSL_smart_shutdown()

Bug 29690 - allocator_free() crashes because of NULL-Pointer inside SSL_smart_shutdown()

Summary: allocator_free() crashes because of NULL-Pointer inside SSL_smart_shutdown()

Status:	CLOSED DUPLICATE of bug 27945

Alias:	None

Product:	Apache httpd-2
Classification:	Unclassified
Component:	mod_ssl (show other bugs)
Version:	2.0.49
Hardware:	PC All

Importance:	P3 critical (vote)
Target Milestone:	---
Assignee:	Apache HTTPD Bugs Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-06-19 20:41 UTC by Christian Rang
Modified:	2014-02-17 13:51 UTC (History)
CC List:	0 users

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Christian Rang 2004-06-19 20:41:11 UTC

Remark: This *may* be a duplicate of various reported bugs having to do with 
crashing worker processes under Win2K/XP with log entry "Parent: child process 
exited with status 3221225477 -- Restarting."

Description: Sometimes the winnt/child.c worker thread/process crashes in 
memory/unix/apr_pools.c, line 323 (next = node->next;) when accessing the 
server via SSL using mod_ssl with OpenSSL 0.9.7d. The worker is then restarted 
by the main process (with the error log entry above).

The error location in detail (all infos obtained using MS VC++ 6.0 debugger):

static APR_INLINE
void allocator_free(apr_allocator_t *allocator, apr_memnode_t *node)
{
/* *** (code deleted) *** */
    /* Walk the list of submitted nodes and free them one by one,
     * shoving them in the right 'size' buckets as we go.
     */
    do {
        next = node->next;     <<==== CRASHES *HERE*
        index = node->index;

The crash happens because 'node' (the second parameter) equals NULL. 
allocator_free() is called by apr_pool_clear() (same file, line 713). 
Obviously, the pool specified as the function parameter has 
   pool->active->next == NULL.

The complete stack trace is as follows:

allocator_free(apr_allocator_t * 0x0087ff88, apr_memnode_t * 0x00000000) line 
323 + 3 bytes
apr_pool_clear(apr_pool_t * 0x00ce8b80) line 713 + 18 bytes
core_output_filter(ap_filter_t * 0x00cc0d28, apr_bucket_brigade * 0x00cbdc68) 
line 4166
ap_pass_brigade(ap_filter_t * 0x00cc0d28, apr_bucket_brigade * 0x00cbdc68) line 
511 + 16 bytes
bio_filter_out_flush(bio_st * 0x00cbbbc0) line 155 + 25 bytes
bio_filter_out_write(bio_st * 0x00cbbbc0, const char * 0x00cd1e30, int 23) line 
220 + 9 bytes
BIO_write(bio_st * 0x00cbbbc0, const void * 0x00cd1e30, int 23) line 201 + 20 
bytes
ssl3_write_pending(ssl_st * 0x00cbb6c8, int 21, const unsigned char * 
0x00cbb9c4, unsigned int 2) line 742 + 50 bytes
do_ssl3_write(ssl_st * 0x00cbb6c8, int 21, const unsigned char * 0x00cbb9c4, 
unsigned int 2, int 0) line 713 + 21 bytes
ssl3_dispatch_alert(ssl_st * 0x00cbb6c8) line 1282 + 27 bytes
ssl3_send_alert(ssl_st * 0x00cbb6c8, int 1, int 0) line 1271 + 9 bytes
ssl3_shutdown(ssl_st * 0x00cbb6c8) line 1650 + 13 bytes
SSL_shutdown(ssl_st * 0x00cbb6c8) line 895 + 13 bytes
SSL_smart_shutdown(ssl_st * 0x00cbb6c8) line 188 + 9 bytes
ssl_filter_io_shutdown(ssl_filter_ctx_t * 0x00caf8e8, conn_rec * 0x00caf568, 
int 0) line 955 + 9 bytes
ssl_io_filter_cleanup(void * 0x00caf8e8) line 996 + 15 bytes
run_cleanups(cleanup_t * * 0x00caf470) line 1951 + 13 bytes
apr_pool_clear(apr_pool_t * 0x00caf460) line 693 + 12 bytes
mpm_recycle_completion_context(CompContext * 0x00ca70a8) line 81
winnt_get_connection(CompContext * 0x00ca70a8) line 637
worker_main(long 499) line 692 + 9 bytes

This error happens from "inside" mod_ssl, which corresponds to the fact that it 
does *not* occur when accessing the server through normal HTTP.

Now my question is: Is is possible or "legal" that pools exist whose 
'active->next' is set to NULL? If this is the case, the allocator_free() 
function would have to check for this. If simply tried the following: CHanged 
the beginning of the loop in allocator_free() to:

   /* Walk the list of submitted nodes and free them one by one,
     * shoving them in the right 'size' buckets as we go.
     */
    do {
        /* *** NEW *** */
        if (node == NULL)
            break;

        next = node->next;
        index = node->index;

However, after this change, the server now crashes in apr_buckets_alloc.c, line 
91 because its 'list' parameter equals NULL. Here is the full stack trace:

apr_bucket_alloc(unsigned int 40, apr_bucket_alloc_t * 0x00000000) line 91 + 3 
bytes
apr_bucket_flush_create(apr_bucket_alloc_t * 0x00000000) line 38 + 11 bytes
check_pipeline_flush(request_rec * 0x00d32260) line 211 + 13 bytes
ap_process_request(request_rec * 0x00d32260) line 269 + 9 bytes
ap_process_http_connection(conn_rec * 0x00ceed20) line 250 + 9 bytes
ap_run_process_connection(conn_rec * 0x00ceed20) line 42 + 78 bytes
ap_process_connection(conn_rec * 0x00ceed20, void * 0x00ceec50) line 177
worker_main(long 497) line 718

Apparently, I was wrong with my first assumption :-(

However, I still believe that it must have something to do with how mod_ssl 
handles memory. But I really don't understand why this kind of error is only 
observed under Windows. Hmmm.

Comment 1 Christian Rang 2004-06-19 21:20:03 UTC

Just found out that active->next ist set to NULL in the line just before the 
allocator_free() call at the end of apr_pool_clear():

----------------------------- snip -----------------------------
    if (active->next == active)  /* apr_pools.c, line 709 */
        return;

    *active->ref = NULL;
    allocator_free(pool->allocator, active->next);
----------------------------- snip -----------------------------

The '*active->ref = NULL;' statement sets active->next to NULL. It was *NOT* 
NULL before that statement.

Obviously, 'active->ref' pointed to the *own* 'next' field.

Comment 2 Christian Rang 2004-06-19 21:56:26 UTC

New Info: I am reproducing the error using MS IE 6.0 on a SSL page containing a 
number of images that are loaded via SSL, too. Nearly every time I refresh the 
page, the error occurs. However, when I remove the images from the page, so 
only the main HTML is loaded and no second connection is used by IE, the error 
does *not* occur anymore.

I first thought it has something to do with the multithreading, but how I think 
that the whole thing is cause by IE aborting connections because I can 
reproduce the error even w/o images just by hitting ESC before the page loads - 
the connection is aborted, mod_ssl is shutting down the socket (SSL_Shutdown, 
ssl3_send_alert...) and this causes the error.

P.S. I forgot: Previously, the page was generated by a Tomcat. However, the 
error appears also when directly putting the page in Apache's htdocs, so it has 
nothing to do with Tomcat. However, the error appears less often, I have to hit 
F5 (reload) in IE several times before it appears.

Comment 3 Christian Rang 2004-06-19 23:26:38 UTC

Possible workaround found: Make SSL shutdown a quiet one. 
Changed ssl_util_ssl.c, added 'ssl->quiet_shutdown = 1;' before OpenSSL's 
SSL_shutdown() is called:

int SSL_smart_shutdown(SSL *ssl)
{
    int i;
    int rc;

    /*
     * Repeat the calls, because SSL_shutdown internally dispatches through a
     * little state machine. Usually only one or two interation should be
     * needed, so we restrict the total number of restrictions in order to
     * avoid process hangs in case the client played bad with the socket
     * connection and OpenSSL cannot recognize it.
     */
    rc = 0;
    ssl->quiet_shutdown = 1;
    for (i = 0; i < 4 /* max 2x pending + 2x data = 4 */; i++) {
        if ((rc = SSL_shutdown(ssl)))
            break;
    }
    return rc;
}

This removes the error completely, however I am not 100% sure about security 
issues (when does the client not get an abort notification, and when is that 
dangerous?)

Comment 4 Christian Rang 2004-06-20 00:40:00 UTC

I am sorry - it's me again ;-) But this problem doesn't want to let me go...

In apr_pool_clear(), the following is done:
   1. Destroy the subpools - *first*
   2. Run cleanups
   3. Free subprocesses

Could it be that the cleanups do something wrong when they allocate memory (and 
the SSL shutdown *does*!!) and the subpools are already destroyed?

When I change this to:
   1. Run cleanups
   2. Free subprocesses
   3. Destroy the subpools - *last*

Everything works fine (w/o my previous quiet_shutdown workaround!). 

Question: Is this ok, or does it maybe produce memory leaks? I admit that I 
still don't fully understand the pool concept :-)

BTW: The same questions apply to the apr_pool_destroy() function which does the 
same three steps for the sub-pools.

Comment 5 Joe Orton 2004-06-20 08:08:41 UTC

Good analysis :) The problem is a misuse of pool cleanups, see patch on the
other bug.

*** This bug has been marked as a duplicate of 27945 ***

Comment 6 Christian Rang 2004-06-20 17:00:34 UTC

Thank you!
I will search the database more thoroughly @ the next bug ;-)