Bug 57177 - terminated by signal ILL in server/scoreboard.c line 369
Summary: terminated by signal ILL in server/scoreboard.c line 369
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: Core
Version: 2.4.10
Hardware: Sun Solaris
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2014-11-03 19:56 UTC by Dennis Clarke
Modified: 2015-06-17 14:32 UTC
CC List: 2 users



Attachments
Initialize the scoreboard API (not the data) on graceful restarts (5.50 KB, patch)
2014-11-04 11:47 UTC, Yann Ylavic
Initialize the scoreboard API (not the data) on graceful restarts (6.17 KB, patch)
2014-11-04 12:26 UTC, Yann Ylavic

Description Dennis Clarke 2014-11-03 19:56:13 UTC
Tough to describe other than as a core dump being thrown, thus :

(dbx) where                                                                                                                      
current thread: t@27
  [1] 0xffffffff75b010d8(0x1006a8c50, 0x18, 0x6, 0x1006a8c50, 0x1006c6030, 0x1006c6030), at 0xffffffff75b010d8 
=>[2] ap_increment_counts(sb = 0x1006a8bb0, r = 0x1006c6030), line 369 in "scoreboard.c"
  [3] eor_bucket_cleanup(data = 0x1006aaa68), line 37 in "eor_bucket.c"
  [4] run_cleanups(cref = 0x1006c6028), line 2352 in "apr_pools.c"
  [5] apr_pool_destroy(pool = 0x1006c5fb8), line 804 in "apr_pools.c"
  [6] eor_bucket_destroy(data = 0x1006c6030), line 90 in "eor_bucket.c"
  [7] ssl_io_filter_output(f = 0x1006a92e8, bb = 0x1006aff58), line 1659 in "ssl_engine_io.c"
  [8] ap_pass_brigade(next = 0x1006a92e8, bb = 0x1006aff58), line 590 in "util_filter.c"
  [9] ssl_io_filter_coalesce(f = 0x1006a92c0, bb = 0x1006aff58), line 1558 in "ssl_engine_io.c"
  [10] ap_pass_brigade(next = 0x1006a92c0, bb = 0x1006aff58), line 590 in "util_filter.c"
  [11] ap_process_request_after_handler(r = 0x1006c6030), line 256 in "http_request.c"
  [12] ap_process_async_request(r = 0x1006c6030), line 353 in "http_request.c"
  [13] ap_process_http_async_connection(c = 0x1006a8c50), line 143 in "http_core.c"
  [14] ap_process_http_connection(c = 0x1006a8c50), line 228 in "http_core.c"
  [15] ap_run_process_connection(0x1006a8c50, 0x1006a88c0, 0x10009f5c8, 0x1001c3540, 0x1006a8bb0, 0x1006aa848), at 0x10007e1d8 
  [16] process_socket(thd = 0x100281a98, p = 0x1006a8838, sock = 0x1006a88c0, cs = 0x1006a8bc8, my_child_num = 1, my_thread_num = 24), line 1035 in "event.c"
  [17] worker_thread(thd = 0x100281a98, dummy = 0x10065b070), line 1875 in "event.c"
  [18] dummy_worker(opaque = 0x100281a98), line 142 in "thread.c"
(dbx) list
  369           bytes = pfn_ap_logio_get_last_bytes(r->connection);
  370       }
  371       else if (r->method_number == M_GET && r->method[0] == 'H') {
  372           bytes = 0;
  373       }
  374       else {
  375           bytes = r->bytes_sent;
  376       }
  377   
  378   #ifdef HAVE_TIMES

Not sure how to proceed on this.
Comment 1 Eric Covener 2014-11-03 19:59:44 UTC
Looks like http://svn.apache.org/viewvc?view=revision&revision=886324. Can you try the patch?
Comment 2 Eric Covener 2014-11-03 20:02:44 UTC
Sorry, that change is much older than I thought.
Comment 3 Dennis Clarke 2014-11-03 22:53:36 UTC
I have isolated httpd into a foreground process with a near-identical
config to the server that throws the core dumps. I have not been
able to repeat the crash.

Thus far it seems to only happen when I do a graceful restart.
Thereafter I will see twenty or thirty core files an hour.

I am not yet able to isolate what exactly is happening, but I do have
core files, and that is a good start.

I'll update this if I see something interesting. In the meantime it MAY
be a valid bug. Maybe.
Comment 4 Dennis Clarke 2014-11-03 23:28:04 UTC
Looking at ap_sb_handle_t : 

struct ap_sb_handle_t {
    int child_num;
    int thread_num;
};


and then :

AP_DECLARE(void) ap_increment_counts(ap_sb_handle_t *sb, request_rec *r);

Thread 27 crashes in ap_increment_counts with pointers :

       (sb = 0x1006a8bb0, r = 0x1006c6030)

So looking at memory at *sb we should see two integers :

(dbx) x 0x1006a8bb0 / 16 x 
0x00000001006a8bb0: 0x0000 0x0001 0x0000 0x0018 0x0000 0x0000 0x0000 0x0000
0x00000001006a8bc0: 0x0000 0x0001 0x006a 0x8bc8 0x0000 0x0001 0x006a 0x8bc8
(dbx) print sizeof(int)   
sizeof(int ) = 4

So my guess is that child_num is 1 and thread_num is 0x18 = 24.

(dbx) print *sb
*sb = {
    child_num  = 1
    thread_num = 24
}

Sure, looks fine. 

Next we have that pesky request_rec r, which will be a bit verbose :

(dbx) print r 
r = 0x1006c6030

(dbx) print *r
*r = {
    pool                 = 0x1006c5fb8
    connection           = 0x1006a8c50
    server               = 0x10028c1e0
    next                 = 0x1006d0538
    prev                 = (nil)
    main                 = (nil)
    the_request          = 0x1006c7318 "GET / HTTP/1.0"
    assbackwards         = 0
    proxyreq             = 0
    header_only          = 0
    proto_num            = 1000
    protocol             = 0x1006c7348 "HTTP/1.0"
    hostname             = 0x1006c73e0 "dev.mydomain.ca"
    request_time         = 1415043364521974
    status_line          = (nil)
    status               = 401
    method_number        = 0
    method               = 0x1000b85b0 "GET"
    allowed              = 0
    allowed_xmethods     = (nil)
    allowed_methods      = 0x1006c62e8
    sent_bodyct          = 0
    bytes_sent           = 0
    mtime                = 0
    range                = (nil)
    clength              = 0
    chunked              = 0
    read_body            = 0
    read_chunked         = 0
    expecting_100        = 0
    kept_body            = (nil)
    body_table           = (nil)
    remaining            = 0
    read_length          = 0
    headers_in           = 0x1006c6328
    headers_out          = 0x1006c6a28
    err_headers_out      = 0x1006c6c70
    subprocess_env       = 0x1006c66a8
    notes                = 0x1006c6e10
    content_type         = (nil)
    handler              = (nil)
    content_encoding     = (nil)
    content_languages    = (nil)
    vlist_validator      = (nil)
    user                 = (nil)
    ap_auth_type         = 0x100293228 "basic"
    unparsed_uri         = 0x1006c7338 "/"
    uri                  = 0x1006c7340 "/"
    filename             = 0x1006c7690 "/usr/local/www/data/htdocs/"
    canonical_filename   = 0x1006c7690 "/usr/local/www/data/htdocs/"
    path_info            = 0x1006c75eb ""
    args                 = (nil)
    used_path_info       = 2
    eos_sent             = 1
    per_dir_config       = 0x1006c7a20
    request_config       = 0x1006c6fb0
    log                  = 0x10028c200
    log_id               = (nil)
    htaccess             = 0x1006c79f8
    output_filters       = 0x1006c7218
    input_filters        = 0x1006c73f0
    proto_output_filters = 0x1006c7218
    proto_input_filters  = 0x1006c73f0
    no_cache             = 0
    no_local_copy        = 1
    invoke_mtx           = 0x1006c74f0
    parsed_uri           = {
        scheme         = (nil)
        hostinfo       = (nil)
        user           = (nil)
        password       = (nil)
        hostname       = (nil)
        port_str       = (nil)
        path           = 0x1006c7340 "/"
        query          = (nil)
        fragment       = (nil)
        hostent        = (nil)
        port           = 0
        is_initialized = 1U
        dns_looked_up  = 0
        dns_resolved   = 0
    }
    finfo                = {
        pool       = 0x1006c5fb8
        valid      = 7598960
        protection = 1909
        filetype   = APR_DIR
        user       = 16411
        group      = 20002
        inode      = 424070U
        device     = 1099511693323U
        nlink      = 18
        size       = 21
        csize      = 1536
        atime      = 1415040555872951
        mtime      = 1414893925405310
        ctime      = 1414893925405310
        fname      = 0x1006c75d0 "/usr/local/www/data/htdocs/"
        name       = (nil)
        filehand   = (nil)
    }
    useragent_addr       = 0x1006a8a50
    useragent_ip         = 0x1006a90b0 "64.41.200.104"
}
(dbx)

Well, I can confirm that the uid and gid of the target file fname
are correct :

devl $ ls -ladin /usr/local/www/data/htdocs
  424070 drwxrwxr-x 18 16411 20002 21 Nov 2 02:05 /usr/local/www/data/htdocs


Here is where things get interesting. That IP address 64.41.200.104 belongs to
the folks at Qualys, whose SSL tester I use to verify my config. So whatever
they are firing at this server causes it to toss a core file.

I'll try that again and see if I can get a full trace.
Comment 5 Dennis Clarke 2014-11-03 23:59:32 UTC
I am happy to report that re-running the SSL scan tests from Qualys SSL Labs does not cause any ill effects UNTIL I issue an apachectl graceful.  Then I get a pile of core files tossed :

devl# cd  /var/crash/dev/coredump/
devl# ls -1tr *.core 
node_dev-host_sun4v-zone_z_001-time_1415058824-pid_14569-uid_80-gid_80-fid_httpd.core
node_dev-host_sun4v-zone_z_001-time_1415058857-pid_14567-uid_80-gid_80-fid_httpd.core
node_dev-host_sun4v-zone_z_001-time_1415058858-pid_14572-uid_80-gid_80-fid_httpd.core
node_dev-host_sun4v-zone_z_001-time_1415058858-pid_14566-uid_80-gid_80-fid_httpd.core
node_dev-host_sun4v-zone_z_001-time_1415058876-pid_14565-uid_80-gid_80-fid_httpd.core


I will check each one to see what is happening here after a graceful restart.
Comment 6 Rainer Jung 2014-11-04 08:42:18 UTC
There are two occurrences of

pfn_ap_logio_get_last_bytes = APR_RETRIEVE_OPTIONAL_FN(ap_logio_get_last_bytes);

in server/scoreboard.c. It looks like neither of them runs after your graceful restart. Could you add a log line to each of the two and check whether they run or not after gracefully restarting?

The line retrieves the address of a callback function in mod_logio, which is the function being called at the time of the crash. If the load order of the modules changes during a graceful restart, the old address no longer points to the correct function. Therefore the address needs to be updated after a restart.
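
For illustration, here is a minimal sketch of that pattern. The macro and
pointer names are the ones appearing in server/scoreboard.c; the helper
function and its log line are hypothetical, merely showing the kind of
logging suggested above:

#include "httpd.h"          /* conn_rec, server_rec */
#include "http_log.h"       /* ap_log_error() */
#include "apr_optional.h"   /* APR_DECLARE_OPTIONAL_FN, APR_RETRIEVE_OPTIONAL_FN */

/* Declaration of mod_logio's optional function; in httpd this comes from a
 * header and is repeated here only to keep the sketch self-contained. */
APR_DECLARE_OPTIONAL_FN(apr_off_t, ap_logio_get_last_bytes, (conn_rec *c));

/* Cached address of the optional function, as in server/scoreboard.c.
 * The pointer targets code inside mod_logio.so; if the DSO ends up at a
 * different address after a graceful restart, the cached value is stale
 * and calling it from ap_increment_counts() crashes (SIGILL/SIGBUS). */
static APR_OPTIONAL_FN_TYPE(ap_logio_get_last_bytes) *pfn_ap_logio_get_last_bytes;

/* Hypothetical helper: re-resolve the pointer and log where it points, to
 * verify whether this code actually runs again after "apachectl graceful". */
static void resolve_logio_fn(server_rec *s)
{
    pfn_ap_logio_get_last_bytes = APR_RETRIEVE_OPTIONAL_FN(ap_logio_get_last_bytes);
    ap_log_error(APLOG_MARK, APLOG_NOTICE, 0, s,
                 "scoreboard: ap_logio_get_last_bytes resolved to %pp",
                 (void *)pfn_ap_logio_get_last_bytes);
}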
Comment 7 Rainer Jung 2014-11-04 08:43:06 UTC
And, as a short-term workaround: if you do not need mod_logio, you can disable the module by not loading it. That should stop your problem (but doesn't fix the underlying code bug).
Comment 8 Yann Ylavic 2014-11-04 11:47:48 UTC
Created attachment 32188 [details]
Initialize the scoreboard API (not the data) on graceful restarts

One way to fix the issue.

This patch moves the static optional function(s) initialization from ap_calc_scoreboard_size() to ap_init_scoreboard(), so that when the latter is called with no data to initialize (NULL), it will only make its optional function(s) point to the right location.

Hence each MPM is also modified to call ap_init_scoreboard(NULL) on graceful restarts.
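
In rough outline (a sketch of the idea, not the literal code in the attached
patch; is_graceful here stands for whatever flag the particular MPM uses to
detect a graceful restart):

AP_DECLARE(void) ap_init_scoreboard(void *shared_score)
{
    /* Always (re-)resolve the optional function: after a graceful restart
     * mod_logio.so may have been reloaded at a different address. */
    pfn_ap_logio_get_last_bytes = APR_RETRIEVE_OPTIONAL_FN(ap_logio_get_last_bytes);

    if (shared_score == NULL) {
        return; /* graceful restart: leave the existing scoreboard data alone */
    }

    /* ... first initialization: lay out ap_scoreboard_image over shared_score ... */
}

and each MPM, when restarting gracefully, does something along the lines of:

    if (is_graceful) {
        ap_init_scoreboard(NULL); /* refresh the scoreboard API only */
    }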

Dennis, can you please verify this works for you (with mod_logio still loaded)?
Comment 9 Yann Ylavic 2014-11-04 12:26:11 UTC
Created attachment 32189 [details]
Initialize the scoreboard API (not the data) on graceful restarts

Minor change to call ap_init_scoreboard(NULL) from ap_create_scoreboard() only when the scoreboard image already exists.

This avoids a double APR_RETRIEVE_OPTIONAL_FN() call when the image is to be created, and allows such initializations to be centralized in ap_init_scoreboard().
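
In sketch form (again, not the literal patch), the entry of
ap_create_scoreboard() then behaves roughly like this:

AP_DECLARE(apr_status_t) ap_create_scoreboard(apr_pool_t *p, ap_scoreboard_e sb_type)
{
    if (ap_scoreboard_image) {
        /* Graceful restart: the shared image already exists, so only the
         * scoreboard API (its optional function pointers) is re-initialized. */
        ap_init_scoreboard(NULL);
        return APR_SUCCESS;
    }

    /* ... first start: compute the size, create the shared memory segment,
     * then call ap_init_scoreboard() with the freshly allocated storage,
     * which also resolves the optional function pointers exactly once ... */
    return APR_SUCCESS;
}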
Comment 10 Dennis Clarke 2014-11-04 19:59:48 UTC
Entirely tangential comment :
Working on it. I keep getting interrupted by people and their interpretation
of "work".
Comment 11 Yann Ylavic 2015-01-23 08:34:09 UTC
Dennis, any news on this? Did you try the patch (attachment 32189 [details])?
Comment 12 Dennis Clarke 2015-01-23 22:04:01 UTC
Sorry for the delay. I have been distracted by work on other items, and that includes this thing : 

    https://issues.apache.org/bugzilla/show_bug.cgi?id=56324

Also, we have a large exploit in PHP today, a new OpenSSL major release, CMake issues, etc.

I promise I will get into this this weekend.
Comment 13 jw 2015-02-23 20:07:29 UTC
I've seen that one on CentOS 7 using httpd-2.4.6-19.el7.centos.x86_64:

(gdb) bt
#0  0x00007f96e0e7eb60 in ?? ()
#1  0x00007f96e7d253c2 in ap_increment_counts (sb=0x7f96e9fc2350, r=r@entry=0x7f96e9fc8190)
    at scoreboard.c:369
#2  0x00007f96e7d36190 in eor_bucket_cleanup (data=<optimized out>) at eor_bucket.c:37
#3  0x00007f96e69f615e in run_cleanups (cref=0x7f96e9fc8188) at memory/unix/apr_pools.c:2352
#4  apr_pool_destroy (pool=0x7f96e9fc8118) at memory/unix/apr_pools.c:804
#5  0x00007f96e7d362a6 in remove_empty_buckets (bb=bb@entry=0x7f96e9fc2b28) at core_filters.c:716
#6  0x00007f96e7d36648 in send_brigade_nonblocking (s=s@entry=0x7f96e9fc2160, bb=bb@entry=0x7f96e9fc2b28, 
    bytes_written=bytes_written@entry=0x7f96e9fc2a60, c=c@entry=0x7f96e9fc2350) at core_filters.c:705
#7  0x00007f96e7d37301 in send_brigade_blocking (c=0x7f96e9fc2350, bytes_written=0x7f96e9fc2a60, 
    bb=0x7f96e9fc2b28, s=0x7f96e9fc2160) at core_filters.c:729
#8  ap_core_output_filter (f=0x7f96e9fc2928, new_bb=0x7f96e9fc2b28) at core_filters.c:542
#9  0x00007f96e7d50489 in ap_process_request (r=r@entry=0x7f96e9fc8190) at http_request.c:369
#10 0x00007f96e7d4cdb2 in ap_process_http_sync_connection (c=0x7f96e9fc2350) at http_core.c:190
#11 ap_process_http_connection (c=0x7f96e9fc2350) at http_core.c:231
#12 0x00007f96e7d44e30 in ap_run_process_connection (c=0x7f96e9fc2350) at connection.c:41
#13 0x00007f96e7d45218 in ap_process_connection (c=c@entry=0x7f96e9fc2350, csd=<optimized out>)
    at connection.c:202
#14 0x00007f96dd6947ef in child_main (child_num_arg=child_num_arg@entry=1) at prefork.c:704
#15 0x00007f96dd694a26 in make_child (s=0x7f96e9eb0348, slot=slot@entry=1) at prefork.c:800
#16 0x00007f96dd695191 in prefork_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>)
    at prefork.c:1051
#17 0x00007f96e7d220fe in ap_run_mpm (pconf=0x7f96e9e85138, plog=0x7f96e9eb2358, s=0x7f96e9eb0348)
    at mpm_common.c:96
#18 0x00007f96e7d1b726 in main (argc=3, argv=0x7fff42952e28) at main.c:777


Easily reproducible: just send SIGHUP while under load. I can also confirm that it is not reproducible without mod_logio.
Comment 14 Yann Ylavic 2015-02-24 09:53:42 UTC
(In reply to jw from comment #13)
> 
> Easily reproducible: just send SIGHUP while under load. I can also confirm
> that it is not reproducible without mod_logio.


Does the patch from attachment 32189 [details] help?
Comment 15 Dennis Clarke 2015-03-22 09:35:23 UTC
Minor update. I am moving on to checking this on the recent 2.4.12 release, thus : 

node000 # /usr/local/bin/httpd -V
Server version: Apache/2.4.12 (Unix)
Server built:   Mar 22 2015 08:29:39
Server's Module Magic Number: 20120211:41
Server loaded:  APR 1.5.1, APR-UTIL 1.5.3
Compiled using: APR 1.5.1, APR-UTIL 1.5.4
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_PROC_PTHREAD_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/usr/local"
 -D SUEXEC_BIN="/usr/local/bin/suexec"
 -D DEFAULT_PIDLOG="www/var/logs/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="www/conf/mime.types"
 -D SERVER_CONFIG_FILE="www/conf/httpd.conf"


The first odd problem I wonder about is that the loaded APR-UTIL version does not match the version I compiled with. Regardless, I will get back to the main issue with scoreboard.c if it still exists.
Comment 16 Yann Ylavic 2015-03-23 08:08:33 UTC
Simpler patch committed in r1668532.
Comment 17 Ralf Lehmann 2015-06-15 14:13:33 UTC
Hi,

I stumbled across the same problem last week. I am using the latest official Apache 2.4.12 sources with gcc 3.4.3 on Solaris 10.

Unfortunately the simpler patch r1668532 (which changed core.c and scoreboard.c) did not fix the problem. Same bus error, same stack trace.

The configure option --enable-pie didn't work either.
The configure option --enable-logio=static works: no more bus errors.

But of course, with this error I think there is room for more problems. Currently I am seeing more Apache processes in state "G", and they increase with every graceful restart. I have to investigate further.
Comment 18 Yann Ylavic 2015-06-15 14:34:36 UTC
The fix backported to 2.4.13 (available in the upcoming 2.4.15) also includes r1668553 (in addition to r1668532).

Can you check whether it works with both patches applied?
Comment 19 Ralf Lehmann 2015-06-17 07:08:47 UTC
Just noticed that core.c does not compile with the Apache 2.4.12 source.
So my previous comment is not quite right, because I am sure I ran the unpatched httpd.

So I think there is no patch for 2.4.12, right?

Thanks
Comment 20 Rainer Jung 2015-06-17 07:23:17 UTC
Can you try the workaround of disabling mod_logio (not using LoadModule to load it)? Does that work for you?
Comment 21 Rainer Jung 2015-06-17 07:35:06 UTC
(In reply to Rainer Jung from comment #20)
> Can you try the workaround of disabling mod_logio (not using LoadModule to
> load it)? Does that work for you?

More precisely: stop the web server, remove the LoadModule for mod_logio, then start it and after that try to do whatever caused the crash for you.
Comment 22 Ralf Lehmann 2015-06-17 14:08:58 UTC
Disabling mod_logio or making it static in the web server works. No more bus errors.
Comment 23 Ralf Lehmann 2015-06-17 14:23:50 UTC
I can confirm that the bug is also gone in 2.4.14.
Patches seem to be ok.

Unfortunately every graceful restart of Apache creates several new processes stuck in the "G" state. pstack does not show any difference between the stuck processes and the ones still working.
This was already the case with 2.4.12.

Here is a sample stack trace, but currently I can't see a direct connection to this bug:

10108:  /opt/apache24/bin/httpd -f /opt/apache24/conf/httpd.conf-UGIRQJ -DIT -
 ffffffff7dbdcbc0 lwp_mutex_timedlock (ffffffff7a200000, 0)
 ffffffff7dbd0b74 mutex_lock_internal (ffffffff7a200000, 0, 1, ffffffff7dd00200, 51, 1) + 1dc
 ffffffff7e827584 proc_mutex_proc_pthread_acquire (1003188a8, 1003382d0, 0, 0, 0, 100324cc0) + 10
 ffffffff7e828000 apr_proc_mutex_lock (1003188a8, 0, 2, 0, 0, 1003268a8) + 18
 00000001000a8d80 accept_mutex_on (8, 2, 0, 11, 1003208c8, 1001ccb64) + 20
 00000001000aae60 child_main (11, 1000a993c, 0, ffffffff7dd00200, 0, ffffffff7dd4a360) + 9f4
 00000001000ab658 make_child (10021d708, 11, 0, 0, 10021d708, ffffffff7ffff610) + 318
 00000001000acde0 prefork_run (1001d9ad8, 100207cc8, 10021d708, 10021d708, 100313b60, 0) + dc8
 0000000100038ca4 ap_run_mpm (1001d9ad8, 100207cc8, 10021d708, 10021d708, 101010100000000, ffffffff7ffff828) + c4
 000000010002b29c main (6, ffffffff7ffff968, ffffffff7ffff9a0, 1001cd948, 100000000, ffffffff7dd00200) + 19f8
 0000000100027ccc _start (0, 0, 0, 0, 0, 0) + 7c
Comment 24 Yann Ylavic 2015-06-17 14:27:47 UTC
(In reply to Ralf Lehmann from comment #23)
> I can confirm that the bug is also gone in 2.4.14.
> Patches seem to be ok.

Thanks for testing, I'm closing this report.
Comment 25 Yann Ylavic 2015-06-17 14:32:13 UTC
(In reply to Ralf Lehmann from comment #23)
> 
> Unfortunately every graceful restart of apache creates several new processes
> stuck in "G" state. pstack does not show any difference between the stuck
> processes and the ones still working.

It may be related to Bug 49504; otherwise, please open a new report.