Tough to describe other than a core dump being thrown thus : (dbx) where current thread: t@27 [1] 0xffffffff75b010d8(0x1006a8c50, 0x18, 0x6, 0x1006a8c50, 0x1006c6030, 0x1006c6030), at 0xffffffff75b010d8 =>[2] ap_increment_counts(sb = 0x1006a8bb0, r = 0x1006c6030), line 369 in "scoreboard.c" [3] eor_bucket_cleanup(data = 0x1006aaa68), line 37 in "eor_bucket.c" [4] run_cleanups(cref = 0x1006c6028), line 2352 in "apr_pools.c" [5] apr_pool_destroy(pool = 0x1006c5fb8), line 804 in "apr_pools.c" [6] eor_bucket_destroy(data = 0x1006c6030), line 90 in "eor_bucket.c" [7] ssl_io_filter_output(f = 0x1006a92e8, bb = 0x1006aff58), line 1659 in "ssl_engine_io.c" [8] ap_pass_brigade(next = 0x1006a92e8, bb = 0x1006aff58), line 590 in "util_filter.c" [9] ssl_io_filter_coalesce(f = 0x1006a92c0, bb = 0x1006aff58), line 1558 in "ssl_engine_io.c" [10] ap_pass_brigade(next = 0x1006a92c0, bb = 0x1006aff58), line 590 in "util_filter.c" [11] ap_process_request_after_handler(r = 0x1006c6030), line 256 in "http_request.c" [12] ap_process_async_request(r = 0x1006c6030), line 353 in "http_request.c" [13] ap_process_http_async_connection(c = 0x1006a8c50), line 143 in "http_core.c" [14] ap_process_http_connection(c = 0x1006a8c50), line 228 in "http_core.c" [15] ap_run_process_connection(0x1006a8c50, 0x1006a88c0, 0x10009f5c8, 0x1001c3540, 0x1006a8bb0, 0x1006aa848), at 0x10007e1d8 [16] process_socket(thd = 0x100281a98, p = 0x1006a8838, sock = 0x1006a88c0, cs = 0x1006a8bc8, my_child_num = 1, my_thread_num = 24), line 1035 in "event.c" [17] worker_thread(thd = 0x100281a98, dummy = 0x10065b070), line 1875 in "event.c" [18] dummy_worker(opaque = 0x100281a98), line 142 in "thread.c" (dbx) list 369 bytes = pfn_ap_logio_get_last_bytes(r->connection); 370 } 371 else if (r->method_number == M_GET && r->method[0] == 'H') { 372 bytes = 0; 373 } 374 else { 375 bytes = r->bytes_sent; 376 } 377 378 #ifdef HAVE_TIMES Not sure how to proceed on this.
looks like http://svn.apache.org/viewvc?view=revision&revision=886324 can you try the patch?
sorry, that change is much older than I thought
I have isolated httpd into a forground process with a very similar to near identical config as the server that throws the core dumps. Have not been able to repeat the crash. Thus far it seems to only happen when I do a graceful restart. Theree after I will see twenty or thirty core files an hour. I am not yet able to isolate what exactly is happening but I do have core files and that is a good start. I'll update this if I see something interesting. In the meanwhile it MAY be a valid bug. Maybe.
Looking at ap_sb_handle_t : struct ap_sb_handle_t { int child_num; int thread_num; }; and then : AP_DECLARE(void) ap_increment_counts(ap_sb_handle_t *sb, request_rec *r); Thread 27 crashes in ap_increment_counts with pointers : (sb = 0x1006a8bb0, r = 0x1006c6030) So looking at memory at *sb we should see two integers : (dbx) x 0x1006a8bb0 / 16 x 0x00000001006a8bb0: 0x0000 0x0001 0x0000 0x0018 0x0000 0x0000 0x0000 0x0000 0x00000001006a8bc0: 0x0000 0x0001 0x006a 0x8bc8 0x0000 0x0001 0x006a 0x8bc8 (dbx) print sizeof(int) sizeof(int ) = 4 So my guess is that child_num is 1 and thread_num is 0x18h = 24. (dbx) print *sb *sb = { child_num = 1 thread_num = 24 } Sure, looks fine. Next we have that pesky request_rec r which will be a bit verbose : (dbx) print r r = 0x1006c6030 (dbx) print *r *r = { pool = 0x1006c5fb8 connection = 0x1006a8c50 server = 0x10028c1e0 next = 0x1006d0538 prev = (nil) main = (nil) the_request = 0x1006c7318 "GET / HTTP/1.0" assbackwards = 0 proxyreq = 0 header_only = 0 proto_num = 1000 protocol = 0x1006c7348 "HTTP/1.0" hostname = 0x1006c73e0 "dev.mydomain.ca" request_time = 1415043364521974 status_line = (nil) status = 401 method_number = 0 method = 0x1000b85b0 "GET" allowed = 0 allowed_xmethods = (nil) allowed_methods = 0x1006c62e8 sent_bodyct = 0 bytes_sent = 0 mtime = 0 range = (nil) clength = 0 chunked = 0 read_body = 0 read_chunked = 0 expecting_100 = 0 kept_body = (nil) body_table = (nil) remaining = 0 read_length = 0 headers_in = 0x1006c6328 headers_out = 0x1006c6a28 err_headers_out = 0x1006c6c70 subprocess_env = 0x1006c66a8 notes = 0x1006c6e10 content_type = (nil) handler = (nil) content_encoding = (nil) content_languages = (nil) vlist_validator = (nil) user = (nil) ap_auth_type = 0x100293228 "basic" unparsed_uri = 0x1006c7338 "/" uri = 0x1006c7340 "/" filename = 0x1006c7690 "/usr/local/www/data/htdocs/" canonical_filename = 0x1006c7690 "/usr/local/www/data/htdocs/" path_info = 0x1006c75eb "" args = (nil) used_path_info = 2 eos_sent = 1 per_dir_config = 0x1006c7a20 request_config = 0x1006c6fb0 log = 0x10028c200 log_id = (nil) htaccess = 0x1006c79f8 output_filters = 0x1006c7218 input_filters = 0x1006c73f0 proto_output_filters = 0x1006c7218 proto_input_filters = 0x1006c73f0 no_cache = 0 no_local_copy = 1 invoke_mtx = 0x1006c74f0 parsed_uri = { scheme = (nil) hostinfo = (nil) user = (nil) password = (nil) hostname = (nil) port_str = (nil) path = 0x1006c7340 "/" query = (nil) fragment = (nil) hostent = (nil) port = 0 is_initialized = 1U dns_looked_up = 0 dns_resolved = 0 } finfo = { pool = 0x1006c5fb8 valid = 7598960 protection = 1909 filetype = APR_DIR user = 16411 group = 20002 inode = 424070U device = 1099511693323U nlink = 18 size = 21 csize = 1536 atime = 1415040555872951 mtime = 1414893925405310 ctime = 1414893925405310 fname = 0x1006c75d0 "/usr/local/www/data/htdocs/" name = (nil) filehand = (nil) } useragent_addr = 0x1006a8a50 useragent_ip = 0x1006a90b0 "64.41.200.104" } (dbx) Well I can confirm that the uid and gid of the target file fname is correct : devl $ ls -ladin /usr/local/www/data/htdocs 424070 drwxrwxr-x 18 16411 20002 21 Nov 2 02:05 /usr/local/www/data/htdocs Here is where things get interesting. That ip address 64.41.200.104 is the guys at Qualys where I use their SSL tester to verify my config. So whatever they are fireing at this server causes it to toss a core file. I'll try that again and see if I can get a full trace.
Am happy to report that re-running the SSL scan tests from Qualys SSL Labs does not cause any ill effects UNTIL I issue an apachectl graceful. Then I get a pile of core files tossed : devl# cd /var/crash/dev/coredump/ devl# ls -1tr *.core node_dev-host_sun4v-zone_z_001-time_1415058824-pid_14569-uid_80-gid_80-fid_httpd.core node_dev-host_sun4v-zone_z_001-time_1415058857-pid_14567-uid_80-gid_80-fid_httpd.core node_dev-host_sun4v-zone_z_001-time_1415058858-pid_14572-uid_80-gid_80-fid_httpd.core node_dev-host_sun4v-zone_z_001-time_1415058858-pid_14566-uid_80-gid_80-fid_httpd.core node_dev-host_sun4v-zone_z_001-time_1415058876-pid_14565-uid_80-gid_80-fid_httpd.core I will check each one to see what is happening here after a graceful restart.
There's two occurences of pfn_ap_logio_get_last_bytes = APR_RETRIEVE_OPTIONAL_FN(ap_logio_get_last_bytes); in server/scoreboard.c. It looks like none of the two runs after your graceful restart. Could you add a log line to each of the two and check, whether they run or not after gracefully restarting? The line retrieves the address of a callback function in mod_logio which is called during the crash. If the load order of the modules changes duriung graceful restart, then the old address doesn't point any longer to the correct function. Therefore the address needs to be updated after restart.
And: as a short term workaround: if you do not need mod_logio, you can disable the module by not loading it. That should stop your problem (but doesn't fix the underlying code bug).
Created attachment 32188 [details] Initialize the scoreboard API (not the data) on graceful restarts One way to fix the issue. This patch moves the static optional function(s) initialization from ap_calc_scoreboard_size() to ap_init_scoreboard(), so that when the latter is called with no data to initialize (NULL), it will only make its optional function(s) point to the right location. Hence each MPM is also modified to call ap_init_scoreboard(NULL) on graceful restarts. Dennis, can you please verify this works for you (mod_logio loaded still)?
Created attachment 32189 [details] Initialize the scoreboard API (not the data) on graceful restarts Minor change to call ap_init_scoreboard(NULL) from ap_create_scoreboard() when the scoreboard image exists already (only). This avoids a double APR_RETRIEVE_OPTIONAL_FN() call when the image is to be created and allows for such initializations to be centralized in ap_init_scoreboard().
entirely tangential comment : Working on it. I keep getting interrupted by people and their interpretation of "work".
Dennys, any news on this, did you try the patch (attachment 32189 [details])?
Sorry for the delay. I have been distracted by work on oher items and that includes this thing : https://issues.apache.org/bugzilla/show_bug.cgi?id=56324 Also we have a large exploit in PHP today and a new OpenSSL major release and CMAKE issues etc etc. Promise I will get into this on this weekend.
I've seen that one on CentOS 7 using httpd-2.4.6-19.el7.centos.x86_64: (gdb) bt #0 0x00007f96e0e7eb60 in ?? () #1 0x00007f96e7d253c2 in ap_increment_counts (sb=0x7f96e9fc2350, r=r@entry=0x7f96e9fc8190) at scoreboard.c:369 #2 0x00007f96e7d36190 in eor_bucket_cleanup (data=<optimized out>) at eor_bucket.c:37 #3 0x00007f96e69f615e in run_cleanups (cref=0x7f96e9fc8188) at memory/unix/apr_pools.c:2352 #4 apr_pool_destroy (pool=0x7f96e9fc8118) at memory/unix/apr_pools.c:804 #5 0x00007f96e7d362a6 in remove_empty_buckets (bb=bb@entry=0x7f96e9fc2b28) at core_filters.c:716 #6 0x00007f96e7d36648 in send_brigade_nonblocking (s=s@entry=0x7f96e9fc2160, bb=bb@entry=0x7f96e9fc2b28, bytes_written=bytes_written@entry=0x7f96e9fc2a60, c=c@entry=0x7f96e9fc2350) at core_filters.c:705 #7 0x00007f96e7d37301 in send_brigade_blocking (c=0x7f96e9fc2350, bytes_written=0x7f96e9fc2a60, bb=0x7f96e9fc2b28, s=0x7f96e9fc2160) at core_filters.c:729 #8 ap_core_output_filter (f=0x7f96e9fc2928, new_bb=0x7f96e9fc2b28) at core_filters.c:542 #9 0x00007f96e7d50489 in ap_process_request (r=r@entry=0x7f96e9fc8190) at http_request.c:369 #10 0x00007f96e7d4cdb2 in ap_process_http_sync_connection (c=0x7f96e9fc2350) at http_core.c:190 #11 ap_process_http_connection (c=0x7f96e9fc2350) at http_core.c:231 #12 0x00007f96e7d44e30 in ap_run_process_connection (c=0x7f96e9fc2350) at connection.c:41 #13 0x00007f96e7d45218 in ap_process_connection (c=c@entry=0x7f96e9fc2350, csd=<optimized out>) at connection.c:202 #14 0x00007f96dd6947ef in child_main (child_num_arg=child_num_arg@entry=1) at prefork.c:704 #15 0x00007f96dd694a26 in make_child (s=0x7f96e9eb0348, slot=slot@entry=1) at prefork.c:800 #16 0x00007f96dd695191 in prefork_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>) at prefork.c:1051 #17 0x00007f96e7d220fe in ap_run_mpm (pconf=0x7f96e9e85138, plog=0x7f96e9eb2358, s=0x7f96e9eb0348) at mpm_common.c:96 #18 0x00007f96e7d1b726 in main (argc=3, argv=0x7fff42952e28) at main.c:777 Easy reproducable, just sending SIGHUP while under load. I can also confirm that it is nor reproducable without mod_logio.
(In reply to jw from comment #13) > > Easy reproducable, just sending SIGHUP while under load. I can also confirm > that it is nor reproducable without mod_logio. Does the patch from attachment 32189 [details] help?
Minor update. I am moving on with checking this on recent 2.4.12 release thus : node000 # /usr/local/bin/httpd -V Server version: Apache/2.4.12 (Unix) Server built: Mar 22 2015 08:29:39 Server's Module Magic Number: 20120211:41 Server loaded: APR 1.5.1, APR-UTIL 1.5.3 Compiled using: APR 1.5.1, APR-UTIL 1.5.4 Architecture: 64-bit Server MPM: event threaded: yes (fixed thread count) forked: yes (variable process count) Server compiled with.... -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_PROC_PTHREAD_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=256 -D HTTPD_ROOT="/usr/local" -D SUEXEC_BIN="/usr/local/bin/suexec" -D DEFAULT_PIDLOG="www/var/logs/httpd.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_ERRORLOG="logs/error_log" -D AP_TYPES_CONFIG_FILE="www/conf/mime.types" -D SERVER_CONFIG_FILE="www/conf/httpd.conf" First odd problem that I wonder about is that the loaded APR-UTIL version does not match the version I compiled with. Regardless, I will get back to the main issue with scoreboard.c if it still exists.
Simpler patch committed in r1668532.
Hi, I stumbled across the same problem last week. I am using the latest official Apache 2.4.12 sources with a gcc 3.4.3 on Solaris 10. Unfortunately the simpler patch r1668532 (changed core.c, scoreboard.c) did not fix the problem. Same bus error, same stacktrace. configure option --enable-pie didn't work either configure option --enable-logio=static works. No more bus errors. But of course with this error, I think there is room for more problems. Currently I am experiencing more apache processes in state "G" and they increase with every graceful restart. I have to investigate further.
The fix backported to 2.4.13 (available in upcoming 2.4.15) also includes r1668553 (in addition to r1668532). Can you try if both patches applied work?
Just noticed, that core.c does not compile with Apache source 2.4.12. So my previous comment is not quite right, because I am sure I ran with the unpatched httpd. So I think there is no patch for 2.4.12, right? Thanks
Can you try the workaround of disabling mod_logio (not using LoadModule to load it). Does that work for you?
(In reply to Rainer Jung from comment #20) > Can you try the workaround of disabling mod_logio (not using LoadModule to > load it). Does that work for you? More precisely: stop the web server, remove the LoadModule for mod_logio, then start it and after that try to do whatever caused the crash for you.
Disableing mod_logio or make it static in the webserver works. No more bus errors
I can confirm that the bug is also gone in 2.4.14. Patches seem to be ok. Unfortunately every graceful restart of apache creates several new processes stuck in "G" state. pstack does not show any difference between the stuck processes and the ones still working. This was already the case with 2.4.12 Here is s sample stack trace, but currently I can't see a direct connection to this bug: 10108: /opt/apache24/bin/httpd -f /opt/apache24/conf/httpd.conf-UGIRQJ -DIT - ffffffff7dbdcbc0 lwp_mutex_timedlock (ffffffff7a200000, 0) ffffffff7dbd0b74 mutex_lock_internal (ffffffff7a200000, 0, 1, ffffffff7dd00200, 51, 1) + 1dc ffffffff7e827584 proc_mutex_proc_pthread_acquire (1003188a8, 1003382d0, 0, 0, 0, 100324cc0) + 10 ffffffff7e828000 apr_proc_mutex_lock (1003188a8, 0, 2, 0, 0, 1003268a8) + 18 00000001000a8d80 accept_mutex_on (8, 2, 0, 11, 1003208c8, 1001ccb64) + 20 00000001000aae60 child_main (11, 1000a993c, 0, ffffffff7dd00200, 0, ffffffff7dd4a360) + 9f4 00000001000ab658 make_child (10021d708, 11, 0, 0, 10021d708, ffffffff7ffff610) + 318 00000001000acde0 prefork_run (1001d9ad8, 100207cc8, 10021d708, 10021d708, 100313b60, 0) + dc8 0000000100038ca4 ap_run_mpm (1001d9ad8, 100207cc8, 10021d708, 10021d708, 101010100000000, ffffffff7ffff828) + c4 000000010002b29c main (6, ffffffff7ffff968, ffffffff7ffff9a0, 1001cd948, 100000000, ffffffff7dd00200) + 19f8 0000000100027ccc _start (0, 0, 0, 0, 0, 0) + 7c
(In reply to Ralf Lehmann from comment #23) > I can confirm that the bug is also gone in 2.4.14. > Patches seem to be ok. Thanks for testing, I'm closing this report.
(In reply to Ralf Lehmann from comment #23) > > Unfortunately every graceful restart of apache creates several new processes > stuck in "G" state. pstack does not show any difference between the stuck > processes and the ones still working. It may be related to Bug 49504, otherwise please open a new report.