We're running Apache 2.0.54 on multiple Solaris 8 machines, using the prefork MPM with mmap and sendfile turned off, sharing an NFS-mounted docroot hosted by a NetApp. Occasionally child processes are dying with either a bus error or a segmentation fault - it seems to take on the order of 1-2 hours of high-load production traffic to happen. From the cores I've inspected, they all seem to occur while a request is being parsed by mod_include (so I've marked that as the component - though something else may be at fault).

Having searched around, bug 34708 and bug 19325 look as if they might be related to this. Certainly the files in our docroot do get modified: the process is for a file to be removed and then recreated by another system, so Apache might be getting upset by a file disappearing mid-request. It's unlikely (though I guess possible) that a file would be removed and recreated quickly enough for Apache to have one request on the original file and the subsequent request on the new file (ie: without accessing it in the period of "no file"). My attempts to recreate this scenario without production traffic have failed, so it's been difficult to rule out particular system components (though we have unloaded as much of our own code as possible). Additionally, we are seeing some errors on a couple of servers that have a different docroot (though still on NFS) where files seldom change (or at least not at the times we've seen errors).

Based on comments in one of the bugs above, and looking at the head revision of apr-util trunk, I've also patched apr_buckets_file.c (a two-liner, so that it checks for APR_EOF) - I'll attach the diff. We still get errors (though the patch doesn't appear to hurt, so I've left it in for now)...

I've got two backtraces which I'll attach. Unfortunately we weren't initially compiling with debugging symbols, though we have now enabled that. Since enabling it, we've only had one error - otherwise I'd attach more.
I'd offer to upload the core file, but it's 100Mb as we have a huge chunk of shared memory in use. I don't think the shared memory is to blame, both because the other servers don't use it (I'm waiting for a segfault there so I can upload one of those cores instead) and because the points of failure don't implicate it at all. Any questions or things you'd like me to try - please ask. Really keen to get to the bottom of this.
Created attachment 15850 [details] patch in use against srclib/apr-util/buckets/apr_buckets_file.c
Created attachment 15851 [details] first backtrace - no debugging symbols
Created attachment 15852 [details] 'full' backtrace - with debugging symbols
#1  0xff3676a0 in apr_brigade_create (p=0x0, list=0x3c7ef0) at apr_brigade.c:67
        b = (apr_bucket_brigade *) 0x3c8c80
#2  0x000d17e0 in core_output_filter (f=0x3c3f10, b=0x3f4130) at core.c:4047

The p = NULL bit looks like the problem, unless that's a side-effect of stack corruption. In gdb, if you "up 2" into core_output_filter, what does

  p *f
  p *f->conn

give? If the pool really is NULL, this could just be a symptom of hitting an out-of-memory condition - is that plausible for this system? (httpd should be calling abort() in that case, but actually doesn't.)
Quick response - thanks. :)

(In reply to comment #4)
> #1 0xff3676a0 in apr_brigade_create (p=0x0, list=0x3c7ef0) at apr_brigade.c:67
>         b = (apr_bucket_brigade *) 0x3c8c80
> #2 0x000d17e0 in core_output_filter (f=0x3c3f10, b=0x3f4130) at core.c:4047
>
> the p = NULL bit looks like the problem, unless that's a side-effect of stack
> corruption. In gdb if you "up 2" into core_output_filter, what does
>
> p *f
> p *f->conn

Indeed - I meant to mention that's what seems to be the issue (though how it's happened...). Anyway:

(gdb) up 2
#4  0x0007c494 in chunk_filter (f=0x3df960, b=0x3f0b70) at http_core.c:218
218         rv = ap_pass_brigade(f->next, b);
(gdb) p *f
$1 = {frec = 0x1cd638, ctx = 0x0, next = 0x3c3f10, r = 0x3c9f30, c = 0x3c3b80}
(gdb) p *f->conn
There is no member named conn.

But ITYM:

(gdb) p *f->c
$2 = {pool = 0x0, base_server = 0x0, vhost_lookup_data = 0x0, local_addr = 0x3c3ae0,
  remote_addr = 0x0, remote_ip = 0x3c3ed8 "85.164.244.5", remote_host = 0x0,
  remote_logname = 0x0, aborted = 0, keepalive = AP_CONN_CLOSE, double_reverse = 0,
  keepalives = 0, local_ip = 0x3c3ec8 "212.187.153.30", local_host = 0x0, id = 14,
  conn_config = 0x3c3bd8, notes = 0x3c3d70, input_filters = 0x3c3ef8,
  output_filters = 0x3c3f10, sbh = 0x3c1ac8, bucket_alloc = 0x3c7ef0}

> If the pool really is NULL this could just be a symptom of hitting an
> out-of-memory condition - is that plausible for this system? (httpd should be
> calling abort() in that case but actually doesn't)

Incredibly unlikely. top claims:

load averages: 1.29, 1.38, 1.21    16:37:17
124 processes: 123 sleeping, 1 on cpu
CPU states: 50.0% idle, 0.0% user, 50.0% kernel, 0.0% iowait, 0.0% swap
Memory: 2048M real, 908M free, 768M swap in use, 1727M swap free

vmstat gives similar numbers. The system is only that busy because we're deliberately overloading that server in an attempt to reproduce the problem.
One other thing worth mentioning: another server that has the same Apache binary and mostly the same modules loaded has had no errors - but that one does not have an NFS docroot (it serves purely dynamic requests). Cheers
It would be useful to monitor memory use over time, to be able to correlate the child crashes with memory exhaustion (using sar or just vmstat, for example). Otherwise, some ideas to make progress:

- try an "--enable-pool-debug" build, if possible; such a build will be much slower and consume much more memory overall, but might help track the issue down
- add unsubtle debugging hacks like "if (!f->c->pool) abort()" in strategic places, such as the top of core_output_filter, if that particular failure mode recurs
- check the r->uri in case this is an issue specific to a particular resource, or at least, to localize testing against some specific resources
> It would be useful to monitor memory use over time to be able to correlate the
> child crashes with memory exhaustion. (using sar or just vmstat, for example)

We do have monitoring systems which should pick up that kind of event, but I now have sar running as well.

> - try an "--enable-pool-debug" build, if possible; such a build will be much
> slower and consume much more memory overall but might help tracking the issue
> down

This is being done. We will have to check the memory usage, but slower shouldn't matter as we're only testing on one server out of a group.

> - check the r->uri in case this an issue specific to a particular resource, or
> at least, to localize testing against some specific resources.

Yeah, we tried this before. We managed to recreate a couple of requests from the cores without debugging symbols (using adb and some knowledge of the structures involved) and replay those without problem. Having inspected the more recent cores (we had a few more overnight), there's still no pattern. We suspected cookies at one point (as that's often where testing doesn't mirror real traffic), but again they look normal. A couple seem to have happened outside request handling (eg: in apr_pool_clear called from the loop in child_main).

As I mentioned, we've got some more backtraces now. There hasn't been a repeat of the NULL pool error - they're more similar to the first backtrace I attached. One common thing I've noticed is that whilst the errors are happening in a few places (though mostly within apr_pools.c or apr_buckets_alloc.c), they mostly seem to involve the value of a pointer to an apr_memnode_t being a bad address.
For example:

(gdb) bt
#0  0xff1dedb4 in allocator_alloc (allocator=0x3c0ef0, size=8192) at apr_pools.c:219
#1  0xff1dd6d8 in apr_pool_create_ex (newpool=0x3c4044, parent=0x3c3a70, abort_fn=0, allocator=0x3c0ef0) at apr_pools.c:804
#2  0x000d1f18 in core_output_filter (f=0x3c3f10, b=0x3dc8f0) at core.c:4182
#3  0x000c14bc in ap_pass_brigade (next=0x3c3f10, bb=0x3dc8f0) at util_filter.c:512
#4  0x00081308 in ap_http_header_filter (f=0x3d2838, b=0x3dc868) at http_protocol.c:1668
#5  0x000c14bc in ap_pass_brigade (next=0x3d2838, bb=0x3dc868) at util_filter.c:512
...

Line 219 of apr_pools.c being:

    if ((*ref = node->next) == NULL && i >= max_index) {

So:

(gdb) p node
$3 = (apr_memnode_t *) 0x4f6a5536
(gdb) p *node
Cannot access memory at address 0x4f6a5536

If you'd like the full backtrace of that, or indeed the other new backtraces, just let me know.
We've now run with --enable-pool-debug... and got no errors at all! That server has been running for over four days without a single abnormal exit. Meanwhile we set another server up without pool debugging - but with an otherwise identical configuration and set of modules - and that continues to have bus errors and seg faults (a few a day).

So it would seem that something in the debugging code is either catching the problem or making it go away. At a quick glance, the debug code appears to abort() when it sees a problem - which we haven't seen happen... so I don't think that's it. In which case it's either doing something correctly where the non-debugging code goes wrong, or it's changing the program's behaviour in such a way that the error conditions don't arise.

I'm going to have a look at how pool initialisation/destruction differs between the debugging and non-debugging code paths, but if you've got any insights on what to try or look into, they would be much appreciated.
Just an update to say that we eventually ran the pool-debug build for a week without any errors. Regular builds continue to see a small number of errors per day. I also now have a coredump from one of our other servers which has the same Apache binary, but only our simplest modules loaded. We see fewer errors there - though they have a lot less traffic to deal with. As those machines don't use shared memory the file is only 2Mb - so I can upload it no problem if it would be of any use. The backtrace appears to be identical to the one outlined in comment 7.
I believe this problem is still being seen. However, it occurs in such small numbers (given the number of machines and requests involved) that it is not of sufficient importance to warrant ongoing investigation. As you can see, there hasn't been an update to this in two years, and I see that as unlikely to change. Additionally, I am no longer involved with that system, so can make no further personal contribution. Hence, I am closing this bug. In any case, the servers in question are undergoing a series of updates (OS, configuration, major Apache version, MPM, etc.) which will make comparisons impossible. If the appropriate people still see problems after that, they can reopen this bug or file new ones as appropriate.