Bug 35974 - Occasional seg fault/bus error in NFS hosted includes-parsed files
Summary: Occasional seg fault/bus error in NFS hosted includes-parsed files
Status: RESOLVED WORKSFORME
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_include
Version: 2.0.54
Hardware: Sun Solaris
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-02 17:11 UTC by Stuart Children
Modified: 2007-09-01 16:03 UTC
CC List: 0 users



Attachments
patch in use against srclib/apr-util/buckets/apr_buckets_file.c (744 bytes, patch)
2005-08-02 17:16 UTC, Stuart Children
first backtrace - no debugging symbols (1.25 KB, text/plain)
2005-08-02 17:18 UTC, Stuart Children
'full' backtrace - with debugging symbols (14.87 KB, text/plain)
2005-08-02 17:20 UTC, Stuart Children

Description Stuart Children 2005-08-02 17:11:33 UTC
We're running Apache 2.0.54 on multiple Solaris 8 machines, using the prefork
MPM with mmap and sendfile turned off, sharing an NFS-mounted docroot hosted by a
NetApp. Occasionally child processes die with either a bus error or a
segmentation fault - it seems to take on the order of 1-2 hours of high-load
production traffic for this to happen. From the cores I've inspected, they all
seem to occur while a request is being parsed by mod_include (so I've marked that
as the component - though something else may be at fault).
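
For reference, the relevant pieces of such a configuration would look something
like the following (the directives are standard httpd 2.0 ones; the path is
illustrative only, not our actual setup):

# Illustrative excerpt only - not the production configuration.
# mmap and sendfile are turned off because they are risky on an NFS-backed
# docroot where files can be removed and replaced mid-request.
EnableMMAP Off
EnableSendfile Off

# NFS-mounted document root (hypothetical path)
DocumentRoot "/nfs/docroot"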

Having searched around, bug 34708 and bug 19325 look as if they might be related
to this. Certainly the files in our docroot do get modified. The process would
be for the file to be removed, and then recreated by another system. So Apache
might be getting upset by the file disappearing mid request. It's unlikely
(though I guess possible) that a file would be removed and recreated in time for
Apache to have one request on the original file and the subsequent request on
the new file (ie: without accessing it in the period of "no file"). My attempts
to recreate this scenario without production traffic have failed. So
it's been difficult to rule out particular system components (though we have
unloaded as much of our code as possible). Additionally, we are seeing some
errors on a couple of servers that have a different docroot (though still on
NFS) where files seldom change (or at least not at the times we've seen errors).

Based on comments in one of the bugs above, and looking at the head revision of
apr-util trunk, I've also patched apr_buckets_file.c (a two-liner, so that it
checks for APR_EOF) - I'll attach the diff. We still get errors (though the patch
doesn't appear to hurt, so I've left it in for now)...
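
For illustration, the kind of check described looks roughly like this in the
file bucket's read function (the attached diff is the authoritative change; treat
this as a sketch of the idea, not the exact patch):

/* Sketch only - see the attached diff for the actual change.
 * Treat APR_EOF from apr_file_read() as non-fatal, so a short read on a
 * file that has shrunk or hit end-of-file doesn't propagate an error up
 * the filter chain. */
rv = apr_file_read(f, buf, len);
if (rv != APR_SUCCESS && rv != APR_EOF) {
    apr_bucket_free(buf);
    return rv;
}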

I've got two backtraces which I'll attach. Unfortunately we weren't initially
compiling with debugging symbols, though we have now enabled that. Since
enabling, we've only had one error - otherwise I'd attach more.

I'd offer to upload the core file but it's 100MB as we have a huge chunk of
shared memory in use. Note that I don't think that is to blame because the other
servers don't use it (waiting for a segfault there so I can upload that
instead), and because the points of failure don't implicate it at all.

Any questions or things you'd like me to try - please ask. Really keen to get to
the bottom of this.
Comment 1 Stuart Children 2005-08-02 17:16:53 UTC
Created attachment 15850 [details]
patch in use against srclib/apr-util/buckets/apr_buckets_file.c
Comment 2 Stuart Children 2005-08-02 17:18:16 UTC
Created attachment 15851 [details]
first backtrace - no debugging symbols
Comment 3 Stuart Children 2005-08-02 17:20:38 UTC
Created attachment 15852 [details]
'full' backtrace - with debugging symbols
Comment 4 Joe Orton 2005-08-02 17:28:25 UTC
#1  0xff3676a0 in apr_brigade_create (p=0x0, list=0x3c7ef0) at apr_brigade.c:67
        b = (apr_bucket_brigade *) 0x3c8c80
#2  0x000d17e0 in core_output_filter (f=0x3c3f10, b=0x3f4130) at core.c:4047

the p = NULL bit looks like the problem, unless that's a side-effect of stack
corruption.  In gdb if you "up 2" into core_output_filter, what does

  p *f
  p *f->conn

give?

If the pool really is NULL this could just be a symptom of hitting an
out-of-memory condition - is that plausible for this system?  (httpd should be
calling abort() in that case but actually doesn't)
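
For what it's worth, a minimal sketch of how allocation failure could be made
fatal for debugging purposes - apr_pool_abort_set() is the standard APR hook, but
the helper names are hypothetical and this would need wiring in via a local patch
or module wherever the relevant pool is created:

#include <stdlib.h>
#include "apr_pools.h"

/* Hypothetical debugging aid: make pool allocation failure call abort()
 * immediately, so the core dump points at the out-of-memory site rather
 * than at a later NULL-pool dereference. */
static int oom_abort_fn(int retcode)
{
    abort();
    return retcode; /* not reached */
}

/* Register once on a suitable pool; child pools created from it inherit
 * the abort function. */
static void install_oom_abort(apr_pool_t *p)
{
    apr_pool_abort_set(oom_abort_fn, p);
}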
Comment 5 Stuart Children 2005-08-02 17:39:57 UTC
Quick response - thanks. :)

(In reply to comment #4)
> #1  0xff3676a0 in apr_brigade_create (p=0x0, list=0x3c7ef0) at apr_brigade.c:67
>         b = (apr_bucket_brigade *) 0x3c8c80
> #2  0x000d17e0 in core_output_filter (f=0x3c3f10, b=0x3f4130) at core.c:4047
> 
> the p = NULL bit looks like the problem, unless that's a side-effect of stack
> corruption.  In gdb if you "up 2" into core_output_filter, what does
> 
>   p *f
>   p *f->conn

Indeed - meant to mention that's what seems to be the issue (though how it's
happened...). Anyway:

(gdb) up 2
#4  0x0007c494 in chunk_filter (f=0x3df960, b=0x3f0b70) at http_core.c:218
218             rv = ap_pass_brigade(f->next, b);
(gdb) p *f
$1 = {frec = 0x1cd638, ctx = 0x0, next = 0x3c3f10, r = 0x3c9f30, c = 0x3c3b80}
(gdb) p *f->conn
There is no member named conn.

But ITYM:

(gdb) p *f->c
$2 = {pool = 0x0, base_server = 0x0, vhost_lookup_data = 0x0, local_addr =
0x3c3ae0, remote_addr = 0x0, 
  remote_ip = 0x3c3ed8 "85.164.244.5", remote_host = 0x0, remote_logname = 0x0,
aborted = 0, 
  keepalive = AP_CONN_CLOSE, double_reverse = 0, keepalives = 0, local_ip =
0x3c3ec8 "212.187.153.30", 
  local_host = 0x0, id = 14, conn_config = 0x3c3bd8, notes = 0x3c3d70,
input_filters = 0x3c3ef8, 
  output_filters = 0x3c3f10, sbh = 0x3c1ac8, bucket_alloc = 0x3c7ef0}

> If the pool really is NULL this could just be a symptom of hitting an
> out-of-memory condition - is that plausible for this system?  (httpd should be
> calling abort() in that case but actually doesn't)

Incredibly unlikely. top claims:

load averages:  1.29,  1.38,  1.21                                     16:37:17
124 processes: 123 sleeping, 1 on cpu
CPU states: 50.0% idle,  0.0% user, 50.0% kernel,  0.0% iowait,  0.0% swap
Memory: 2048M real, 908M free, 768M swap in use, 1727M swap free

vmstat gives similar numbers. The system is so busy because we're deliberately
overloading that server in an attempt to reproduce the problem.

One other thing worth mentioning is that another server with the same Apache
binary and mostly the same modules loaded has had no errors - but that server
does not have an NFS docroot (it serves purely dynamic requests).

Cheers
Comment 6 Joe Orton 2005-08-02 18:48:20 UTC
It would be useful to monitor memory use over time to be able to correlate the
child crashes with memory exhaustion.  (using sar or just vmstat, for example)

To progress, otherwise, some ideas:

- try an "--enable-pool-debug" build, if possible; such a build will be much
slower and consume much more memory overall, but it might help track the issue down

- add unsubtle debugging hacks like "if (!f->c->pool) abort()" in strategic
places such as the top of core_output_filter, if that particular failure mode
recurs (a sketch follows this list)

- check the r->uri in case this is an issue specific to a particular resource,
or at least to localize testing against some specific resources.
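
A minimal sketch of the second idea, assuming a throwaway local patch to
server/core.c (purely a debugging hack, not a proposed fix):

/* Hypothetical debugging hack: placed at the top of core_output_filter()
 * so the child dumps core as soon as the NULL connection pool is seen,
 * rather than crashing later inside apr_brigade_create(). */
if (f->c == NULL || f->c->pool == NULL) {
    abort();
}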
Comment 7 Stuart Children 2005-08-03 14:23:54 UTC
> It would be useful to monitor memory use over time to be able to correlate the
> child crashes with memory exhaustion.  (using sar or just vmstat, for example)

We do have monitoring systems which should pick that kind of event up, but I now
have sar running as well.

> - try an "--enable-pool-debug" build, if possible; such a build will be much
> slower and consume much more memory overall, but it might help track the issue
> down

This is being done. Will have to check the memory usage, but slower shouldn't
matter as we're only testing on one server out of a group.

> - check the r->uri in case this is an issue specific to a particular resource,
> or at least to localize testing against some specific resources.

Yeah, we tried this before. We managed to recreate a couple of requests from the
cores of the non-debug builds (using adb and some knowledge of the structures
involved), and replay them without problems. Having inspected the more recent
cores (we had a few more overnight) there's still no pattern. We suspected
cookies at one point (as that's often where testing doesn't mirror real
traffic), but again they look normal. A couple seem to have happened outside
request handling (eg: in apr_pool_clear called from the loop in child_main).

As I mentioned, we've got some more backtraces now. There hasn't been a repeat
of the NULL pool error - they're more similar to the first backtrace I attached.
One common thing I've noticed is that whilst the errors are happening in a few
places (though mostly within apr_pools.c or apr_buckets_alloc.c), they mostly
seem to involve the value of a pointer to an apr_memnode_t being a bad address.
For example:

(gdb) bt
#0  0xff1dedb4 in allocator_alloc (allocator=0x3c0ef0, size=8192) at apr_pools.c:219
#1  0xff1dd6d8 in apr_pool_create_ex (newpool=0x3c4044, parent=0x3c3a70,
abort_fn=0, allocator=0x3c0ef0) at apr_pools.c:804
#2  0x000d1f18 in core_output_filter (f=0x3c3f10, b=0x3dc8f0) at core.c:4182
#3  0x000c14bc in ap_pass_brigade (next=0x3c3f10, bb=0x3dc8f0) at util_filter.c:512
#4  0x00081308 in ap_http_header_filter (f=0x3d2838, b=0x3dc868) at
http_protocol.c:1668
#5  0x000c14bc in ap_pass_brigade (next=0x3d2838, bb=0x3dc868) at util_filter.c:512
...

Line 219 of apr_pools.c being:
    if ((*ref = node->next) == NULL && i >= max_index) {
So:

(gdb) p node
$3 = (apr_memnode_t *) 0x4f6a5536
(gdb) p *node
Cannot access memory at address 0x4f6a5536

If you'd like the full backtrace of that, or indeed the other new backtraces,
just let me know.
Comment 8 Stuart Children 2005-08-08 11:57:51 UTC
We've now run with --enable-pool-debug... and got no errors at all! That server
has been running for over four days without a single abnormal exit. Meanwhile we
set another server up without pool debugging - but otherwise an identical
configuration and set of modules - and that continues to have bus errors and seg
faults (a few a day).

So it would seem that something in the debugging code is either catching the
problem or making it go away. At a quick glance, the debug code appears to
abort() when it sees a problem - which we haven't seen happen... so I don't think
that's the case. Either it's doing something correctly where the non-debugging
code goes wrong, or it's changing the program's behaviour in such a way that the
error conditions don't arise.

I'm going to have a look at how pool initialisation/destruction differs between
the debugging and non-debugging code paths, but if you've got any insights into
what to try or what to look at, they would be much appreciated.
Comment 9 Stuart Children 2005-08-30 17:18:56 UTC
Just an update to say that we eventually ran the pool-debug build for a week
without any errors. Regular builds continue to see a small number of errors per day.

I also now have a coredump from one of our other servers which has the same
Apache binary, but only our simplest modules loaded. We see fewer errors there -
though they have a lot less traffic to deal with. As those machines don't use
shared memory, the file is only 2MB - so I can upload it without any problem if
it would be of any use. The backtrace appears to be identical to the one outlined in
comment 7.
Comment 10 Stuart Children 2007-09-01 16:03:13 UTC
I believe this problem is still being seen. However, it occurs in such small
numbers (given the number of machines and requests involved) that it is not of
sufficient importance to warrant ongoing investigation. As you can see, there
hasn't been an update to this bug in two years, and I see that as unlikely to change.

Additionally, I am no longer involved with that system, so can make no further
personal contribution. Hence, I am closing this bug.

In any case, the servers in question are undergoing a series of updates (OS,
configuration, major Apache version, MPM, etc.) which will make comparisons
impossible. If the appropriate people still see problems after that, they can
reopen/file new bugs as appropriate.