Issue Details (XML | Word | Printable)

Key: STDCXX-625
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Travis Vitek
Reporter: Travis Vitek
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
C++ Standard Library

[IBM XLC++ 9.0/AIX 5.3] SIGSEGV in 0.process

Created: 31/Oct/07 01:14 AM   Updated: 06/Mar/08 05:11 PM
Component/s: Tests
Affects Version/s: 4.2.0
Fix Version/s: 4.2.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works stdcxx-625.patch 2007-11-06 11:26 PM Travis Vitek 7 kB
Text File Licensed for inclusion in ASF works test.cpp 2007-10-31 01:33 AM Travis Vitek 1 kB
Environment: HP-UX or AIX platforms

Patch Info: Patch Available
Severity: Incorrect Behavior
Resolved: 04/Feb/08 10:39 PM
Resolution Date: 06/Mar/08 05:11 PM


 Description  « Hide
It appears that when we re-register for the signal from within the user signal handler the user signal handler is immediately invoked again. This recursion causes a stack overflow and a crash. Here is the stack on AIX....
sig_handler(int)( = 20), line 528 in "process.cpp"
sigaction(??, ??, ??) at 0xd01fa358
signal(??, ??) at 0xd02759a8
sig_handler(int)( = 20), line 528 in "process.cpp"
sigaction(??, ??, ??) at 0xd01fa358
signal(??, ??) at 0xd02759a8
sig_handler(int)( = 20), line 528 in "process.cpp"
nsleep(??, ??) at 0xd01fa1d4
sleep(??) at 0xd02051d4
unnamed block $b656, line 553 in "process.cpp"
rw_waitpid(long,int*,int)(pid = 782442, result = 0x2ff226a0, timeout = 5), line 553 in "process.cpp"
join_test(long,bool)(pid = 782442, should_hang = false), line 51 in "0.process.cpp"
test_process_create1()(), line 109 in "0.process.cpp"
run_test(int,char**)(argc = 1, argv = 0x2ff22a60), line 276 in "0.process.cpp"
rw_vtest(int,char**,const char*,const char*,const char*,int(*)(int,char**),const char*,char*)(argc = 1, argv = 0x2ff22a60, file_name = "/amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp", clause = "0.process", comment = "", fun = 0x2000b85c, optstr = "|-child#0 |-timeout#", va = " "), line 1030 in "driver.cpp"
rw_test(int,char**,const char*,const char*,const char*,int(*)(int,char**),const char*,...)(argc = 1, argv = 0x2ff22a60, fname = "/amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp", clause = "0.process", comment = "", testfun = 0x2000b85c, optstr = "|-child#0 |-timeout#", ... = 0x20009608), line 1128 in "driver.cpp"
main(argc = 1, argv = 0x2ff22a60), line 299 in "0.process.cpp"

Here is the stack on HP-UX

#0  sig_handler (No.Identifier=18) at /amd/devco/vitek/stdcxx-trunk/tests/src/process.cpp:529
#1  <signal handler called>
#2  0x7b0086c0 in _sigvector+0x10 () from /usr/lib/libc.2
#3  0x7b00f114 in signalvector+0xac () from /usr/lib/libc.2
#4  0x7b00f020 in signal+0xa0 () from /usr/lib/libc.2
#5  0x187fc in sig_handler (No.Identifier=18) at /amd/devco/vitek/stdcxx-trunk/tests/src/process.cpp:529
#6  <signal handler called>
#7  0x7b0086c0 in _sigvector+0x10 () from /usr/lib/libc.2
#8  0x7b00f114 in signalvector+0xac () from /usr/lib/libc.2
#9  0x7b00f020 in signal+0xa0 () from /usr/lib/libc.2
#10 0x187fc in sig_handler (No.Identifier=18) at /amd/devco/vitek/stdcxx-trunk/tests/src/process.cpp:529
#11 <signal handler called>
#12 0x7b0086c0 in _sigvector+0x10 () from /usr/lib/libc.2
#13 0x7b00f114 in signalvector+0xac () from /usr/lib/libc.2
#14 0x7b00f020 in signal+0xa0 () from /usr/lib/libc.2
#15 0x187fc in sig_handler (No.Identifier=18) at /amd/devco/vitek/stdcxx-trunk/tests/src/process.cpp:529
#16 <signal handler called>
#17 0x7b0086c0 in _sigvector+0x10 () from /usr/lib/libc.2
#18 0x7b00f114 in signalvector+0xac () from /usr/lib/libc.2
#19 0x7b00f020 in signal+0xa0 () from /usr/lib/libc.2
#20 0x187fc in sig_handler (No.Identifier=18) at /amd/devco/vitek/stdcxx-trunk/tests/src/process.cpp:529
#21 <signal handler called>
#22 0x7b00a160 in __sigtimedwait_sys+0x10 () from /usr/lib/libc.2
#23 0x7b013c84 in sigtimedwait+0x6c () from /usr/lib/libc.2
#24 0x7afa2a50 in sleep+0xe8 () from /usr/lib/libc.2
#25 0x188f0 in rw_waitpid (pid=13825, result=0x7f7f0db0, timeout=5)
    at /amd/devco/vitek/stdcxx-trunk/tests/src/process.cpp:555
#26 0x15250 in join_test (pid=13825, should_hang=false) at /amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp:51
#27 0x15424 in test_process_create1 () at /amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp:109
#28 0x159b4 in run_test (argc=1, argv=0x7f7f08d4) at /amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp:276
#29 0x16d1c in rw_vtest (argc=1, argv=0x7f7f08d4, 
    file_name=0x8e090 "/amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp", clause=0x8e528 "0.process", 
    comment=0x8e08f "", fun=0x4001ed52 <run_test(int, char **)>, optstr=0x8e534 "|-child#0 |-timeout#", va=0x7f7f0a34)
    at /amd/devco/vitek/stdcxx-trunk/tests/src/driver.cpp:1030
#30 0x171ec in rw_test (argc=1, argv=0x7f7f08d4, fname=0x8e090 "/amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp", 
    clause=0x8e528 "0.process", comment=0x8e08f "", testfun=0x4001ed52 <run_test(int, char **)>, 
    optstr=0x8e534 "|-child#0 |-timeout#") at /amd/devco/vitek/stdcxx-trunk/tests/src/driver.cpp:1127
#31 0x15ad4 in main (argc=1, argv=0x7f7f08d4) at /amd/devco/vitek/stdcxx-trunk/tests/self/0.process.cpp:299


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Travis Vitek added a comment - 31/Oct/07 01:33 AM - edited
Note that this problem doesn't come up with a trivial testcase, I tried. It does happen when using SIGCHLD and sleep(). It appears that it is implementation defined what the signal mask is set to when the user signal handler is invoked. i.e. some platforms keep the same handler, some reset it back to the default, and some others temporarily set it to something else and then restore it after the handler has completed.

Attaching a testcase and patch.

2007-10-30  Travis Vitek  <vitek@roguewave.com>

	STDCXX-625
	* process.cpp (sig_handler): Don't re-register for signal from
	within signal handler to avoid stack overflow on HP-UX and AIX.
	(rw_waitpid): Set and restore signal handler inside loop so we
	can avoid using the signal handler to do it.

Farid Zaripov added a comment - 31/Oct/07 03:20 PM

Martin Sebor added a comment - 01/Nov/07 06:00 PM
Wouldn't using sigaction() instead of signal() be a better solution here?

Travis Vitek added a comment - 01/Nov/07 08:25 PM - edited
I guess. It doesn't really matter much to me.

Martin Sebor added a comment - 02/Nov/07 01:04 AM
I think it's a matter of correctness. Calling signal() the way we have been (including after applying the patch) leaves open the possibility for one of the signals to get lost if it arrives in the brief window after the handler has been reset and before we've had a chance to install it again. Another reason why I suggested sigaction is the recent berakage in exec on Linux – see http://svn.apache.org/viewvc?view=rev&revision=590510.

Unfortunately, though, we can't use sigaction unconditionally. When using EDG eccp in strict mode (with ourt pure C headers) the struct or the function aren't declared. Look at rev 590510 for how I had to hack around it there (and even that's not the end of it if we want the function to be reliable with EDG eccp on Linux).


Travis Vitek added a comment - 05/Nov/07 11:46 PM
Well, with that in mind, the rw_waitpid() function is flawed. There is a potential missed signal window between waitpid() and sleep(). If the child process is reaped after waitpid() but before the sleep(), the signal is lost and we wait the entire sleep interval. If we want this to work reliably, we would need to avoid the sleep() if the child has exited. The only way I see to do that is to use alarm() and sigsuspend() to wait until I get a SIGALRM or SIGCHLD, but then we have the problem that there is only one SIGALRM.

I must be missing something because this doesn't seem like it should be that difficult of a problem to solve.


Farid Zaripov added a comment - 06/Nov/07 04:33 PM
Another option is using setjmp() before setting signal handler and longjmp() from signal handler.

Travis Vitek added a comment - 06/Nov/07 05:03 PM
Yes, but setjmp/longjmp don't solve the missed signal problem that we are talking about. Apparently Martin knows how to handle this, so I'm hoping he'll chime in.

Martin Sebor added a comment - 06/Nov/07 06:10 PM
You should be able to use a modified/simplified version of the algorithm from the exec utility (wait_for_child() in exec.cpp).

Btw., why is the SIGCHLD handler in process.cpp a no-op? That's no different than the default action for the signal, i.e., to ignore it.


Travis Vitek added a comment - 06/Nov/07 06:38 PM
If you don't set a signal handler for SIGCHLD, then sleep() won't be interrupted when the signal is raised(). Thank you for the hint. I don't know why I dind't see it earlier. Regardless, the implementation there has the same problem that I mentioned above... There is only one alarm(), so tests that use rw_alarm() or alarm() need to be careful about using rw_waitpid().

Martin Sebor added a comment - 06/Nov/07 06:51 PM
I see.

Yes, there is only one alarm. It's not a problem for the exec program but it could be for some of our tests. We might want to at least try to preserve existing alarms, even if it's not reliable.


Travis Vitek added a comment - 06/Nov/07 11:26 PM
Even the wait_for_child() code in exec.cpp has a timing window failure. If the call to waitpid() is interrupted for any reason other than to report success or timeout, there is a window in which the alarm signal can be missed before the next call to waitpid(). If that happens, then the waitpid() could hang indefintely.

So I'm attaching a new patch that attempts to do something a bit more like wait_for_child(), but I don't really see any reason that I shouldn't use signal() here instead of sigaction(). I only need to register for SIGALRM, and I don't really need to worry if the signal handler is restored or set to default automatically or not.

That said, I'm thinking it is probably better to just use rw_alarm() and let the implementation of that function decide if it is appropriate to call signal() or sigaction(). Unfortunately I'm fuzzy on why we need to use sigaction() instead of signal() in this case, and most specifically on Linux [see comment from exec.cpp:466]. With all this discussion I'm afraid to assume that I know what I'm doing or that my testcases will expose whatever problem is being eluded to, so I've provided an implementation of rw_waitpid() that uses sigaction() on all non EDG C++ configurations.

Input?

2007-11-06 Travis Vitek <vitek@roguewave.com>

STDCXX-625

  • process.cpp (rw_waitpid): Rewrite to avoid sleep() and
    signal timing window. Use sigaction() instead of signal()
    where available.

Martin Sebor added a comment - 07/Nov/07 12:28 AM
If you think there's a problem with wait_for_child() in exec.cpp can you please open an issue for it?

As for using signal() vs sigaction(), I mentioned the problem with the former above: inadvertently switching back to using signal() caused the exec utility to hang on Linux. I read up on signal() on Linux and I think the problem was caused by Linux blocking the signal after the first call to the handler when _XOPEN_SOURCE is #defined.

Check out this test case:

$ cat t.c && gcc t.c && ./a.out && echo success && gcc -D_XOPEN_SOURCE t.c && ./a.out
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t handler_called;
void handle_signal (int signo) { ++handler_called; }

int main () {
signal (SIGCHLD, handle_signal);
raise (SIGCHLD);
assert (1 == handler_called);
raise (SIGCHLD);
assert (2 == handler_called);
return 0;
}
success
a.out: t.c:12: main: Assertion `2 == handler_called' failed.
Aborted

This makes signal() prettty brittle to use on Linux, so sigaction() seems like a better choice. (POSIX also recommends it in favor of signal()).


Travis Vitek added a comment - 07/Nov/07 02:07 AM
IMO that is the expected behavior. The man page for signal() and the xopen docs http://opengroup.org/onlinepubs/007908799/xsh/signal.html say that the signal handler is effectively set to SIG_DFL before the user signal handler is invoked, and it says nothing about the handler being restored. Your testcase fails reliably on every platform I tested [HP, Sun, AIX and Linux] and that is consistent with the behavior I remember from one of my first C projects in college.

Regardless, I don't have a problem with using sigaction() where it is available.It would also be nice if we would consistently use one or the other and stick to it, or even write a wrapper that does the right thing and use it.

Phew.


Martin Sebor added a comment - 07/Nov/07 03:54 AM
You're right, my mistake. I wasn't careful enough trying to create a simple test case. It must be something else. All I'm saying is that the utility would hang when attempting to kill 22.messages.mt after I accidentally changed sigaction() to signal() in rev 588734 (see http://www.nabble.com/22.locale.messages-hangs-on-Linux-tf4721753.html#a13508587). And if you read the diff for exec.cpp below for this change you'll even see a comment on line 466 that reads:

I don't remember the details and, unfortunately, the original author didn't provide them. All I remember is that we discussed the issue when we first ran into it and decided to deal with it by using sigaction(). Now that we've spent this much time discussing we might as well go all the way and figure out what this "glitch" is.


Martin Sebor added a comment - 07/Nov/07 03:56 AM
Btw., I agree that we might as well use sigaction() wherever it's available, not just on Linux.

Travis Vitek added a comment - 28/Nov/07 08:48 PM
This is not a binary compatible issue.

Martin Sebor added a comment - 21/Jan/08 07:04 PM
Is the patch ready to be committed?

Travis Vitek added a comment - 04/Feb/08 10:39 PM
Committed to trunk in r618385
Merged to 4.2.x in r618388

Martin Sebor added a comment - 20/Feb/08 01:00 AM
Unfortunately, the patch breaks EDG eccp – see STDCXX-735. I just committed a fix for it but it still needs to be merged out to 4.2.x.

Travis Vitek added a comment - 03/Mar/08 07:22 PM
Martin, I'm unsure why you reopened this issue and added the above comment. I understand that my change broke the test library for the ultra strict environment, but it seems that if any merging back to 4.2.x needs to be done, it should be done from STDCXX-735. I guess I should take this as a hint that you want me to verify and merge your change?

Martin Sebor added a comment - 05/Mar/08 07:57 PM
I'm not sure exactly what I was thinking when I reopened it but I'm guessing it was just an attempt to make a record of the regression.