|
Commited thus: http://svn.apache.org/viewvc?rev=590711&view=rev
Wouldn't using sigaction() instead of signal() be a better solution here?
I guess. It doesn't really matter much to me.
I think it's a matter of correctness. Calling signal() the way we have been (including after applying the patch) leaves open the possibility for one of the signals to get lost if it arrives in the brief window after the handler has been reset and before we've had a chance to install it again. Another reason why I suggested sigaction is the recent berakage in exec on Linux – see http://svn.apache.org/viewvc?view=rev&revision=590510
Unfortunately, though, we can't use sigaction unconditionally. When using EDG eccp in strict mode (with ourt pure C headers) the struct or the function aren't declared. Look at rev 590510 for how I had to hack around it there (and even that's not the end of it if we want the function to be reliable with EDG eccp on Linux). Well, with that in mind, the rw_waitpid() function is flawed. There is a potential missed signal window between waitpid() and sleep(). If the child process is reaped after waitpid() but before the sleep(), the signal is lost and we wait the entire sleep interval. If we want this to work reliably, we would need to avoid the sleep() if the child has exited. The only way I see to do that is to use alarm() and sigsuspend() to wait until I get a SIGALRM or SIGCHLD, but then we have the problem that there is only one SIGALRM.
I must be missing something because this doesn't seem like it should be that difficult of a problem to solve. Another option is using setjmp() before setting signal handler and longjmp() from signal handler.
Yes, but setjmp/longjmp don't solve the missed signal problem that we are talking about. Apparently Martin knows how to handle this, so I'm hoping he'll chime in.
You should be able to use a modified/simplified version of the algorithm from the exec utility (wait_for_child() in exec.cpp).
Btw., why is the SIGCHLD handler in process.cpp a no-op? That's no different than the default action for the signal, i.e., to ignore it. If you don't set a signal handler for SIGCHLD, then sleep() won't be interrupted when the signal is raised(). Thank you for the hint. I don't know why I dind't see it earlier. Regardless, the implementation there has the same problem that I mentioned above... There is only one alarm(), so tests that use rw_alarm() or alarm() need to be careful about using rw_waitpid().
I see.
Yes, there is only one alarm. It's not a problem for the exec program but it could be for some of our tests. We might want to at least try to preserve existing alarms, even if it's not reliable. Even the wait_for_child() code in exec.cpp has a timing window failure. If the call to waitpid() is interrupted for any reason other than to report success or timeout, there is a window in which the alarm signal can be missed before the next call to waitpid(). If that happens, then the waitpid() could hang indefintely.
So I'm attaching a new patch that attempts to do something a bit more like wait_for_child(), but I don't really see any reason that I shouldn't use signal() here instead of sigaction(). I only need to register for SIGALRM, and I don't really need to worry if the signal handler is restored or set to default automatically or not. That said, I'm thinking it is probably better to just use rw_alarm() and let the implementation of that function decide if it is appropriate to call signal() or sigaction(). Unfortunately I'm fuzzy on why we need to use sigaction() instead of signal() in this case, and most specifically on Linux [see comment from exec.cpp:466]. With all this discussion I'm afraid to assume that I know what I'm doing or that my testcases will expose whatever problem is being eluded to, so I've provided an implementation of rw_waitpid() that uses sigaction() on all non EDG C++ configurations. Input? 2007-11-06 Travis Vitek <vitek@roguewave.com>
If you think there's a problem with wait_for_child() in exec.cpp can you please open an issue for it?
As for using signal() vs sigaction(), I mentioned the problem with the former above: inadvertently switching back to using signal() caused the exec utility to hang on Linux. I read up on signal() on Linux and I think the problem was caused by Linux blocking the signal after the first call to the handler when _XOPEN_SOURCE is #defined. Check out this test case: $ cat t.c && gcc t.c && ./a.out && echo success && gcc -D_XOPEN_SOURCE t.c && ./a.out static volatile sig_atomic_t handler_called; int main () { This makes signal() prettty brittle to use on Linux, so sigaction() seems like a better choice. (POSIX also recommends it in favor of signal()). IMO that is the expected behavior. The man page for signal() and the xopen docs http://opengroup.org/onlinepubs/007908799/xsh/signal.html
Regardless, I don't have a problem with using sigaction() where it is available.It would also be nice if we would consistently use one or the other and stick to it, or even write a wrapper that does the right thing and use it. Phew. You're right, my mistake. I wasn't careful enough trying to create a simple test case. It must be something else. All I'm saying is that the utility would hang when attempting to kill 22.messages.mt after I accidentally changed sigaction() to signal() in rev 588734 (see http://www.nabble.com/22.locale.messages-hangs-on-Linux-tf4721753.html#a13508587
I don't remember the details and, unfortunately, the original author didn't provide them. All I remember is that we discussed the issue when we first ran into it and decided to deal with it by using sigaction(). Now that we've spent this much time discussing we might as well go all the way and figure out what this "glitch" is. Btw., I agree that we might as well use sigaction() wherever it's available, not just on Linux.
This is not a binary compatible issue.
Is the patch ready to be committed?
Unfortunately, the patch breaks EDG eccp – see
Martin, I'm unsure why you reopened this issue and added the above comment. I understand that my change broke the test library for the ultra strict environment, but it seems that if any merging back to 4.2.x needs to be done, it should be done from
I'm not sure exactly what I was thinking when I reopened it but I'm guessing it was just an attempt to make a record of the regression.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Attaching a testcase and patch.