To address your various points:
The program attempts to deduce its configuration directory from argv[0]. Given that task-controller should live in $HADOOP_HOME/bin/, it does some surgery on the program name, stripping the "bin/task-controller" suffix and appending "conf", yielding $HADOOP_HOME/conf. Unfortunately, this presumes that argv[0] is the absolute path to task-controller. If it is not, the calculation produces a negative offset, and the program attempts to malloc a negative amount of space, which causes a SIGSEGV. This change checks that task-controller was invoked with its absolute path, and not just something like './task-controller'.
Hadoop may invoke task-controller via its absolute path, but when I was unit testing it, I just ran './task-controller', which was causing it to segfault. This should exit politely instead.
Now that the semantics have changed, I am not very sure we want to use the same configuration property for sleeping after dump-stack. (Thinking aloud..) Do we even need a sleep here? The signalling order is SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their arrival? If so, then we will not need another sleep. If not, we may need a sleep here, though it may or may not be driven by the same config item. What do you think?
I'm honestly not sure. I believe that the operating system will raise the signals to the process in order. However, what it means to "handle" a signal is ambiguous. It may be the case that the signal handler for SIGQUIT merely sets a flag which tells another thread to print the stack dumps to stdout at some other convenient interval (e.g., during a gc when the world is stopped). So while SIGQUIT may be received by the JVM before SIGTERM, I don't take that as a guarantee that the data will reach stdout when we want it to. So I think the delay is necessary. As for adding a new config item... maybe wait until someone makes an issue of this? Is there any performance implication here?
Are we sure "PSPermGen" will always be there in the dump? Instead, how about passing our own TaskController that performs custom actions in TaskController.dumpStacks(), simplifying our verification that dump-stack is indeed called?
In the Sun JVM, the stack dump messages always end with a description of the heap, which includes this key string. I doubt this is standardized; JRockit or another JVM will probably use some other string. Testing with a MockTaskController is probably a good idea here.
The test time can be more than halved if we set max-map-attempts to one in both tests via conf.setMaxMapAttempts(1);
Good observation. Will do.
We need a similar test for LinuxTaskController to test stack-dump when multiple users are involved.
I'll look into this.
I'll handle the other, more minor suggestions you made as well.
Thanks for the review!