In the current code, no timeout is specified when the JobTracker (JobEndNotifier) calls the notification URL. If the URL points to a server that does not respond for a long time, all job notifications get stuck behind that call, because a single thread processes the whole notification queue. We've seen this cause noticeable delays in job execution in components that rely on job end notifications (like Oozie workflows).
I propose we introduce a configurable timeout option for the notification call and set its default to a reasonably small value.
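
To make the idea concrete, here is a rough sketch of a notification call with a timeout, using plain java.net.HttpURLConnection. This is not the actual JobEndNotifier code: the class name, the method name, and the 5 second default are placeholders I made up for illustration, and the real patch would read the value from the job configuration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not the existing JobEndNotifier implementation.
class NotificationTimeoutSketch {
    // Placeholder default; the actual value and config key are still to be decided.
    static final int DEFAULT_TIMEOUT_MS = 5000;

    /**
     * Calls the given HTTP URL and returns true only if it answers with
     * HTTP 200 before the timeout expires. Assumes an http/https URL.
     */
    static boolean notifyUrl(String url, int timeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            // Bound both the TCP connect and the wait for the response.
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            // SocketTimeoutException is an IOException, so a silent
            // server ends up here instead of blocking the thread.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

With both timeouts set, a server that accepts the connection but never replies costs the notification thread at most roughly the configured timeout per attempt, instead of blocking it indefinitely.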
If we want, we can also make the number of workers processing the notification queue configurable (not sure if this is needed at this point, though).
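
If we do go that way, the worker part could look roughly like the sketch below, which uses a fixed-size thread pool from java.util.concurrent. Again, this is only an illustration: the class and method names are hypothetical, and the current code uses its own queue and a single thread rather than an ExecutorService.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a multi-worker notification queue.
class NotificationWorkersSketch {
    private final ExecutorService pool;

    // numWorkers would come from a (not yet defined) configuration option.
    NotificationWorkersSketch(int numWorkers) {
        pool = Executors.newFixedThreadPool(numWorkers);
    }

    // Queue one notification; any free worker picks it up.
    void enqueue(Runnable notification) {
        pool.submit(notification);
    }

    // Stop accepting work and wait for queued notifications to finish.
    boolean shutdownAndWait(long timeoutMs) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

With more than one worker, a single slow endpoint only ties up one thread while the others keep draining the queue. It does not replace the timeout, since enough slow endpoints could still occupy every worker.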
I will prepare a patch soon. Please comment back.