Affects Version/s: None
Fix Version/s: 0.12.0
Localizing refers to downloading of resources that a container needs to execute. This could include executables (binaries, jar files etc.) or other resource files that a container needs when it runs. The NM interacts with the HttpFileSystem to fetch the resources.
When there are flaky connection issues to the HttpFileSystem, we should graciously fail localizing with a timeout (instead of hanging the localizing phase forever). At LinkedIn, we have encountered issues with several jobs in our cluster hanging indefinitely. This error is very subtle because Yarn localization happens in a separate process called "ContainerLocalizer".
Based on investigation here are the relevant stack traces:
Investigating heap dumps of the NM and the state of its data-structures revealed a hung socket.
The NM threads that consume the STDOUT and STDERR of the ContainerLocalizer are blocked waiting for the ContainerLocalizer to finish download. (This is not surprising since the pipe with the child process has not yet closed and there is no new data to read).
The fix is as follows:
Fix the HttpFileSystem to provide timeouts for read calls. The socket time out will cause the NM to shutdown the ContainerLocalizer. This will cause the NM thread stuck on reading from the STDOUT of ContainerLocalizer to be interrupted (since the other end of the pipe is now closed). It will later trigger an AM notification for a killed container and the AM can make a new request to the RM for that container.
The fix must be tested carefully since this is on the critical path of every single container request.