  1. Samza
  2. SAMZA-1079

HttpFileSystem should time out blocking reads when localizing containers.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

      Description

      Localizing refers to downloading the resources that a container needs in order to execute. These can include executables (binaries, jar files, etc.) or other resource files that the container needs at runtime. The YARN NodeManager (NM) interacts with the HttpFileSystem to fetch these resources.

      When the connection to the HttpFileSystem is flaky, localization should fail gracefully with a timeout instead of hanging forever. At LinkedIn, we have seen several jobs in our cluster hang indefinitely. The error is subtle because YARN localization happens in a separate process called "ContainerLocalizer".

      Based on our investigation, here are the relevant stack traces:

      "ContainerLocalizer Downloader" #27 prio=5 os_prio=0 tid=0x00007fa8252f6000 nid=0x49b6 runnable [0x00007fa7b959d000]
         java.lang.Thread.State: RUNNABLE
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)...
          - locked <0x000000008022ca40> (a java.io.BufferedInputStream)
          at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:143)
          at java.io.FilterInputStream.read(FilterInputStream.java:83)...
          at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:88)
          at org.apache.samza.util.hadoop.HttpInputStream.read(HttpInputStream.scala:39)
          - locked <0x000000008022db10> (a java.lang.Object)...
          at java.lang.Thread.run(Thread.java:745)
      
      

      Investigating heap dumps of the NM and the state of its data structures revealed a hung socket.

      java      18781  app  206r  IPv6 zzz      0t0  TCP ltx1-appzzz.stg.linkedin.com:nnnn->ltx1-artifactory.xxx.linkedin.com:nnnn (ESTABLISHED)
      

      The NM threads that consume the STDOUT and STDERR of the ContainerLocalizer are blocked waiting for the ContainerLocalizer to finish the download. (This is not surprising, since the pipe to the child process has not yet closed and there is no new data to read.)

               "LocalizerRunner for container_e03_1481261762048_0541_02_000060" #2335967 prio=5 os_prio=0 tid=0x00007f993c913800 nid=0x4fa4 runnable [0x00007f9929d6f000]
      java.lang.Thread.State: RUNNABLE
           at java.io.FileInputStream.readBytes(Native Method)
           at java.io.FileInputStream.read(FileInputStream.java:255)..
           - locked <0x00000000c7185be0> (a java.lang.UNIXProcess$ProcessPipeInputStream)
           at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
           at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)..
           at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
           at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
      

      The fix is as follows:

      Fix the HttpFileSystem to provide timeouts for read calls. The socket timeout will cause the NM to shut down the ContainerLocalizer. This in turn interrupts the NM thread that is stuck reading from the STDOUT of the ContainerLocalizer (since the other end of the pipe is now closed). It will later trigger a notification to the ApplicationMaster (AM) for the killed container, and the AM can make a new request to the ResourceManager (RM) for that container.
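
      The effect of a read timeout on a stalled connection can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the actual patch: it uses a plain java.net.Socket rather than the HTTP client used by HttpFileSystem, and the class name ReadTimeoutSketch and the method readTimesOut are hypothetical. A local server accepts the TCP connection but never sends a byte, which imitates the ESTABLISHED-but-silent socket seen above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch: not part of Samza or of the patch for this issue.
public class ReadTimeoutSketch {

    /**
     * Connects to a local server that never sends data, then attempts a
     * blocking read with the given timeout. Returns true if the read was
     * aborted by a SocketTimeoutException.
     */
    static boolean readTimesOut(int readTimeoutMs) throws IOException {
        // Port 0 asks the OS for any free port. The server socket is never
        // accept()ed or written to, so the client connects but receives nothing.
        try (ServerSocket stalledServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(stalledServer.getInetAddress(),
                                        stalledServer.getLocalPort())) {
            // The default SO_TIMEOUT is 0 (infinite): without this call the
            // read below would block indefinitely, as in the stack trace.
            client.setSoTimeout(readTimeoutMs);
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived; no timeout
            } catch (SocketTimeoutException e) {
                return true;  // the blocking read was bounded by the timeout
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

      With a timeout in place, the blocked read surfaces as an exception that the caller can handle, instead of leaving the downloader thread parked in socketRead0.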

      The fix must be tested carefully since this is on the critical path of every single container request.

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user vjagadish opened a pull request:

          https://github.com/apache/samza/pull/42

          SAMZA-1079: Add timeouts for reads from HttpFileSystem. Add tests.

          • Wrote a unit/integration test to simulate a stuck connection when reading binaries for the job.
            Other misc. changes:
          • Moved some debug log messages to be info for better debugging.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/vjagadish1989/samza http-fs

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/samza/pull/42.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #42


          commit fcefa7a6cae4bb995b465b4e342f708a335e92b1
          Author: vjagadish1989 <jvenkatr@linkedin.com>
          Date: 2017-01-19T05:58:59Z

          SAMZA-1079: Add timeouts for reads from HttpFileSystem. Add unit tests.
          Other misc. changes:

          • Moved some debug log messages to be info for better debugging.

          jagadish1989@gmail.com Jagadish added a comment - edited

          Yi Pan (Data Infrastructure), Prateek Maheshwari, Chris Pettitt: updated with the patch here:

          https://github.com/apache/samza/pull/42

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/samza/pull/42

          jagadish1989@gmail.com Jagadish added a comment -

          Thank you Yi Pan (Data Infrastructure) for the review. Submitted!


            People

            • Assignee: jagadish1989@gmail.com Jagadish
            • Reporter: jagadish1989@gmail.com Jagadish
            • Votes: 0
            • Watchers: 2
