Apache Tez
TEZ-3435

WebUIService thread tries to use blacklisted disk, dies, and kills AM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Not A Bug
    • Affects Version/s: 0.8.4
    • Fix Version/s: None
    • Component/s: UI
    • Labels: None

    Description

      We recently hit an issue where certain Tez jobs died when scheduled on a node that had a broken disk. The disk had already been marked as broken and excluded by the YARN node manager. Other applications worked fine on that node; only Tez jobs died.

      The errors were ClassNotFoundExceptions for basic Hadoop classes, which should be available everywhere. After some investigation we found that the WebUIService thread, spawned by the AM, tried to use that broken disk. See the stack trace below; disk3 had been excluded by the node manager.

       [WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
      java.util.zip.ZipException: error in opening zip file
      	at java.util.zip.ZipFile.open(Native Method)
      	at java.util.zip.ZipFile.<init>(ZipFile.java:219)
      	at java.util.zip.ZipFile.<init>(ZipFile.java:149)
      	at java.util.jar.JarFile.<init>(JarFile.java:166)
      	at java.util.jar.JarFile.<init>(JarFile.java:130)
      	at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
      	at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
      	at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
      	at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
      	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      	at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
      	at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
      	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      	at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
      	at org.mortbay.jetty.Server.doStart(Server.java:224)
      	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      	at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
      	at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
      	at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
      	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
      	at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
      	at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
      

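      The ZipException at the top of that trace is what `java.util.jar.JarFile` throws when Jetty's TagLibConfiguration tries to open a jar that can no longer be read from the failing disk. A minimal reproduction (using a hypothetical corrupt temp file in place of the localized hadoop-mapreduce-client-core jar):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.jar.JarFile;
import java.util.zip.ZipException;

public class BrokenJarDemo {
    public static void main(String[] args) throws IOException {
        // Simulate a jar localized onto a failing disk: the file exists,
        // but its contents are garbage, much like a jar that can no
        // longer be read back correctly from broken hardware.
        File fake = File.createTempFile("broken", ".jar");
        fake.deleteOnExit();
        try (FileWriter w = new FileWriter(fake)) {
            w.write("not a zip archive");
        }
        try {
            // Same call as the top of the stack trace:
            // JarFile.<init> -> ZipFile.open fails on unreadable content.
            new JarFile(fake);
            System.out.println("opened");
        } catch (ZipException e) {
            System.out.println("ZipException caught");
        }
    }
}
```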
      This led to the ClassNotFoundExceptions and killed the AM. Interestingly enough, the DAGAppMaster was aware of the broken disk and did exclude it from its localDirs, which contain only the remaining disks of the node.

      [INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for applicationId=application_1472223062609_42648, attemptNum=1, AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538, userFromEnv=muhammad, cliSessionOption=true, pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001, localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648, logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
      

      This is actually quite an issue: in a large data center there are always some broken disks, and by chance your AM may be scheduled on one of these nodes.

      Summary: From my point of view, it looks as if the WebUIService thread does not properly take into account the local directories that are excluded by the node manager.
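      One conceivable guard, sketched here only as an illustration and not as what Tez actually does (`ClasspathFilter`, `isUsableJar`, and `usableEntries` are hypothetical names), would be to probe each classpath jar before handing it to the servlet container and drop entries that cannot be opened:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

public class ClasspathFilter {
    /** Returns true if the path is a jar that can actually be opened as a zip. */
    static boolean isUsableJar(String path) {
        File f = new File(path);
        if (!f.isFile()) return false;
        try (JarFile jar = new JarFile(f)) {
            return true;
        } catch (IOException e) { // includes ZipException from a bad disk
            return false;
        }
    }

    /** Keeps only usable classpath entries; directories pass through as-is. */
    static List<String> usableEntries(String classpath) {
        List<String> out = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.isEmpty()) continue;
            if (new File(entry).isDirectory() || isUsableJar(entry)) {
                out.add(entry);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A jar path on the excluded disk (hypothetical file name) plus a
        // known-good directory: only the latter survives filtering.
        String cp = "/volumes/disk3/yarn/nm/filecache/9017/missing.jar"
                + File.pathSeparator + System.getProperty("java.io.tmpdir");
        System.out.println(usableEntries(cp));
    }
}
```

      Whether such filtering belongs in the AM at all is debatable, since the classpath here comes from files localized by the node manager before the disk was blacklisted; this is only meant to make the failure mode concrete.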

      People

        Assignee: Unassigned
        Reporter: Michael Prim (mprim)
        Votes: 0
        Watchers: 2
