  1. Hadoop YARN
  2. YARN-91 DFIP aka 'NodeManager should handle Disk-Failures In Place'
  3. YARN-2566

DefaultContainerExecutor should pick a working directory randomly


Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.5.0
    • Fix Version/s: 2.6.0
    • Component/s: nodemanager
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      startLocalizer in DefaultContainerExecutor uses only the first localDir to copy the token file. If that copy fails because the first localDir does not have enough disk space, localization fails even when the other localDirs have plenty of free space. We see the following errors in this case:

      2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
      java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
      	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
      	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
      	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
      	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
      	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
      	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
      	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
      2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
      java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist
      	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
      	at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
      	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
      	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
      	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
      	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
      	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
      	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
      	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
      	at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
      	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
      	at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
      2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
      2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: LOCALIZATION_FAILED	APPID=application_1410663092546_0004	CONTAINERID=container_1410663092546_0004_01_000001
      2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZATION_FAILED to DONE
      2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1410663092546_0004_01_000001 from application application_1410663092546_0004
      2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1410663092546_0004_01_000001 for log-aggregation
      2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1410663092546_0004
      2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
      2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
      2014-09-13 23:33:25,188 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
      2014-09-13 23:33:25,188 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
      2014-09-13 23:33:25,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1410663092546_0004_01_000001
      2014-09-13 23:33:26,159 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed container container_1410663092546_0004_01_000001
      

      The correct behavior is: if an IOException occurs during the copy, try the next localDir; if the copy fails for all localDirs, throw an exception.
      I will create a patch to fix this issue.
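
      The fallback described above can be sketched as follows. This is a minimal illustration using java.nio.file rather than Hadoop's FileContext, and it is not the attached patch: the class TokenCopyFallback, the method copyToFirstWorkingDir, and the relativeTarget parameter are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper for illustration; not the actual YARN-2566 change. */
public final class TokenCopyFallback {

    private TokenCopyFallback() {}

    /**
     * Copies source to relativeTarget under the first local dir that accepts it.
     * relativeTarget is assumed to be a relative path such as "appcache/app_1/tokens".
     *
     * @return the path the file was copied to
     * @throws IOException if the copy fails in every local dir; the per-dir
     *         failures are attached as suppressed exceptions
     */
    public static Path copyToFirstWorkingDir(Path source, List<Path> localDirs,
                                             String relativeTarget) throws IOException {
        IOException failures = new IOException(
            "Copy of " + source + " failed in all local dirs: " + localDirs);
        for (Path localDir : localDirs) {
            Path target = localDir.resolve(relativeTarget);
            try {
                // Creating the parent directories may itself fail (full or bad disk).
                Files.createDirectories(target.getParent());
                return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                // Remember why this dir failed, then try the next one.
                failures.addSuppressed(e);
            }
        }
        throw failures;
    }
}
```

      Collecting the per-directory failures as suppressed exceptions keeps the final error message useful when every localDir is unusable. Shuffling localDirs before the loop (for example with Collections.shuffle) would be one way to also spread the load across disks, in the spirit of the issue title.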

      Attachments

        1. YARN-2566.000.patch
          10 kB
          Zhihai Xu
        2. YARN-2566.001.patch
          10 kB
          Zhihai Xu
        3. YARN-2566.002.patch
          11 kB
          Zhihai Xu
        4. YARN-2566.003.patch
          11 kB
          Zhihai Xu
        5. YARN-2566.004.patch
          13 kB
          Zhihai Xu
        6. YARN-2566.005.patch
          14 kB
          Zhihai Xu
        7. YARN-2566.006.patch
          14 kB
          Zhihai Xu
        8. YARN-2566.007.patch
          14 kB
          Zhihai Xu
        9. YARN-2566.008.patch
          14 kB
          Zhihai Xu

            People

              Assignee: Zhihai Xu (zxu)
              Reporter: Zhihai Xu (zxu)
              Votes: 0
              Watchers: 8
