Hadoop YARN / YARN-5749

Fail to localize resources after health status for local dirs changed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

      Description

      HADOOP-13440 changed the FileContext#setUMask method so that the umask is no longer held in a local variable. It is now a global value, set by updating the configuration value "fs.permissions.umask-mode".

      Both LogWriter and ResourceLocalizationService may call this method and thereby change the global umask.
      After an application finishes, LogWriter sets the umask to "137" while uploading container logs. The global umask changes immediately and affects other services. In my case, after one of the local directories was marked as bad (because its used disk space exceeded the threshold defined by "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"), ResourceLocalizationService reinitialized the remaining local directories and changed their permissions from "drwxr-xr-x" to "drw-r-----" (the umask had changed from "022" to "137"). From then on, the NM always fails to localize resources because the local directories are not executable.
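
      The permission change above follows directly from the umask arithmetic. The following is a minimal standalone sketch in plain Java, not Hadoop code: the class and method names (UmaskDemo, applyUmask, toLsString) are hypothetical, and it assumes directories are requested with the default mode 0777 before the umask is applied.

```java
public class UmaskDemo {
    /** Applies a umask to the assumed default directory mode 0777. */
    public static int applyUmask(int umask) {
        return 0777 & ~umask;
    }

    /** Renders the nine permission bits as an ls-style directory string. */
    public static String toLsString(int mode) {
        String flags = "rwxrwxrwx";
        StringBuilder sb = new StringBuilder("d");
        for (int i = 0; i < 9; i++) {
            // Bit 8 is the owner-read bit, bit 0 is the other-execute bit.
            boolean set = (mode & (1 << (8 - i))) != 0;
            sb.append(set ? flags.charAt(i) : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Umask "022", the value in effect before the log upload.
        System.out.println(toLsString(applyUmask(0022)));
        // Umask "137", the value left behind by LogWriter.
        System.out.println(toLsString(applyUmask(0137)));
    }
}
```

      With umask "137" the owner execute bit is cleared, which is why the directory check reports "Directory is not executable".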

      Detailed logs are as follows:

      2016-10-19 15:36:32,650 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Disk Error Exception:
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not executable: /home/yangtao.yt/hadoop-data/nm-local-dir-2/nmPrivate
              at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:215)
              at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:190)
              at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:124)
              at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createPath(LocalDirAllocator.java:350)
              at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:412)
              at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
              at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
              at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:116)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:563)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1162)
      2016-10-19 15:36:32,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for nmPrivate/container_e26_1476858409240_0004_01_000005.tokens
              at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:441)
              at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
              at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
              at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:116)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:563)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1162)
      2016-10-19 15:36:32,652 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e26_1476858409240_0004_01_000005 transitioned from LOCALIZING to LOCALIZATION_FAILED
      

      In my opinion, the best way to solve this problem is for FileContext to remain compatible with its previous usage.
      Suggestions are welcome.
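
      To make the difference between the two behaviors concrete, here is a rough sketch in plain Java. It does not use the Hadoop API: SharedUmaskContext and LocalUmaskContext are hypothetical stand-ins, and a plain Map stands in for the shared configuration. Only the key "fs.permissions.umask-mode" and the values "022" and "137" come from this report.

```java
import java.util.HashMap;
import java.util.Map;

public class UmaskScopeSketch {
    static final String UMASK_KEY = "fs.permissions.umask-mode";

    /** Hypothetical context that writes the umask into the shared configuration. */
    static class SharedUmaskContext {
        private final Map<String, String> conf;
        SharedUmaskContext(Map<String, String> conf) { this.conf = conf; }
        void setUMask(String umask) { conf.put(UMASK_KEY, umask); }
        String getUMask() { return conf.getOrDefault(UMASK_KEY, "022"); }
    }

    /** Hypothetical context that keeps the umask in a per-instance field. */
    static class LocalUmaskContext {
        private String umask;
        LocalUmaskContext(Map<String, String> conf) {
            this.umask = conf.getOrDefault(UMASK_KEY, "022");
        }
        void setUMask(String umask) { this.umask = umask; }
        String getUMask() { return umask; }
    }

    /** Umask seen by a second context after the first sets "137" in shared storage. */
    public static String sharedCase() {
        Map<String, String> conf = new HashMap<>();
        SharedUmaskContext logWriter = new SharedUmaskContext(conf);
        SharedUmaskContext localizer = new SharedUmaskContext(conf);
        logWriter.setUMask("137");
        return localizer.getUMask();
    }

    /** Same sequence, but each context keeps its own umask. */
    public static String localCase() {
        Map<String, String> conf = new HashMap<>();
        LocalUmaskContext logWriter = new LocalUmaskContext(conf);
        LocalUmaskContext localizer = new LocalUmaskContext(conf);
        logWriter.setUMask("137");
        return localizer.getUMask();
    }

    public static void main(String[] args) {
        System.out.println("shared: " + sharedCase());
        System.out.println("local:  " + localCase());
    }
}
```

      In the shared case the second caller inherits "137", which mirrors the failure described above. In the per-instance case it keeps "022".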

              People

              • Assignee: Unassigned
              • Reporter: Tao Yang