Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: None
    • Component/s: tasktracker
    • Labels: None

      Description

      Chmod'ing one of the mapred local directories so that it is not executable causes the TT to fail to start. Doing this after the TT has started results in a TT that is up but cannot execute tasks.
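
      A minimal repro sketch, assuming the directory layout from the log below; java.io.File#setExecutable here is just a stand-in for running chmod a-x on the directory by hand:

      import java.io.File;

      public class MakeLocalDirNonExecutable {
        public static void main(String[] args) {
          // Path is illustrative; substitute any configured mapred.local.dir entry.
          File dir = new File("/home/eli/src/hadoop1/dirs/mapred/local1");
          // Clear the execute bit for all users, like `chmod a-x <dir>`;
          // the TT should then fail to start as described above.
          if (!dir.setExecutable(false, false)) {
            System.err.println("Could not clear execute bit on " + dir);
          }
        }
      }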

          Activity

          Eli Collins added a comment -

          TaskTracker#initialize and DefaultTaskController#initializeJob can't handle mkdirs failing.

          2011-08-31 17:58:58,795 INFO org.apache.hadoop.mapred.TaskTracker: Starting tasktracker with owner as eli
          2011-08-31 17:58:58,796 INFO org.apache.hadoop.mapred.TaskTracker: Good mapred local directories are: /home/eli/src/hadoop1/dirs/mapred/local1,/home/eli/src/hadoop1/dirs/mapred/local2
          2011-08-31 17:58:58,799 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: /home/eli/src/hadoop1/dirs/mapred/local1/taskTracker to 0755
          at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:526)
          at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:500)
          at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319)
          at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183)
          at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:736)
          at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1463)
          at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3620)
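
          A minimal sketch of the defensive handling this implies; this is not the actual TaskTracker code, and localFs, localDirs, and TT_SUBDIR are illustrative names:

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class LocalDirInit {
            private static final String TT_SUBDIR = "taskTracker"; // illustrative name

            // Drop any local dir whose mkdirs fails instead of aborting startup.
            static List<String> initGoodDirs(FileSystem localFs, String[] localDirs) {
              List<String> good = new ArrayList<String>();
              for (String dir : localDirs) {
                try {
                  if (localFs.mkdirs(new Path(dir, TT_SUBDIR))) {
                    good.add(dir);
                  }
                } catch (IOException ioe) {
                  System.err.println("Skipping bad local dir " + dir + ": " + ioe);
                }
              }
              return good;
            }
          }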

          Eli Collins added a comment -

          E.g. if you've got a cluster with all hosts running and 10 hosts have failed local dirs, restarting the cluster will leave those 10 hosts down unless you first unconfigure the failed directory on each of them. That requires identifying those 10 hosts and modifying just their config, which admins don't want to do.

          Ravi Gummadi added a comment -

          The existing behavior has the advantage that it avoids the following issue:
          If the TT starts up with even a single good disk/mapredLocalDir, ignoring all other bad disks, then that node can run into IO contention on that single disk from all tasks, because we are not reducing the number of slots on this TT based on bad disks.
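
          A hypothetical sketch of the slot scaling this argument assumes is missing (nothing like this exists in the code under discussion):

          // Hypothetical: offer fewer slots when some configured local dirs
          // are bad, so all tasks don't pile onto a single remaining disk.
          public class SlotScaling {
            static int scaledSlots(int configuredSlots, int goodDirs, int totalDirs) {
              if (totalDirs <= 0 || goodDirs <= 0) {
                return 0; // no usable local dirs: offer no slots at all
              }
              // Round up so a single good dir still yields at least one slot.
              return (configuredSlots * goodDirs + totalDirs - 1) / totalDirs;
            }
          }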

          Eli Collins added a comment -

          Wasn't the point of MAPREDUCE-2413 to "handle disk failures at both startup and runtime"?

          If the TT starts up with even a single good disk/mapredLocalDir, ignoring all other bad disks, then that node can run into IO contention on that single disk from all tasks, because we are not reducing the number of slots on this TT based on bad disks.

          Per MAPREDUCE-2924 we should only handle a configurable # of failures, e.g. you could prevent it from starting up unless at least N local dirs are OK.
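
          A minimal sketch of such a guard; the method and threshold are illustrative, not an existing API:

          import java.io.File;
          import java.io.IOException;

          public class StartupDirGuard {
            // Hypothetical startup check: refuse to start the TT unless at
            // least minGoodDirs of the configured local dirs are usable.
            static void check(String[] localDirs, int minGoodDirs) throws IOException {
              int good = 0;
              for (String d : localDirs) {
                File f = new File(d);
                if (f.isDirectory() && f.canRead() && f.canWrite() && f.canExecute()) {
                  good++;
                }
              }
              if (good < minGoodDirs) {
                throw new IOException("Only " + good
                    + " usable local dirs, need at least " + minGoodDirs);
              }
            }
          }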

          The same rationale applies at runtime, btw! This is my point in MAPREDUCE-2413, but you and Owen seem to be making the case that it's OK to have a TT running with lots of slots, few functioning disks, and no DN. I don't see why that's not OK at startup but is OK once the TT is running.

          Eli Collins added a comment -

          It turns out the TT will successfully start if there's a failed local directory (it checks the dirs and removes any that fail), so it will start up fine with a failed or read-only directory, etc. The failure I described above is because checkDir doesn't fail if the directory exists but is not executable; once that's fixed, the TT will start up in that case as well.
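
          A minimal sketch of the missing check, in the style of DiskChecker#checkDir but not the actual code; the surrounding read/write checks are simplified, and the canExecute test is the point:

          import java.io.File;
          import java.io.IOException;

          public class DirCheck {
            static void checkDir(File dir) throws IOException {
              // mkdirs returns false when the dir already exists, so only
              // fail here if it neither exists nor could be created.
              if (!dir.mkdirs() && !dir.isDirectory()) {
                throw new IOException("Cannot create directory: " + dir);
              }
              if (!dir.canRead()) {
                throw new IOException("Directory is not readable: " + dir);
              }
              if (!dir.canWrite()) {
                throw new IOException("Directory is not writable: " + dir);
              }
              // The check this issue hinges on: an existing dir whose execute
              // bit is clear must also be treated as bad (HADOOP-7818).
              if (!dir.canExecute()) {
                throw new IOException("Directory is not executable: " + dir);
              }
            }
          }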

          Eli Collins added a comment -

          Filed HADOOP-7818 for making DiskChecker#checkDir fail if the directory is not executable. Closing this as a dupe of that.


            People

            • Assignee: Eli Collins
            • Reporter: Eli Collins
            • Votes: 2
            • Watchers: 5
