Hadoop Map/Reduce
MAPREDUCE-134

TaskTracker startup fails if any mapred.local.dir entries don't exist

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: ~30 node cluster, various size/number of disks, CPUs, memory

    Description

      This appears to have been introduced with the "check for enough free space" step that now runs before startup.

      It's debatable how best to fix this bug. I will submit a patch that ignores directories for which the DF utility fails. This lets me continue operating my cluster (where the number of drives varies, so mapred.local.dir contains entries for drives that aren't present on all cluster nodes), but a cleaner solution is probably better. I'd lean towards "check for existence", and ignore a dir if it doesn't exist, rather than depending on DF to fail, since DF can fail for other reasons without the node being out of disk space. I argue that a TaskTracker should start up if all the directories in the list that can be written to have enough space. Otherwise, a single failed drive per cluster machine means no work ever gets done.
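      A minimal sketch of the behavior argued for here, using plain java.io.File checks as a stand-in for Hadoop's DF utility; the class, method names, and logging below are illustrative, not the actual TaskTracker code:

      import java.io.File;
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;

      public class LocalDirFilter {

        // Hypothetical stand-in for the DF utility, which can fail for
        // reasons other than low disk space (e.g. a missing mount).
        static long freeSpace(File dir) throws IOException {
          if (!dir.isDirectory()) {
            throw new IOException("df failed for " + dir);
          }
          return dir.getUsableSpace();
        }

        // Keep only the mapred.local.dir entries that exist, are writable,
        // and can be probed; log and skip the rest instead of aborting.
        static List<File> usableLocalDirs(String[] configuredDirs) {
          List<File> usable = new ArrayList<File>();
          for (String path : configuredDirs) {
            File dir = new File(path);
            try {
              if (dir.canWrite() && freeSpace(dir) > 0) {
                usable.add(dir);
              } else {
                System.err.println("Ignoring unusable local dir: " + path);
              }
            } catch (IOException e) {
              System.err.println("DF failed for " + path + ", skipping: " + e);
            }
          }
          return usable;
        }
      }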

    Attachments

      1. fix-freespace-tasktracker-failure.txt (0.8 kB, Bryan Pendleton)
      2. fix.tasktracker.localdirs.patch.txt (3 kB, Bryan Pendleton)

    Issue Links

      • MAPREDUCE-2413

    Activity

          Bryan Pendleton added a comment -

          Here's the patch. As mentioned, it will just punt (and log) if anything goes wrong with the DF check for a specified directory. I believe this is a reasonable (but not as clean as desirable) fix for now.

          Doug Cutting added a comment -

          I think a better way to fix this is to change the checkLocalDirs method to return the list of valid, writable local directories. Then enoughFreeSpace() should iterate over this list. That will ensure that all writable local drives have enough space. Does this make sense?
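          A rough sketch of the refactor described in this comment, with simplified signatures (the real methods live in TaskTracker and take different arguments); the 1 GB threshold is illustrative only:

          import java.io.File;
          import java.util.ArrayList;
          import java.util.List;

          public class TaskTrackerDirs {
            static final long MIN_SPACE = 1L << 30; // illustrative 1 GB floor

            // Return the subset of configured dirs that exist and are writable.
            static List<File> checkLocalDirs(String[] configured) {
              List<File> good = new ArrayList<File>();
              for (String path : configured) {
                File dir = new File(path);
                if (dir.isDirectory() && dir.canWrite()) {
                  good.add(dir);
                }
              }
              return good;
            }

            // True only when every writable local dir has at least MIN_SPACE
            // free, so missing drives no longer block startup.
            static boolean enoughFreeSpace(List<File> goodDirs) {
              for (File dir : goodDirs) {
                if (dir.getUsableSpace() < MIN_SPACE) {
                  return false;
                }
              }
              return !goodDirs.isEmpty();
            }
          }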

          Bryan Pendleton added a comment -

          Yeah, that sounds like a better approach. I'd be happy to implement that in the patch instead, modulo a dangling issue:

          Should "good dirs" (ie, the new return value for checkLocalDirs) be cached? Implication: after initialization, no further checking for writability of a directory, and the directory list can only get smaller during an instance of a daemon. The alternative is, as I'm seeing with my current patch, a lot of extraneous log output that isn't really valuable.

          Doug Cutting added a comment -

          Yes, let's cache the "good dirs". If a drive goes offline or becomes unwritable while a node is running, then we should start emitting warnings, but we should not warn more than once for drives that are offline or unwritable at startup.
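          A small sketch of that caching behavior, under the same simplified assumptions as above: directories that are bad at startup are logged once and dropped, while a cached dir that later goes bad triggers a warning on each check. Field and method names are again illustrative:

          import java.io.File;
          import java.util.ArrayList;
          import java.util.List;

          public class CachedLocalDirs {
            private final List<File> goodDirs = new ArrayList<File>();

            // Warn exactly once, at startup, for dirs that are missing or
            // unwritable; only the usable ones enter the cache.
            CachedLocalDirs(String[] configured) {
              for (String path : configured) {
                File dir = new File(path);
                if (dir.isDirectory() && dir.canWrite()) {
                  goodDirs.add(dir);
                } else {
                  System.err.println("Ignoring local dir at startup: " + path);
                }
              }
            }

            // Re-check only the cached dirs; warn when one has gone offline
            // or unwritable while the daemon is running.
            List<File> currentGoodDirs() {
              List<File> stillGood = new ArrayList<File>();
              for (File dir : goodDirs) {
                if (dir.isDirectory() && dir.canWrite()) {
                  stillGood.add(dir);
                } else {
                  System.err.println("Local dir went bad while running: " + dir);
                }
              }
              return stillGood;
            }
          }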

          Bryan Pendleton added a comment -

          Here's my currently-deployed code fixing this bug. I may not be getting to work with Hadoop clusters much in my next position, so, unfortunately, this is as-is with no test case. It is up-to-date and working against the 0.13.0 branch.

          Without this patch, listing non-existent directories in mapred.local.dir still makes TaskTracker startup fail. This is still a pretty severe bug.

          Harsh J added a comment -

          This was fixed by the superseding issue that's been linked here.

          Harsh J added a comment -

          I meant MAPREDUCE-2413.

    People

    • Assignee: Ravi Gummadi
    • Reporter: Bryan Pendleton
    • Votes: 0
    • Watchers: 2
