Hadoop Map/Reduce
MAPREDUCE-2822

TaskTracker.offerService could handle IO and Remote Exceptions better

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The core offerService() loop has a try/catch wrapper that catches and processes exceptions. Most exceptions cause offerService() to return, which then triggers a sleep and restart in the main loop. But some exceptions are just logged and ignored, which may be inappropriate.

        Activity

        steve_l added a comment -

        Here are the specific problems that appear to exist, though I could of course be mistaken.

        1. All RemoteExceptions that are not an instance of DisallowedTaskTrackerException are ignored; nothing is even logged.

            return State.STALE;
          } catch (RemoteException re) {
            String reClass = re.getClassName();
            if (DisallowedTaskTrackerException.class.getName().equals(reClass)) {
              LOG.info("Tasktracker disallowed by JobTracker.");
              return State.DENIED;
            }

        2. All IOExceptions are logged, but not reported up; the service remains in the inner (sleepless) while loop.

          } catch (IOException except) {
            String msg = "Caught exception: " + except.getMessage();
            LOG.error(msg, except);
          }
        }

        This may not be the best way to handle network and IO problems.

        3. The code that checks for the system directory off the job service will throw an IOException if none is provided, an exception that will be caught and logged by the code in (2). If a JobTracker is returning null from getSystemDir(), then every TaskTracker that is bonded to it is going to spin: calling getSystemDir() on the server, logging the error and repeating, without any delay at all.

        I'm not sure what the ideal exception handling policy here should be, but what is there today has weaknesses. If the network is playing up, logging RemoteExceptions and maybe inserting delays would be good; if JobTracker.getSystemDir() is null then the clients should sleep longer in the hope that someone will fix the job tracker, rather than spinning.
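
        One possible shape for "log, then delay" is sketched below. This is only an illustration under stated assumptions, not the TaskTracker code and not a proposed patch: HeartbeatBackoff, Heartbeat, nextDelay(), sleepQuietly() and run() are all hypothetical names, and the State enum is a placeholder that merely echoes the states quoted above. The idea is that every failed pass is logged and followed by a sleep that doubles up to a cap, and after too many consecutive failures the loop gives up and returns so the outer loop can restart.

```java
import java.io.IOException;
import java.util.function.LongConsumer;

/** Hypothetical sketch; none of these names are taken from TaskTracker. */
final class HeartbeatBackoff {
    /** Placeholder states, loosely echoing the ones quoted in the comment above. */
    enum State { NORMAL, STALE, DENIED }

    /** Stand-in for one pass of the work done inside the service loop. */
    interface Heartbeat {
        State call() throws IOException;
    }

    /** Doubles the current delay, never exceeding maxMillis. */
    static long nextDelay(long currentMillis, long maxMillis) {
        return Math.min(maxMillis, currentMillis * 2);
    }

    /** Sleeps, restoring the interrupt flag if the thread is interrupted. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Calls heartbeat until it returns a state. Each IOException is logged
     * and followed by a growing pause; after maxFailures consecutive
     * failures (or an interrupt) the method returns STALE to the caller.
     */
    static State run(Heartbeat heartbeat, LongConsumer sleeper,
                     long initialMillis, long maxMillis, int maxFailures) {
        long delay = initialMillis;
        for (int failures = 1; failures <= maxFailures; failures++) {
            try {
                return heartbeat.call();
            } catch (IOException e) {
                // Log every failure instead of swallowing it silently.
                System.err.println("Heartbeat failed (" + failures
                        + " in a row), sleeping " + delay + " ms: " + e);
                sleeper.accept(delay);
                if (Thread.currentThread().isInterrupted()) {
                    break; // shutting down: stop retrying
                }
                delay = nextDelay(delay, maxMillis);
            }
        }
        return State.STALE;
    }
}
```

        A caller might use it as HeartbeatBackoff.run(oneHeartbeat, HeartbeatBackoff::sleepQuietly, 1000, 60000, 10), where oneHeartbeat is whatever wraps a single pass of the loop. Passing the sleeper in as a parameter keeps the policy testable without real waiting; whether doubling, a fixed longer sleep, or something else is the right policy for the null getSystemDir() case is left open here.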


          People

          • Assignee: Unassigned
          • Reporter: Steve Loughran
          • Votes: 0
          • Watchers: 3
