Pig / PIG-2780

MapReduceLauncher should break early when one of the jobs throws an exception

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Right now MapReduceLauncher caches the job exception in jobControlException and only processes it when all the jobs are done:

        jcThread.setUncaughtExceptionHandler(jctExceptionHandler);
        ...
        jcThread.start();
        // Now wait, till we are finished.
        while(!jc.allFinished()){
        ...
        }
        //check for the jobControlException first
        //if the job controller fails before launching the jobs then there are
        //no jobs to check for failure
        if (jobControlException != null) {
          ...
        }
      

      There are two problems with this approach:
      1. There is only one jobControlException variable. If two jobs throw exceptions, the first one is overwritten and lost.
      2. If there are multiple jobs, an exception is not reported until the other jobs have finished, which wastes system resources.
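The first problem can be illustrated with a minimal sketch. It assumes nothing about the actual patch: `ExceptionCollector` and its methods are hypothetical names, and the idea is only that an uncaught-exception handler which appends to a thread-safe queue keeps every exception, whereas a single field keeps only the last one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical handler (not taken from the PIG-2780 patch). It stores every
// uncaught exception instead of overwriting a single field.
final class ExceptionCollector implements Thread.UncaughtExceptionHandler {
    // Thread-safe queue, so handlers running on different threads cannot
    // overwrite each other's exceptions.
    private final ConcurrentLinkedQueue<Throwable> failures = new ConcurrentLinkedQueue<>();

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        failures.add(e);
    }

    // True as soon as at least one exception has been recorded.
    public boolean hasFailures() {
        return !failures.isEmpty();
    }

    // Copy of the recorded exceptions, in the order they were added.
    public List<Throwable> snapshot() {
        return new ArrayList<>(failures);
    }
}
```

A handler like this could be passed to `Thread.setUncaughtExceptionHandler`, and the waiting loop could then consult `hasFailures()` on each iteration rather than only after all jobs are done.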

      1. PIG-2780.0.patch
        5 kB
        Jie Li
      2. PIG-2780.1.patch
        6 kB
        Jie Li

        Activity

        Daniel Dai added a comment -

        Patch committed to trunk. Thanks Jie!

        Jie Li added a comment -

        Updated the patch to log an early warning for the failure when stop_on_failure is not enabled.

        Jie Li added a comment -

        Attached a patch that checks frequently whether any job has failed, and if so stops immediately.

        Also includes a unit test where Pig submits three jobs at the same time and one of them will fail. With this patch, Pig will stop without finishing the other two jobs.
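The behaviour described in this thread can be sketched as a polling loop. This is an illustration under stated assumptions, not the code in the attached patch: `Controller` is a hypothetical stand-in for the job controller (it is not the Hadoop JobControl API), and `stopOnFailure` and `pollMillis` are assumed parameters. The sketch stops at the first observed failure when `stopOnFailure` is set, and otherwise logs one early warning and keeps waiting.

```java
import java.util.List;

// Hypothetical sketch of an early-break wait loop; names are illustrative only.
final class EarlyBreakLauncher {
    /** Stand-in for the job controller. Not the Hadoop JobControl API. */
    public interface Controller {
        boolean allFinished();
        List<String> failedJobs();
        void stop();
    }

    /**
     * Polls until all jobs finish. If a failure is seen and stopOnFailure is
     * true, stops the controller and returns true. Otherwise warns once and
     * keeps waiting. Returns false when the loop ends without an early stop.
     */
    public static boolean waitForJobs(Controller jc, boolean stopOnFailure, long pollMillis) {
        boolean warned = false;
        while (!jc.allFinished()) {
            List<String> failed = jc.failedJobs();
            if (!failed.isEmpty()) {
                if (stopOnFailure) {
                    jc.stop();      // do not wait for the remaining jobs
                    return true;
                }
                if (!warned) {
                    System.err.println("Warning: failed job(s), continuing: " + failed);
                    warned = true;  // log the early warning only once
                }
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

The poll interval trades responsiveness against overhead: a shorter interval detects a failed job sooner but queries the controller more often.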

        Feng Peng added a comment -

        We have an example where Pig tries to submit 4 jobs simultaneously but only reports the exception after the other three jobs have finished (after 6 hours). Neither problem would cause errors in the results, but fixing them would help with debugging errors in large, complex jobs.

        Jie Li added a comment -

        The jobControlException is different from the job exception. Pig submits jobs in batches and checks for exceptions at the end of each batch. Jobs with no inter-dependencies can run in parallel in the same batch; dependent jobs go in different batches. Usually the number of jobs in one batch is very small, so checking for exceptions at the end looks acceptable.

        If you specify "-stop_on_failure" on the command line, Pig will stop after the failed batch.

        If that doesn't meet your needs, i.e. you also want Pig to stop on failure within a batch, then please keep this JIRA open.


          People

          • Assignee:
            Jie Li
            Reporter:
            Feng Peng