Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4824

Provide a mechanism for jobs to indicate they should not be recovered on restart

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: mrv1
    • Labels:
      None

      Description

      Some jobs (like Sqoop or HBase jobs) are not idempotent, so should not be recovered on jobtracker restart. MAPREDUCE-2702 solves this problem for MR2, however the approach there is not applicable for MR1, since even if we only use the job-level part of the patch and add a isRecoverySupported method to OutputCommitter, there is no way to use that information from the JT (which initiates recovery), since the JT does not instantiate OutputCommitters - and it shouldn't since they are user-level code. (In MR2 it's OK since the MR AM calls the method.)

      Instead, we can add a MR configuration property to say that a job is not recoverable, and the JT could safely read this from the job conf.

      1. MAPREDUCE-4824.patch
        4 kB
        Tom White
      2. MAPREDUCE-4824.patch
        4 kB
        Tom White
      3. MAPREDUCE-4824.patch
        5 kB
        Tom White
      4. MAPREDUCE-4824.patch
        5 kB
        Tom White
      5. MAPREDUCE-4824.patch
        5 kB
        Tom White
      6. MAPREDUCE-4824.patch
        5 kB
        Tom White
      7. MAPREDUCE-4824.patch
        6 kB
        Arun C Murthy
      8. MAPREDUCE-4824.patch
        6 kB
        Arun C Murthy

        Activity

        Hide
        Tom White added a comment -

        Here's a patch that implements this idea. Jobs that shouldn't be recovered should set mapred.job.restart.recover to false.

        Show
        Tom White added a comment - Here's a patch that implements this idea. Jobs that shouldn't be recovered should set mapred.job.restart.recover to false.
        Hide
        Harsh J added a comment -

        Hi,

        • The message below in the exception can be improved I feel. I think its better to say "Job ID was not recovered since it disabled recovery-upon-restart (mapred.job.restart.recover set to false).". Also, since this case is to be expected (non-default override), I think it ought to be a simple INFO log, but I understand we need to throw an Exception to halt the loading of the JIP.
        +      if (recovered && !conf.getBoolean("mapred.job.restart.recover", true)) {
        +        throw new IOException("Job " + jobId + " should not be recovered " +
        +            "since mapred.job.restart.recover is set to false.");
        +      }
        
        • We could also add this property to mapred-default.xml and document it that way.

        The test changes look good.

        Show
        Harsh J added a comment - Hi, The message below in the exception can be improved I feel. I think its better to say "Job ID was not recovered since it disabled recovery-upon-restart (mapred.job.restart.recover set to false).". Also, since this case is to be expected (non-default override), I think it ought to be a simple INFO log, but I understand we need to throw an Exception to halt the loading of the JIP. + if (recovered && !conf.getBoolean( "mapred.job.restart.recover" , true )) { + throw new IOException( "Job " + jobId + " should not be recovered " + + "since mapred.job.restart.recover is set to false ." ); + } We could also add this property to mapred-default.xml and document it that way. The test changes look good.
        Hide
        Bikas Saha added a comment -

        Agree with Harsh.
        I assume this config is job specific and cannot be inadvertently set to disable recovery of all jobs?

        Show
        Bikas Saha added a comment - Agree with Harsh. I assume this config is job specific and cannot be inadvertently set to disable recovery of all jobs?
        Hide
        Tom White added a comment -

        Thanks for the feedback. Here's an updated patch with the improved message.

        I didn't add the property to mapred-default.xml, since it is a job-specific property and these are generally not added there. There's no way to have true job-specific properties, since if someone adds the property to the jobtracker's mapred-site.xml file then it will be picked up. I'm not sure there's an easy way around this.

        Show
        Tom White added a comment - Thanks for the feedback. Here's an updated patch with the improved message. I didn't add the property to mapred-default.xml, since it is a job-specific property and these are generally not added there. There's no way to have true job-specific properties, since if someone adds the property to the jobtracker's mapred-site.xml file then it will be picked up. I'm not sure there's an easy way around this.
        Hide
        Harsh J added a comment -

        I didn't add the property to mapred-default.xml, since it is a job-specific property and these are generally not added there.

        We do have several job-specific properties with proper defaults listed in that file. Unless someone overrides them manually, how come there is harm in doing this, and must we remove the ones already present?

        The file just helps serve as a good doc. behind the config feature, cause otherwise there's no doc reference to this in the patch.

        Show
        Harsh J added a comment - I didn't add the property to mapred-default.xml, since it is a job-specific property and these are generally not added there. We do have several job-specific properties with proper defaults listed in that file. Unless someone overrides them manually, how come there is harm in doing this, and must we remove the ones already present? The file just helps serve as a good doc. behind the config feature, cause otherwise there's no doc reference to this in the patch.
        Hide
        Tom White added a comment -

        Good point, Harsh. Here's a new patch with the property documented in mapred-default.xml.

        Show
        Tom White added a comment - Good point, Harsh. Here's a new patch with the property documented in mapred-default.xml.
        Hide
        Harsh J added a comment -

        +1, please commit. Thanks Tom!

        Show
        Harsh J added a comment - +1, please commit. Thanks Tom!
        Hide
        Arun C Murthy added a comment -

        Tom, I'm concerned that this might blow up different schedulers in different ways. I need to re-check, but have you tested this with all 3 scehdulers?

        Maybe we need to do an 'if' check during recovery and not throw an IOException?

        Show
        Arun C Murthy added a comment - Tom, I'm concerned that this might blow up different schedulers in different ways. I need to re-check, but have you tested this with all 3 scehdulers? Maybe we need to do an 'if' check during recovery and not throw an IOException?
        Hide
        Arun C Murthy added a comment -

        Also, we might want to optimize this for hadoop-2, where in JobClient should set a field in AppSubmissionContext where-by it informs the RM that 'I do not want retries.'

        Thoughts?

        Show
        Arun C Murthy added a comment - Also, we might want to optimize this for hadoop-2, where in JobClient should set a field in AppSubmissionContext where-by it informs the RM that 'I do not want retries.' Thoughts?
        Hide
        Tom White added a comment -

        > I'm concerned that this might blow up different schedulers in different ways.

        I don't think that's a problem since the code change only affects job submission, which kicks in before scheduling code is run.

        > Maybe we need to do an 'if' check during recovery and not throw an IOException?

        I had another look at this and came up with a new patch. Does it look better?

        The Hadoop 2 change sounds like the right approach. At first I thought we didn't need the property in Hadoop 2, due to MAPREDUCE-2702, but actually it would allow users to mark a job as non-recoverable on a per-instance basis. It would build on YARN-128.

        Show
        Tom White added a comment - > I'm concerned that this might blow up different schedulers in different ways. I don't think that's a problem since the code change only affects job submission, which kicks in before scheduling code is run. > Maybe we need to do an 'if' check during recovery and not throw an IOException? I had another look at this and came up with a new patch. Does it look better? The Hadoop 2 change sounds like the right approach. At first I thought we didn't need the property in Hadoop 2, due to MAPREDUCE-2702 , but actually it would allow users to mark a job as non-recoverable on a per-instance basis. It would build on YARN-128 .
        Hide
        Harsh J added a comment -

        +1, took a look again and this alternative approach (no exception throw) looks good as well.

        Show
        Harsh J added a comment - +1, took a look again and this alternative approach (no exception throw) looks good as well.
        Hide
        Bikas Saha added a comment -

        Does it matter if recovered is true if the job conf says dont recover? Please ignore this comment in case I have not understood the logic correclty I am just going by the if condition.

        +    if (recovered && 
        +        !job.getJobConf().getBoolean("mapred.job.restart.recover", true)) {
        +      return null;
        +    }
        

        Did not quite get the resolution of the defaults.xml issue Harsh referred to earlier. Dont see any config changes in the last patch.

        Show
        Bikas Saha added a comment - Does it matter if recovered is true if the job conf says dont recover? Please ignore this comment in case I have not understood the logic correclty I am just going by the if condition. + if (recovered && + !job.getJobConf().getBoolean( "mapred.job.restart.recover" , true )) { + return null ; + } Did not quite get the resolution of the defaults.xml issue Harsh referred to earlier. Dont see any config changes in the last patch.
        Hide
        Tom White added a comment -

        > Does it matter if recovered is true if the job conf says dont recover?

        Yes, because if the job is running for the first time then you don't want to not submit it if mapred.job.restart.recover has been set to false.

        > Did not quite get the resolution of the defaults.xml issue Harsh referred to earlier. Dont see any config changes in the last patch.

        Oops, I inadvertently dropped them in the last patch. Here's a new patch with the change to mapred-default.xml.

        Show
        Tom White added a comment - > Does it matter if recovered is true if the job conf says dont recover? Yes, because if the job is running for the first time then you don't want to not submit it if mapred.job.restart.recover has been set to false. > Did not quite get the resolution of the defaults.xml issue Harsh referred to earlier. Dont see any config changes in the last patch. Oops, I inadvertently dropped them in the last patch. Here's a new patch with the change to mapred-default.xml.
        Hide
        Tom White added a comment -

        Updated with latest branch.

        Show
        Tom White added a comment - Updated with latest branch.
        Hide
        Arun C Murthy added a comment -

        Tom - sorry, looks like we lost track of this. Is this good to go? Tx

        Show
        Arun C Murthy added a comment - Tom - sorry, looks like we lost track of this. Is this good to go? Tx
        Hide
        Arun C Murthy added a comment -

        Also, with RM recovery, we'll need a similar mechanism for YARN too.

        Show
        Arun C Murthy added a comment - Also, with RM recovery, we'll need a similar mechanism for YARN too.
        Hide
        Tom White added a comment -

        Yes, this can go in.

        Show
        Tom White added a comment - Yes, this can go in.
        Hide
        Arun C Murthy added a comment -

        Rebased patch.

        Tom, some changes for you to review:

        1. I've renamed the config to be mapreduce.job.recover.on.restart to be more explicit/clear. We should use 'mapreduce' for new configs to avoid deprecations in future.
        2. I've also introduced a static final MAPREDUCE_RECOVER_JOB variable in JobConf to avoid using the actual config string by hand.
        Show
        Arun C Murthy added a comment - Rebased patch. Tom, some changes for you to review: I've renamed the config to be mapreduce.job.recover.on.restart to be more explicit/clear. We should use 'mapreduce' for new configs to avoid deprecations in future. I've also introduced a static final MAPREDUCE_RECOVER_JOB variable in JobConf to avoid using the actual config string by hand.
        Hide
        Tom White added a comment -

        Arun, thanks for updating the patch. I selected the property name to be similar to the existing mapred.jobtracker.restart.recover, but your point about using the "mapreduce" prefix is a good one (especially if we support this in MR2), so how about mapreduce.jobtracker.restart.recover? The rest looks good.

        Show
        Tom White added a comment - Arun, thanks for updating the patch. I selected the property name to be similar to the existing mapred.jobtracker.restart.recover, but your point about using the "mapreduce" prefix is a good one (especially if we support this in MR2), so how about mapreduce.jobtracker.restart.recover? The rest looks good.
        Hide
        Arun C Murthy added a comment -

        Thanks for taking a look at the update Tom. Let's go with mapreduce.job.restart.recover since, essentially, this is your patch - appreciate you taking f/b on naming!

        Show
        Arun C Murthy added a comment - Thanks for taking a look at the update Tom. Let's go with mapreduce.job.restart.recover since, essentially, this is your patch - appreciate you taking f/b on naming!
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12577205/MAPREDUCE-4824.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3503//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577205/MAPREDUCE-4824.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3503//console This message is automatically generated.
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks Tom!

        Show
        Arun C Murthy added a comment - I just committed this. Thanks Tom!
        Hide
        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.

        Show
        Matt Foley added a comment - Closed upon release of Hadoop 1.2.0.

          People

          • Assignee:
            Tom White
            Reporter:
            Tom White
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development