Hadoop Map/Reduce / MAPREDUCE-4850

Job recovery may fail if staging directory has been deleted

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.1
    • Fix Version/s: 1.2.0
    • Component/s: mrv1
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The job staging directory is deleted in the job cleanup task, which happens before the job-info file is deleted from the system directory (by the JobInProgress garbageCollect() method). If the JT shuts down between these two operations, then when the JT restarts and tries to recover the job, it fails since the job.xml and splits are no longer available.
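
The failure window described above can be reproduced in a small model. This is a minimal sketch and not JobTracker code: the class name, the `recover` method, and the file names `job-info`, `job.xml`, and `job.split` are placeholders chosen for illustration, and temporary local directories stand in for the system and staging directories.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative model only: all names here are hypothetical, not Hadoop's code.
class StagingCleanupRace {

    enum Recovery { NOTHING_TO_RECOVER, RECOVERED, FAILED }

    // On restart, a job is a recovery candidate while its job-info file still
    // exists in the system directory; recovery then needs job.xml and the splits.
    static Recovery recover(Path systemDir, Path stagingDir) {
        if (!Files.exists(systemDir.resolve("job-info"))) {
            return Recovery.NOTHING_TO_RECOVER;
        }
        boolean inputsPresent = Files.exists(stagingDir.resolve("job.xml"))
                && Files.exists(stagingDir.resolve("job.split"));
        return inputsPresent ? Recovery.RECOVERED : Recovery.FAILED;
    }

    // Simulates the original cleanup order with a shutdown after the first step.
    static Recovery crashBetweenCleanupSteps() {
        try {
            Path root = Files.createTempDirectory("mr4850");
            Path systemDir = Files.createDirectory(root.resolve("system"));
            Path stagingDir = Files.createDirectory(root.resolve("staging"));
            Files.createFile(systemDir.resolve("job-info"));
            Files.createFile(stagingDir.resolve("job.xml"));
            Files.createFile(stagingDir.resolve("job.split"));

            // Step 1 (job cleanup task): the staging directory is removed.
            Files.delete(stagingDir.resolve("job.xml"));
            Files.delete(stagingDir.resolve("job.split"));
            Files.delete(stagingDir);

            // Shutdown happens here, so step 2 (deleting job-info) never runs.
            return recover(systemDir, stagingDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(crashBetweenCleanupSteps()); // prints FAILED
    }
}
```

In this model the job-info file outlives the files that recovery depends on, which is the inconsistent state the description refers to.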

      1. MAPREDUCE-4850.patch
        3 kB
        Tom White
      2. MAPREDUCE-4850.patch
        9 kB
        Tom White

        Activity

        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.

        Tom White added a comment -

        I ran test-patch and it came back clean. I just committed this.

        Alejandro Abdelnur added a comment -

        +1. I was a bit confused about why we need a doAs, but mapred is not an HDFS superuser; that is why.

        Tom White added a comment -

        New patch with unit test. This depends on the fixes I made for MAPREDUCE-4859 which are not committed yet.

        Tom White added a comment -

        A patch that deletes the staging directory after the system directory.

        Manual testing showed that with this patch I couldn't get a recovery failure in the scenario in the description. It would be nice to add a unit test, but I'm still trying to figure out how to write one for this.
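
One way to reason about the reordering is to enumerate every point at which a shutdown could interrupt cleanup and check whether recovery would then fail. The sketch below is a hypothetical in-memory model, not the attached patch or its unit test: `JobFiles`, `Step`, and `failingCrashPoints` are invented names, and the only assumption about recovery is the one from the description (it is attempted when the job-info file exists and it needs the staging directory).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two cleanup steps; not the actual patch.
class CleanupCrashPoints {

    // Minimal state: which artifacts still exist.
    static final class JobFiles {
        boolean jobInfo = true;   // job-info file in the system directory
        boolean staging = true;   // staging directory (job.xml, splits)
    }

    interface Step { void run(JobFiles f); }

    static final Step DELETE_STAGING = f -> f.staging = false;
    static final Step DELETE_JOB_INFO = f -> f.jobInfo = false;

    // Recovery is attempted only while job-info exists, and then needs staging.
    static boolean recoveryWouldFail(JobFiles f) {
        return f.jobInfo && !f.staging;
    }

    // For each crash point (0..n completed steps), replays the steps up to
    // that point and records the points at which a restart could not recover.
    static List<Integer> failingCrashPoints(List<Step> steps) {
        List<Integer> failing = new ArrayList<>();
        for (int crashAfter = 0; crashAfter <= steps.size(); crashAfter++) {
            JobFiles f = new JobFiles();
            for (int i = 0; i < crashAfter; i++) {
                steps.get(i).run(f);
            }
            if (recoveryWouldFail(f)) {
                failing.add(crashAfter);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Original order: a crash after the first step breaks recovery.
        System.out.println(failingCrashPoints(List.of(DELETE_STAGING, DELETE_JOB_INFO))); // [1]
        // Reordered: no crash point leaves job-info without its staging files.
        System.out.println(failingCrashPoints(List.of(DELETE_JOB_INFO, DELETE_STAGING))); // []
    }
}
```

Under this model the reordered cleanup can still leave an orphaned staging directory if the shutdown lands between the two steps, but no recovery is attempted for that job, so nothing fails on restart.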


          People

          • Assignee: Tom White
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 6
