  1. Hadoop Map/Reduce
  2. MAPREDUCE-4595

TestLostTracker failing - possibly due to a race in JobHistory.JobHistoryFilesManager#run()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.3
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels:
    • Hadoop Flags:
      Reviewed

      Description

      The occasional failures of TestLostTracker appear to have the following cause:

      On job completion, JobHistoryFilesManager#run() spawns another thread to move the history files to the done folder. TestLostTracker waits for job completion before checking the file format of the history file. However, at that point the move of the history files might still be in progress, or might not have started at all.

      The attachment (force-TestLostTracker-failure.patch) helps reproduce the error locally by increasing the chance of hitting this race.
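
The race described above can be illustrated with a minimal, self-contained sketch. It does not use Hadoop code: the class `HistoryMoveRace`, the flag `movedToDone`, and the `moveDelayMillis` parameter are hypothetical stand-ins for the mover thread and the history file reaching the done folder. The artificial delay plays the role that the description attributes to the force-failure patch, that is, making the window wide enough to hit reliably.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class HistoryMoveRace {

    /**
     * Simulates "job completes, then a background thread moves the file".
     * Returns {seenRightAfterCompletion, seenAfterMoverFinished}.
     */
    static boolean[] simulate(long moveDelayMillis) throws InterruptedException {
        // Hypothetical stand-in for "history file is now in the done folder".
        AtomicBoolean movedToDone = new AtomicBoolean(false);
        CountDownLatch jobComplete = new CountDownLatch(1);

        // Stand-in for the thread spawned on job completion.
        Thread mover = new Thread(() -> {
            try {
                Thread.sleep(moveDelayMillis); // artificial delay to widen the race window
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            movedToDone.set(true);
        });

        // Completion is signalled as soon as the mover is started, not when it finishes.
        mover.start();
        jobComplete.countDown();

        // The "test" waits only for job completion and then checks immediately.
        jobComplete.await();
        boolean seenRightAfterCompletion = movedToDone.get();

        // Waiting for the mover itself closes the window.
        mover.join();
        boolean seenAfterMoverFinished = movedToDone.get();

        return new boolean[] {seenRightAfterCompletion, seenAfterMoverFinished};
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = simulate(200);
        System.out.println("moved when checked right after completion: " + result[0]);
        System.out.println("moved after the mover thread finished:     " + result[1]);
    }
}
```

With a delay of a couple of hundred milliseconds, the first check is expected to miss the file and the second to find it, which mirrors the intermittent failure reported here.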

      1. force-TestLostTracker-failure.patch
        0.9 kB
        Karthik Kambatla
      2. MR-4595.patch
        1 kB
        Karthik Kambatla
      3. MR-4595.patch
        1 kB
        Karthik Kambatla

        Activity

        mattf Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.

        tucu00 Alejandro Abdelnur added a comment -

        Thanks Karthik. Committed to branch-1.

        tucu00 Alejandro Abdelnur added a comment -

        +1

        kkambatl Karthik Kambatla (Inactive) added a comment -

        Uploading a new patch that incorporates Alejandro's offline comments:

        • Use while(max-wait-time) instead of for(i < 10)
        • Sleep for a shorter time (50 ms)
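
The two review points above amount to a deadline-bounded polling loop. The following is a minimal sketch of that pattern, not the code from MR-4595.patch: the class `BoundedWait`, the method `waitFor`, and its parameters are hypothetical names, and it uses lambdas from later Java versions for brevity. The 50 ms sleep comes from the comment; the maximum wait is left as a parameter.

```java
import java.util.function.BooleanSupplier;

public class BoundedWait {

    /**
     * Polls the condition until it holds or maxWaitMillis has elapsed.
     * Returns true if the condition held, false on timeout.
     */
    static boolean waitFor(BooleanSupplier condition, long maxWaitMillis, long sleepMillis)
            throws InterruptedException {
        // nanoTime is monotonic, so the deadline is unaffected by wall-clock changes.
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the condition never became true in time
            }
            Thread.sleep(sleepMillis); // short sleep keeps the test responsive
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical condition: becomes true about 150 ms from now.
        final long readyAt = System.currentTimeMillis() + 150;
        boolean ok = waitFor(() -> System.currentTimeMillis() >= readyAt, 2000, 50);
        System.out.println("condition met before timeout: " + ok);
    }
}
```

Compared with a fixed iteration count such as for(i < 10), a time-based bound keeps the total wait independent of the sleep interval, so the sleep can be shortened without also shortening the overall timeout.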
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12542627/MR-4595.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2774//console

        This message is automatically generated.

        kkambatl Karthik Kambatla (Inactive) added a comment -

        I meant "foolproof approach" in my previous comment.

        With the patch, the test passes even with the sleep that the force-failure patch adds to JobHistoryFilesManager#run().

        kkambatl Karthik Kambatla (Inactive) added a comment -

        Uploading a patch for branch-1.

        I understand it is not the absolute fool approach, as the test still fails if the thread moving the file takes longer than 5 minutes. However, it is a cause of concern if it takes longer than that.

        Please feel free to suggest alternate/better approaches.


          People

          • Assignee:
            kasha Karthik Kambatla
            Reporter:
            kasha Karthik Kambatla
          • Votes:
            0
            Watchers:
            6
