Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.21.0, 0.22.0
    • Fix Version/s: 0.20.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      FairSchedulerServlet can cause a deadlock with the JobTracker

      1. deadlock.png
        25 kB
        Todd Lipcon
      2. mapreduce-1070-branch20.txt
        7 kB
        Todd Lipcon
      3. mapreduce-1070.txt
        7 kB
        Todd Lipcon

        Activity

        Todd Lipcon added a comment -

        See attached diagram displaying inconsistent lock order based on dynamic analysis.

        Here's a stack trace from an instance of this that we saw in production:

        Thread 60324 (1823988020@qtp0-4064):
          State: BLOCKED
          Blocked count: 52
          Waited count: 32
          Blocked on org.apache.hadoop.mapred.JobInProgress@5d2044dd
          Blocked by 113 (IPC Server handler 9 on 7277)
          Stack:
            org.apache.hadoop.mapred.JobInProgress.finishedMaps(JobInProgress.java:560)
            org.apache.hadoop.mapred.FairSchedulerServlet.showJobs(FairSchedulerServlet.java:235)
            org.apache.hadoop.mapred.FairSchedulerServlet.doGet(FairSchedulerServlet.java:136)
            javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
            ...
        Thread 113 (IPC Server handler 9 on 7277):
          State: BLOCKED
          Blocked count: 540572
          Waited count: 2658131
          Blocked on org.apache.hadoop.mapred.FairScheduler@a12d500
          Blocked by 60324 (1823988020@qtp0-4064)
          Stack:
            org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2069)
            org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2538)
            org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2181)
            org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2125)
            org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:892)
            org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3415)
            org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2712)
            org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2507)

        The solution is for the servlet to synchronize on the JobTracker before synchronizing on jobs.
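
        To illustrate the lock ordering in question, here is a minimal sketch. It is not the actual patch: the class, fields, and method bodies are hypothetical stand-ins, and it assumes (from the stack trace) that the heartbeat path holds the tracker lock and a job lock before it needs the scheduler lock, while the servlet holds the scheduler lock when it asks a job for its finished map count. Having the servlet take the tracker lock first means the two paths can never hold the inner locks at the same time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for JobTracker, FairScheduler and JobInProgress.
// None of this is Hadoop code; it only models the three monitors involved.
class LockOrderSketch {
    private final Object trackerLock = new Object();   // plays the JobTracker monitor
    private final Object schedulerLock = new Object(); // plays the FairScheduler monitor
    private final List<Job> jobs = new ArrayList<>();
    private int schedulerEvents;                        // guarded by schedulerLock

    static final class Job {                            // plays a JobInProgress
        private int finishedMaps;
        synchronized int finishedMaps() { return finishedMaps; }
    }

    LockOrderSketch(int jobCount) {
        for (int i = 0; i < jobCount; i++) {
            jobs.add(new Job());
        }
    }

    // Heartbeat-style path: tracker -> job -> scheduler.
    void heartbeat() {
        synchronized (trackerLock) {
            for (Job job : jobs) {
                synchronized (job) {
                    job.finishedMaps++;
                    synchronized (schedulerLock) {
                        schedulerEvents++;              // e.g. notifying the scheduler
                    }
                }
            }
        }
    }

    // Servlet-style path after the fix: tracker -> scheduler -> job.
    // Without the outer trackerLock this would be scheduler -> job, which can
    // interleave with heartbeat()'s job -> scheduler and deadlock.
    int showJobs() {
        synchronized (trackerLock) {
            synchronized (schedulerLock) {
                int total = 0;
                for (Job job : jobs) {
                    total += job.finishedMaps();        // takes the job monitor
                }
                return total;
            }
        }
    }
}
```

        Note that the inner order still differs between the two methods; the outer tracker lock is what serializes them. Any other code path that takes the scheduler and job locks without the tracker lock would remain at risk.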

        Todd Lipcon added a comment -

        Here's a patch against branch20 which fixes the issue (the dynamic analysis tool no longer sees the potential deadlock).

        I also changed the output to go into a ByteArrayOutputStream to keep the time slice during which the JT lock is held as short as possible.

        Patch against trunk and branch 21 coming soon.
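
        The buffering idea mentioned in this comment can be sketched as follows. This is an illustration under assumptions, not the patch itself: the class name, the trackerLock field, and the per-job map are hypothetical. The page is rendered into a ByteArrayOutputStream while the lock is held, and the potentially slow write to the client happens only after the lock is released.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; trackerLock stands in for the JobTracker monitor.
class BufferedRenderSketch {
    private final Object trackerLock = new Object();
    private final Map<String, Integer> finishedMapsByJob = new LinkedHashMap<>();

    void recordJob(String jobId, int finishedMaps) {
        synchronized (trackerLock) {
            finishedMapsByJob.put(jobId, finishedMaps);
        }
    }

    void render(OutputStream clientStream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        // Only in-memory work happens while the lock is held.
        synchronized (trackerLock) {
            PrintWriter out = new PrintWriter(
                new OutputStreamWriter(buffer, StandardCharsets.UTF_8));
            for (Map.Entry<String, Integer> e : finishedMapsByJob.entrySet()) {
                out.print(e.getKey() + " " + e.getValue() + "\n");
            }
            out.flush();
        }

        // Network I/O to a possibly slow client runs without the lock.
        buffer.writeTo(clientStream);
        clientStream.flush();
    }
}
```

        The trade-off is that the whole page is held in memory before it is sent, which is usually acceptable for a small status page.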

        Matei Zaharia added a comment -

        +1 looks good.

        Matei Zaharia added a comment -

        You need to create a patch for trunk too though. I think it will be almost exactly the same.

        Todd Lipcon added a comment -

        Patch against trunk.

        Test not included since the deadlock is a timing bug that can't be reproduced reliably.

        I reran the jcarder tool that produced the attached diagram and it no longer detects this potential deadlock.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12422015/mapreduce-1070.txt
        against trunk revision 824750.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/163/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/163/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/163/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/163/console

        This message is automatically generated.

        Todd Lipcon added a comment -

        Failed test is Hftp - unrelated.
        Lack of new tests is because this is a fix for a deadlock which isn't reproducible.

        Chris Douglas added a comment -

        I committed this. Thanks, Todd!

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #84 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/84/)
        Prevent a deadlock in the fair scheduler servlet.
        Contributed by Todd Lipcon

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #117 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/117/)
        Prevent a deadlock in the fair scheduler servlet.
        Contributed by Todd Lipcon


          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Todd Lipcon
          • Votes:
            1
            Watchers:
            6
