Hadoop Common / HADOOP-4977

Deadlock between reclaimCapacity and assignTasks

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I was running the latest trunk with the capacity scheduler and saw the JobTracker lock up with the following deadlock reported in jstack:

      Found one Java-level deadlock:
      =============================
      "18107298@qtp0-4":
      waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
      which is held by "IPC Server handler 4 on 54311"
      "IPC Server handler 4 on 54311":
      waiting to lock monitor 0x0808594c (object 0x5660e518, a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr),
      which is held by "reclaimCapacity"
      "reclaimCapacity":
      waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
      which is held by "IPC Server handler 4 on 54311"

      Java stack information for the threads listed above:
      ===================================================
      "18107298@qtp0-4":
          at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:2695)
          - waiting to lock <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
          at org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:93)
          at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
          at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:324)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
          at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
      "IPC Server handler 4 on 54311":
          at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
          - waiting to lock <0x5660e518> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
          at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:855)
          at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$1000(CapacityTaskScheduler.java:294)
          at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1336)
          - locked <0x5660dd20> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
          at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2288)
          - locked <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
          at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)


      Unfortunately, I failed to select all of the jstack output when copying it, so some of it, including the reclaimCapacity stack, is missing. Still, it appears that reclaimCapacity locks the MapSchedulingMgr and then tries to lock the JobTracker, whereas the heartbeat path already holds the JobTracker lock (the JobTracker takes it before calling assignTasks) and then, in updateQSIObjects, tries to lock the MapSchedulingMgr. The other thread listed is a Jetty thread for the web interface; it is blocked on the JobTracker lock but isn't part of the circular wait. One fix would be to lock the JobTracker in reclaimCapacity before locking anything else.
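
      To make the lock-ordering argument concrete, here is a minimal sketch in plain Java. It is not the CapacityTaskScheduler code and not taken from any of the attached patches: the class, fields, and method names are hypothetical stand-ins, and the two Object monitors only model the JobTracker and MapSchedulingMgr locks seen in the jstack output. It shows the ordering suspected of causing the deadlock (never invoked here) next to the proposed ordering, in which both paths take the JobTracker lock first.

      ```java
      // Hypothetical stand-in for the two code paths; not Hadoop code.
      final class LockOrderSketch {
          // Model the JobTracker and MapSchedulingMgr monitors from the jstack output.
          private final Object jobTracker = new Object();
          private final Object mapSchedulingMgr = new Object();
          private int completed = 0; // only modified while both monitors are held

          // Heartbeat path: JobTracker first, then the scheduling manager.
          void heartbeatAssignTasks() {
              synchronized (jobTracker) {
                  synchronized (mapSchedulingMgr) {
                      completed++;
                  }
              }
          }

          // Suspected original ordering: scheduling manager first, then JobTracker.
          // Run concurrently with heartbeatAssignTasks() this can deadlock,
          // so the sketch never calls it.
          void reclaimCapacityDeadlockProne() {
              synchronized (mapSchedulingMgr) {
                  synchronized (jobTracker) {
                      completed++;
                  }
              }
          }

          // Proposed ordering: take the JobTracker lock before anything else.
          void reclaimCapacityFixed() {
              synchronized (jobTracker) {
                  synchronized (mapSchedulingMgr) {
                      completed++;
                  }
              }
          }

          // Runs both paths concurrently with a consistent lock order and
          // returns how many critical sections completed.
          static int runBoth(int iterations) {
              LockOrderSketch s = new LockOrderSketch();
              Thread heartbeat = new Thread(() -> {
                  for (int i = 0; i < iterations; i++) s.heartbeatAssignTasks();
              }, "heartbeat");
              Thread reclaim = new Thread(() -> {
                  for (int i = 0; i < iterations; i++) s.reclaimCapacityFixed();
              }, "reclaimCapacity");
              heartbeat.start();
              reclaim.start();
              try {
                  heartbeat.join();
                  reclaim.join();
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  throw new IllegalStateException("interrupted while waiting", e);
              }
              return s.completed; // safe to read: join() establishes happens-before
          }

          public static void main(String[] args) {
              System.out.println("completed sections: " + runBoth(10000));
          }
      }
      ```

      Because every thread acquires the two monitors in the same order, no cycle of waiters can form, which is the property the suggested change to reclaimCapacity relies on. Whether the real fix takes exactly this shape depends on the attached patches.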

        Attachments

        1. jstack.txt
          5 kB
          Matei Zaharia
        2. 4977.1.patch
          19 kB
          Vivek Ratan
        3. 4977.2.patch
          20 kB
          Vivek Ratan
        4. 4977.3.patch
          20 kB
          Vivek Ratan
        5. 4977.4.patch
          22 kB
          Vivek Ratan
        6. 4977.4.patch
          21 kB
          Hemanth Yamijala

            People

            • Assignee: Vivek Ratan (vivekr)
            • Reporter: Matei Zaharia (matei)
            • Votes: 0
            • Watchers: 4
