  1. Apache Tez
  2. TEZ-4416

Deadlock triggered by ShuffleScheduler

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.10.3
    • Component/s: None
    • Labels: None
    • Flags: Important

    Description

      How this bug is found:

      I was executing a SQL query with Hive on Tez on a cluster with low disk capacity. An exception was thrown during execution (which is quite reasonable). However, the task did not stop normally; it kept hanging for a very long time. I therefore captured a jstack dump and investigated. Here is what I found.

      (The .jstack file and a screenshot of the relevant jstack segment are attached below.)

      How this deadlock is triggered:

      1. A file copy to local disk fails, which triggers a call to copyFailed() from FetcherOrderedGrouped.copyFromHost(). copyFailed() is a synchronized method on the ShuffleScheduler instance.
      2. The call from step 1 eventually reaches ShuffleScheduler.close(), which tries to stop the Referee thread by calling referee.interrupt() and referee.join().
      3. Meanwhile, the Referee is waiting in its run() method for the ShuffleScheduler instance lock, which is held by the thread from step 1. Hence a deadlock occurs: the thread in step 1 waits for the Referee to exit, and the Referee waits for a lock that thread will never release.
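
      The three steps above can be reproduced in isolation. The following is a minimal sketch, not the Tez code: the class SchedulerDeadlockSketch and its helpers (start, awaitReferee, the latch, the state polling) are hypothetical and exist only to make the demo deterministic. Only the method names copyFailed() and close() and the interrupt()/join() pair mirror the description. The join is bounded here so the program terminates; with an unbounded join(), as described in step 2, the calling thread would wait forever.

      ```java
      import java.util.concurrent.CountDownLatch;

      // Hypothetical stand-in for the scheduler described above; NOT the actual Tez class.
      final class SchedulerDeadlockSketch {
          // Demo-only latch: lets the referee wait until the "fetcher" thread owns the monitor.
          private final CountDownLatch monitorHeld = new CountDownLatch(1);
          private final Thread referee = new Thread(this::refereeRun, "Referee");

          void start() {
              referee.setDaemon(true);
              referee.start();
          }

          // Step 3: the referee needs the scheduler's instance lock to make progress.
          private void refereeRun() {
              try {
                  monitorHeld.await();
              } catch (InterruptedException e) {
                  return;
              }
              synchronized (this) {
                  // Entering a monitor is not interruptible, so interrupt() cannot
                  // release a thread that is blocked on this line.
              }
          }

          // Step 1: a synchronized method on the scheduler instance.
          // Returns true if the referee is still alive after close() gave up waiting.
          synchronized boolean copyFailed(long joinTimeoutMillis) {
              monitorHeld.countDown();
              try {
                  // Demo only: wait until the referee is blocked on this object's monitor.
                  while (referee.getState() != Thread.State.BLOCKED) {
                      Thread.sleep(1);
                  }
                  return close(joinTimeoutMillis);
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  throw new IllegalStateException(e);
              }
          }

          // Step 2: interrupt and join the referee while the caller still holds the lock.
          // Pass a positive timeout; join(0) waits forever and would reproduce the hang.
          private boolean close(long joinTimeoutMillis) throws InterruptedException {
              referee.interrupt();
              referee.join(joinTimeoutMillis);
              return referee.isAlive();
          }

          // Demo only: reports whether the referee exits once the lock has been released.
          boolean awaitReferee(long timeoutMillis) {
              try {
                  referee.join(timeoutMillis);
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
              return !referee.isAlive();
          }

          public static void main(String[] args) {
              SchedulerDeadlockSketch scheduler = new SchedulerDeadlockSketch();
              scheduler.start();
              System.out.println("referee stuck during close: " + scheduler.copyFailed(200));
              System.out.println("referee exits after lock release: " + scheduler.awaitReferee(2000));
          }
      }
      ```

      The sketch suggests two general ways out of this pattern, neither of which is claimed to be the fix adopted in Tez: join the helper thread only after the instance lock has been released, or have the helper wait with an interruptible primitive instead of blocking on monitor entry.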

      Attachments

        1. screenshot.PNG
          97 kB
          Omega-Ariston
        2. container.jstack
          36 kB
          Omega-Ariston

            People

              Assignee: Unassigned
              Reporter: Omega-Ariston
              Votes: 0
              Watchers: 4