  1. Apache Tez
  2. TEZ-4334

Fix deadlock in ShuffleScheduler between ShuffleScheduler.close() and the ShufflePenaltyReferee thread

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.3
    • Component/s: None
    • Labels: None

    Description

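      A deadlock can occur between a thread calling ShuffleScheduler.close() and the ShufflePenaltyReferee thread. In the example below, the Fetcher thread holds the ShuffleScheduler monitor (acquired in copyFailed()) while it waits inside close(), and the ShufflePenaltyReferee thread is blocked trying to acquire that same monitor, so neither thread can make progress.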

      Example (produced with an earlier version):

      "Fetcher_O { attempt_1611850856294_0026_1_03_000000_0_10344 Reducer_3} #13" #2669 daemon prio=5 os_prio=0 tid=0x00002b9de869d000 nid=0xf99 in Object.wait() [0x00002b9de4983000]
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.close(ShuffleScheduler.java:481)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleScheduler(Shuffle.java:352)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleSchedulerIgnoreErrors(Shuffle.java:343)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.reportException(Shuffle.java:407)
        - locked <0x00002b96bbb9d7a8> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1033)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:781)
        - locked <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:414)

      "ShufflePenaltyReferee {Reducer_3}" #2645 daemon prio=5 os_prio=0 tid=0x00002b9560fae800 nid=0xf7d waiting for monitor entry [0x00002b9de733b000]
        java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee.run(ShuffleScheduler.java:1322)
        - waiting to lock <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)

      The deadlock can be fixed by:

      1) not holding the ShuffleScheduler.this monitor when calling exceptionReporter.reportException()
      2) removing synchronized from copyFailed()
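
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the Tez code or the actual patch: SchedulerSketch, Reporter, and demo() are hypothetical names, and the assumption that close() waits for the referee thread with join() is inferred from the Object.wait() frame in the dump. The sketch updates shared state inside a short synchronized block and calls the reporter only after the monitor has been released, so the referee thread can still enter the monitor and exit when close() is called.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the locking pattern; not the Tez ShuffleScheduler.
class SchedulerSketch {
    /** Hypothetical stand-in for the exception reporter. */
    interface Reporter {
        void reportException(Throwable t);
    }

    private final Reporter reporter;
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Thread referee;
    private int failures = 0; // guarded by the monitor of "this"

    SchedulerSketch(Reporter reporter) {
        this.reporter = reporter;
        // Background thread that periodically needs the scheduler monitor,
        // playing the role of the ShufflePenaltyReferee thread.
        referee = new Thread(() -> {
            while (!closed.get()) {
                synchronized (SchedulerSketch.this) {
                    // periodic bookkeeping would go here
                }
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    return; // interrupted by close()
                }
            }
        }, "RefereeSketch");
        referee.setDaemon(true);
        referee.start();
    }

    // Change 2: the method itself is not synchronized.
    void copyFailed() {
        boolean unhealthy;
        synchronized (this) {
            failures++;
            unhealthy = failures >= 1; // placeholder health check
        }
        // Change 1: the monitor is released before the reporter is called,
        // so a reporter that ends up in close() cannot block the referee.
        if (unhealthy) {
            reporter.reportException(new RuntimeException("shuffle unhealthy"));
        }
    }

    // Assumed behaviour: close() stops the referee and waits for it to exit.
    void close() throws InterruptedException {
        if (closed.compareAndSet(false, true)) {
            referee.interrupt();
            referee.join();
        }
    }

    /** Returns true if a fetcher-like thread finishes copyFailed() within 5 seconds. */
    static boolean demo() {
        SchedulerSketch[] holder = new SchedulerSketch[1];
        // The reporter closes the scheduler, mirroring the call chain in the dump.
        holder[0] = new SchedulerSketch(t -> {
            try {
                holder[0].close();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread fetcher = new Thread(() -> holder[0].copyFailed(), "FetcherSketch");
        fetcher.start();
        try {
            fetcher.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !fetcher.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("completed without deadlock: " + demo());
    }
}
```

If copyFailed() were declared synchronized and called the reporter while still holding the monitor, the referee could block on monitor entry while close() waits in join(), which is the state captured in the thread dump.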

            People

              Assignee: Sungwoo Park (glapark)
              Reporter: Sungwoo Park (glapark)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 20m