Hadoop HDFS / HDFS-862

Potential NN deadlock in processDistributedUpgradeCommand

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.22.0, 0.23.1
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      Haven't seen this in practice, but the lock order is inconsistent. processReport locks FSNamesystem and then calls UpgradeManager.startUpgrade, getUpgradeState, and getUpgradeStatus, each of which locks the UpgradeManager. In the other direction, FSNamesystem.processDistributedUpgradeCommand calls upgradeManager.processUpgradeCommand, which is synchronized on UpgradeManager and can call FSNamesystem.leaveSafeMode, which in turn synchronizes on FSNamesystem.
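The inversion described above can be sketched in isolation. This is a minimal illustration, not HDFS code: the two `ReentrantLock` fields are hypothetical stand-ins for the FSNamesystem and UpgradeManager monitors, and `holdThenTry` and `cycleDetected` are names invented for this example. Each thread takes its first lock, waits until the other thread holds its own first lock, and then tries the second lock with a timeout so that the sketch terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderSketch {
    // Hypothetical stand-ins for the FSNamesystem and UpgradeManager monitors.
    static final ReentrantLock fsNamesystemLock = new ReentrantLock();
    static final ReentrantLock upgradeManagerLock = new ReentrantLock();

    /** Holds `first`, waits for the peer to hold its first lock, then tries `second`. */
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHold, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHold.countDown();
            bothHold.await();                 // both threads now hold their first lock
            boolean got = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();                // keep `first` held until the peer has tried too
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Returns true when neither thread could obtain its second lock. */
    static boolean cycleDetected() {
        CountDownLatch bothHold = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];
        // Order used by the processReport path: FSNamesystem, then UpgradeManager.
        Thread reportPath = new Thread(() ->
            gotSecond[0] = holdThenTry(fsNamesystemLock, upgradeManagerLock, bothHold, bothTried));
        // Order used by the processUpgradeCommand path: UpgradeManager, then FSNamesystem.
        Thread upgradePath = new Thread(() ->
            gotSecond[1] = holdThenTry(upgradeManagerLock, fsNamesystemLock, bothHold, bothTried));
        reportPath.start();
        upgradePath.start();
        try {
            reportPath.join();
            upgradePath.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    public static void main(String[] args) {
        System.out.println("cycle detected: " + cycleDetected());
    }
}
```

With plain `synchronized` blocks in place of the timed `tryLock`, the same interleaving would block both threads indefinitely. Acquiring the two locks in the same order on both paths would remove the cycle.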

        Activity

        tlipcon Todd Lipcon added a comment -

        Here's the graph of the lock cycle.

        aklochkov Andrey Klochkov added a comment -

        Confirming that this happens in practice, at least in tests. The TestDistributedUpgrade test is flaky for this reason. We're capturing thread dumps of tests that fail due to timeouts (HADOOP-8755), and here's the thread dump of a TestDistributedUpgrade failure (see attachment). Thread #110 is blocked by #107 (or #109), and in turn #107 (109?) is blocked by #110. The first one acquired a monitor on the UpgradeManagerNamenode instance, and the second one got an fsLock, so each is waiting for the other. The test fails to start the cluster because DN heartbeats can't be processed by the NN.
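A cycle like the one read out of the thread dump above can also be found programmatically with the JDK's `ThreadMXBean.findDeadlockedThreads()`, which covers both object monitors and ownable synchronizers. The sketch below is an illustration under assumptions, not the test's actual code: `upgradeManagerMonitor` and `fsLock` are hypothetical stand-ins (a plain `ReentrantLock` is used here in place of the real fsLock), and `countDeadlockedThreads` is a name invented for this example. It builds a two-thread cycle, polls the bean until the cycle is reported, and then interrupts one thread so the sketch can exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class DeadlockProbeSketch {
    /** Builds a monitor/lock cycle, reports how many threads the JVM flags, then breaks it. */
    static int countDeadlockedThreads() {
        Object upgradeManagerMonitor = new Object();   // hypothetical stand-in
        ReentrantLock fsLock = new ReentrantLock();    // hypothetical stand-in
        CountDownLatch bothHold = new CountDownLatch(2);

        Thread holdsMonitor = new Thread(() -> {
            synchronized (upgradeManagerMonitor) {
                try {
                    bothHold.countDown();
                    bothHold.await();
                    fsLock.lockInterruptibly();        // blocks: the other thread owns fsLock
                    fsLock.unlock();
                } catch (InterruptedException e) {
                    // Interrupted by the probe below to break the cycle.
                }
            }
        });
        Thread holdsFsLock = new Thread(() -> {
            fsLock.lock();
            try {
                bothHold.countDown();
                bothHold.await();
                synchronized (upgradeManagerMonitor) {
                    // Reached only after the other thread leaves its synchronized block.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                fsLock.unlock();
            }
        });
        holdsMonitor.start();
        holdsFsLock.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        try {
            // Poll until the JVM reports the cycle or the deadline passes.
            while (ids == null && System.nanoTime() < deadline) {
                ids = mx.findDeadlockedThreads();
                if (ids == null) {
                    Thread.sleep(20);
                }
            }
            holdsMonitor.interrupt();                  // releases the monitor, ending the cycle
            holdsMonitor.join();
            holdsFsLock.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

Only the thread waiting in `lockInterruptibly` can be interrupted out of the cycle. A thread blocked entering a `synchronized` block does not respond to interruption, which is why the probe interrupts `holdsMonitor` rather than `holdsFsLock`.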

        tlipcon Todd Lipcon added a comment -

        Given that we removed the "distributed upgrade" code recently, maybe we should just backport that patch to earlier branches to avoid this issue entirely? Thanks for digging into this, Andrey!

        aklochkov Andrey Klochkov added a comment -

        Yes, backporting would be helpful then. Can you initiate it please?


          People

          • Assignee: Unassigned
          • Reporter: tlipcon Todd Lipcon
          • Votes: 0
          • Watchers: 8
