Hadoop HDFS / HDFS-15809

DeadNodeDetector doesn't remove live nodes from dead node set.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.1, 3.4.0
    • Fix Version/s: 3.3.1, 3.4.0
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We found that in a large cluster, DeadNodeDetector might never remove live nodes from the dead node set. For example:

      1. DeadNodeDetector adds 200 nodes to the dead node set.
      2. DeadNodeDetector#checkDeadNodes() adds only 100 of them to the deadNodesProbeQueue, because the queue length is limited to 100.
      3. The probe threads start working and probe 30 nodes.
      4. DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates over the dead node set and adds 30 nodes to the deadNodesProbeQueue. Because the iteration order is the same as last time, the 30 nodes that have already been probed are added to the queue again.
      5. Steps 3 and 4 repeat, and each pass re-adds the same first 30 nodes from the dead node set. If those nodes are all genuinely dead, the live nodes behind them can never be recovered.
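
      The steps above can be reproduced with a small toy model. This is only a sketch of the scenario as described, not the actual Hadoop code: the class name, the constants, the integer node IDs, and the assumption that nodes 0 to 99 are genuinely dead while nodes 100 to 199 are live are all hypothetical. Only the method name checkDeadNodes is taken from the description.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

final class ProbeStarvationSketch {
    static final int QUEUE_LIMIT = 100;      // bounded probe queue (step 2)
    static final int PROBES_PER_ROUND = 30;  // nodes probed between two passes (step 3)

    /** Toy scheduling pass: iterate in a fixed order and stop once the queue is full. */
    static void checkDeadNodes(Set<Integer> deadNodes, Queue<Integer> probeQueue) {
        for (int node : deadNodes) {
            if (probeQueue.size() >= QUEUE_LIMIT) {
                break;
            }
            probeQueue.add(node);
        }
    }

    /**
     * Runs the given number of rounds and returns every node that was ever probed.
     * Nodes with an ID of firstLiveNode or higher are treated as live (assumption).
     */
    static TreeSet<Integer> simulate(int totalNodes, int firstLiveNode, int rounds) {
        // LinkedHashSet gives the same iteration order on every pass.
        Set<Integer> deadNodes = new LinkedHashSet<>();
        for (int i = 0; i < totalNodes; i++) {
            deadNodes.add(i);
        }
        Queue<Integer> probeQueue = new ArrayDeque<>();
        TreeSet<Integer> probed = new TreeSet<>();

        for (int round = 0; round < rounds; round++) {
            checkDeadNodes(deadNodes, probeQueue);
            for (int p = 0; p < PROBES_PER_ROUND && !probeQueue.isEmpty(); p++) {
                int node = probeQueue.poll();
                probed.add(node);
                if (node >= firstLiveNode) {
                    deadNodes.remove(node);  // a live node would leave the dead set
                }
            }
        }
        return probed;
    }

    public static void main(String[] args) {
        TreeSet<Integer> probed = simulate(200, 100, 1000);
        System.out.println("distinct nodes probed: " + probed.size());  // 100
        System.out.println("highest node probed: " + probed.last());    // 99
    }
}
```

      In this model the nodes at the front of the set never leave it, so every pass refills the queue from the same prefix, and nodes 100 to 199 are never probed no matter how many rounds run.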

      Attachments

        1. HDFS-15809.001.patch
          17 kB
          Jinglun
        2. HDFS-15809.002.patch
          17 kB
          Jinglun
        3. HDFS-15809.003.patch
          18 kB
          Jinglun
        4. HDFS-15809.004.patch
          15 kB
          Jinglun
        5. HDFS-15809.005.patch
          15 kB
          Jinglun
        6. HDFS-15809.006.patch
          15 kB
          Jinglun
        7. HDFS-15809.007.patch
          15 kB
          Jinglun


          People

            Assignee: LiJinglun Jinglun
            Reporter: LiJinglun Jinglun
            Votes: 0
            Watchers: 4
