Hadoop HDFS / HDFS-7400

More reliable namenode health check to detect OS/HW issues


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      We had this scenario on an active NN machine.

      • The disk array controller firmware had a bug, so the disks stopped working.
      • ZKFC and NN still considered the node healthy; communication between ZKFC and ZK, as well as between ZKFC and NN, was fine.
      • The machine responded to ping.
      • The machine could not be reached over SSH.

      As a result, no client or DN could use the NN, yet ZKFC and NN still considered the node healthy.

      The question is how we can have ZKFC and NN detect such OS/HW-specific issues quickly. Some ideas we discussed briefly:

      • Have other machines help decide whether the NN is actually healthy. We would then have to figure out how to keep that decision accurate in the presence of network issues, etc.
      • Run an OS/HW health check script external to ZKFC/NN on the same machine. If it detects disk or other issues, it can, for example, reboot the machine.
      • Run an OS/HW health check script inside ZKFC/NN. For example, NN's HAServiceProtocol#monitorHealth could be modified to call such a health check script.
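
To make the third idea concrete, here is a minimal sketch of an in-process disk probe. It assumes that a dead disk shows up as I/O that either fails or hangs, so it writes a small file, forces it to the device, and treats a timeout as unhealthy. The class name DiskProbe, the method isHealthy, and the 5-second timeout are hypothetical and not part of Hadoop; how a failed probe would be surfaced from HAServiceProtocol#monitorHealth is left open.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of Hadoop. */
public class DiskProbe {

    /** Writes one byte to a temp file under dir, forces it to the device, then deletes it. */
    private static void writeAndSync(Path dir) throws IOException {
        Path probe = Files.createTempFile(dir, "nn-health-", ".probe");
        try (FileChannel ch = FileChannel.open(probe, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(new byte[] {1});
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush data and metadata, so the write must reach the disk
        } finally {
            Files.deleteIfExists(probe);
        }
    }

    /**
     * Returns true only if the probe write completes within the timeout.
     * An I/O error or a hung write both count as unhealthy.
     */
    public static boolean isHealthy(Path dir, Duration timeout) {
        // Daemon thread: a probe stuck in a hung write must not keep the JVM alive.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-probe");
            t.setDaemon(true);
            return t;
        });
        Future<?> result = exec.submit(() -> {
            writeAndSync(dir);
            return null;
        });
        try {
            result.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // best effort; blocked disk I/O may not be interruptible
            return false;
        } catch (ExecutionException e) {
            return false; // the write itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        System.out.println(dir + " healthy: " + isHealthy(dir, Duration.ofSeconds(5)));
    }
}
```

The timeout matters more than the write itself: in the scenario above the process stayed responsive over RPC, so a probe that simply blocked on the disk would never report anything. Running it on a separate thread lets the caller give up and declare the node unhealthy.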

      Thoughts?


          People

            mingma Ming Ma
