Hadoop HDFS / HDFS-7725

Incorrect "nodes in service" metrics caused all writes to fail

Details

    Description

      One of our clusters was sometimes unable to allocate blocks on any DataNode (DN). BlockPlacementPolicyDefault logged the following message for every DN:

      the node is too busy (load:x > y)
      

      It turns out that the HeartbeatManager's nodesInService count was computed incorrectly when admins decommission (decomm) or recommission (recomm) dead nodes. Here are two scenarios.

      • Decomm dead nodes. It turns out HDFS-7374 has already fixed this, though it is not clear whether that was intentional (cc Zhe Zhang, Andrew Wang, Aaron Myers). Here is the sequence of events without HDFS-7374.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decomm the node. nodesInService == -1
      • However, HDFS-7374 introduces another inconsistency when recomm is involved.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decomm the node. nodesInService == 0
        • Recomm the node. nodesInService == 1, even though the node is still dead.
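
The two sequences above can be reproduced with a small counter model. This is only a sketch under stated assumptions, not HeartbeatManager's actual code: the `Tracker` and `Node` classes and the `checkAliveOnDecomm` / `checkAliveOnRecomm` flags are hypothetical names introduced here for illustration. The third run, with both checks enabled, shows one possible consistent behavior and is not necessarily what the attached patches implement.

```java
import java.util.Arrays;

public class NodesInServiceSketch {
    /** Hypothetical stand-in for a DataNode's state (not the real DatanodeDescriptor). */
    static final class Node {
        boolean alive = false;
        boolean decommissioned = false;
    }

    /** Hypothetical incremental counter; the flags select which liveness checks apply. */
    static final class Tracker {
        final boolean checkAliveOnDecomm;
        final boolean checkAliveOnRecomm;
        int nodesInService = 0;

        Tracker(boolean checkAliveOnDecomm, boolean checkAliveOnRecomm) {
            this.checkAliveOnDecomm = checkAliveOnDecomm;
            this.checkAliveOnRecomm = checkAliveOnRecomm;
        }

        void register(Node n) {
            n.alive = true;
            if (!n.decommissioned) nodesInService++;
        }

        void markDead(Node n) {
            if (!n.alive) return;
            n.alive = false;
            if (!n.decommissioned) nodesInService--;
        }

        void decommission(Node n) {
            if (n.decommissioned) return;
            n.decommissioned = true;
            // Without the liveness check, a dead node is subtracted a second time.
            if (!checkAliveOnDecomm || n.alive) nodesInService--;
        }

        void recommission(Node n) {
            if (!n.decommissioned) return;
            n.decommissioned = false;
            // Without the liveness check, a dead node is counted as in service again.
            if (!checkAliveOnRecomm || n.alive) nodesInService++;
        }
    }

    /** Returns nodesInService after: register, node dies, decomm, recomm. */
    static int[] run(boolean checkAliveOnDecomm, boolean checkAliveOnRecomm) {
        Tracker t = new Tracker(checkAliveOnDecomm, checkAliveOnRecomm);
        Node n = new Node();
        int[] counts = new int[4];
        t.register(n);      counts[0] = t.nodesInService;
        t.markDead(n);      counts[1] = t.nodesInService;
        t.decommission(n);  counts[2] = t.nodesInService;
        t.recommission(n);  counts[3] = t.nodesInService;
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("no checks:         " + Arrays.toString(run(false, false)));
        System.out.println("decomm check only: " + Arrays.toString(run(true, false)));
        System.out.println("both checks:       " + Arrays.toString(run(true, true)));
    }
}
```

In this model, the run with no checks reaches -1 after the decomm step (the first scenario), and the run with only the decomm check ends at 1 after recomm while the node is still dead (the second scenario). With both checks the count stays at 0 for the dead node throughout.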

      Attachments

        1. HDFS-7725.patch (3 kB, Ming Ma)
        2. HDFS-7725-2.patch (3 kB, Ming Ma)
        3. HDFS-7725-3.patch (7 kB, Ming Ma)

          People

            Ming Ma (mingma)
            Votes: 0
            Watchers: 8
