  1. Hadoop HDFS
  2. HDFS-7725

Incorrect "nodes in service" metrics caused all writes to fail

    Details

      Description

      One of our clusters was intermittently unable to allocate blocks on any DataNode (DN). BlockPlacementPolicyDefault logged the following message for every DN:

      the node is too busy (load:x > y)
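
The message suggests a load check of roughly the following shape. This is a minimal sketch, not the actual BlockPlacementPolicyDefault code: the class name, the method name, the 2.0 factor, and the use of an in-service node count as the divisor are all assumptions made for illustration. It shows why a wrong divisor can make every node look too busy at once.

```java
/** Hypothetical sketch of a "too busy" load check; names and the 2.0 factor are assumptions. */
final class LoadCheckSketch {

    /**
     * Returns true when a node would be rejected as too busy.
     * nodeLoad is the node's own load, totalLoad is the load summed over the cluster,
     * and nodesInService is the counter used as the divisor for the average.
     */
    static boolean tooBusy(int nodeLoad, int totalLoad, int nodesInService) {
        // Guard against a zero or negative counter; the average then collapses to 0.
        double avgLoad = nodesInService > 0 ? (double) totalLoad / nodesInService : 0.0;
        double maxLoad = 2.0 * avgLoad; // assumed threshold: twice the average load
        return nodeLoad > maxLoad;
    }

    public static void main(String[] args) {
        // Three live nodes with load 10 each, so totalLoad is 30.
        System.out.println(tooBusy(10, 30, 3));  // correct counter: threshold is 20
        System.out.println(tooBusy(10, 30, 10)); // inflated counter: threshold drops to 6
        System.out.println(tooBusy(10, 30, -1)); // negative counter: threshold is 0
    }
}
```

Under these assumptions, a counter that is too large or non-positive lowers the threshold for every node, so an evenly loaded cluster can still reject all of its nodes.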
      

      It turns out that HeartbeatManager's nodesInService was computed incorrectly when admins decommission (decomm) or recommission (recomm) dead nodes. Here are two scenarios.

      • Decomm dead nodes. It turns out HDFS-7374 has fixed this; not sure whether that was intentional. cc Zhe Zhang, Andrew Wang, Aaron T. Myers. Here is the sequence of events without HDFS-7374.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decomm the node. nodesInService == -1
      • However, HDFS-7374 introduces another inconsistency when recomm is involved: the node is still dead after the last step, yet it is counted as in service.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decomm the node. nodesInService == 0
        • Recomm the node. nodesInService == 1
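
The two sequences above can be replayed with a toy model of the counter. This is a sketch under stated assumptions, not the HeartbeatManager implementation: the class, its methods, and the two boolean switches are hypothetical, chosen only so that the model reproduces the numbers listed in the scenarios. The third configuration, where both decomm and recomm check liveness, is an assumed invariant (the counter equals the number of nodes that are both alive and in service) and is not taken from any of the attached patches.

```java
/** Toy model of a single node's effect on nodesInService; all names are hypothetical. */
final class NodesInServiceModel {
    private final boolean decommChecksAlive; // does decomm skip the decrement for a dead node?
    private final boolean recommChecksAlive; // does recomm skip the increment for a dead node?
    private boolean alive = true;
    private boolean inService = true;
    private int nodesInService = 1; // the cluster starts with one live, in-service node

    NodesInServiceModel(boolean decommChecksAlive, boolean recommChecksAlive) {
        this.decommChecksAlive = decommChecksAlive;
        this.recommChecksAlive = recommChecksAlive;
    }

    /** The node stops heartbeating; it leaves the count only if it was counted. */
    void markDead() {
        if (alive && inService) {
            nodesInService--;
        }
        alive = false;
    }

    /** Admin decommissions the node. */
    void decommission() {
        if (inService && (alive || !decommChecksAlive)) {
            nodesInService--;
        }
        inService = false;
    }

    /** Admin recommissions the node. */
    void recommission() {
        if (!inService && (alive || !recommChecksAlive)) {
            nodesInService++;
        }
        inService = true;
    }

    int nodesInService() {
        return nodesInService;
    }

    public static void main(String[] args) {
        // Scenario 1 (without HDFS-7374): decomm of a dead node decrements again.
        NodesInServiceModel before = new NodesInServiceModel(false, false);
        before.markDead();
        before.decommission();
        System.out.println("without HDFS-7374: " + before.nodesInService());

        // Scenario 2 (with HDFS-7374): recomm of a dead node increments the count.
        NodesInServiceModel after = new NodesInServiceModel(true, false);
        after.markDead();
        after.decommission();
        after.recommission();
        System.out.println("with HDFS-7374: " + after.nodesInService());

        // Assumed invariant: both operations check liveness, so a dead node is never counted.
        NodesInServiceModel consistent = new NodesInServiceModel(true, true);
        consistent.markDead();
        consistent.decommission();
        consistent.recommission();
        System.out.println("both checks: " + consistent.nodesInService());
    }
}
```

In this model the first two configurations end at -1 and 1, matching the scenarios, while the configuration with both checks stays at 0 for a node that is dead throughout.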

        Attachments

        1. HDFS-7725-3.patch
          7 kB
          Ming Ma
        2. HDFS-7725-2.patch
          3 kB
          Ming Ma
        3. HDFS-7725.patch
          3 kB
          Ming Ma

              People

              • Assignee:
                mingma Ming Ma
                Reporter:
                mingma Ming Ma