  1. Hadoop HDFS
  2. HDFS-7725

Incorrect "nodes in service" metrics caused all writes to fail

Details

    Description

      One of our clusters was sometimes unable to allocate blocks on any DataNode (DN). BlockPlacementPolicyDefault rejected every DN with the following message:

      the node is too busy (load:x > y)
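
For context, here is a minimal sketch of the kind of check that produces this message. It assumes a node is rejected when its load exceeds a fixed multiple of the average load across in-service nodes. The class name, method names, and the 2.0 multiplier are hypothetical illustrations, not the actual BlockPlacementPolicyDefault code.

```java
// Hypothetical sketch of a "too busy" check; not the real HDFS implementation.
final class LoadCheckSketch {
    // Assumed multiplier: reject a node whose load exceeds this many times the average.
    static final double LOAD_FACTOR = 2.0;

    // Average load per in-service node; 0.0 when the count is not positive (assumption).
    static double averageLoad(int totalLoad, int nodesInService) {
        return nodesInService > 0 ? (double) totalLoad / nodesInService : 0.0;
    }

    // True when the node's load exceeds the allowed maximum derived from the average.
    static boolean isTooBusy(int nodeLoad, int totalLoad, int nodesInService) {
        return nodeLoad > LOAD_FACTOR * averageLoad(totalLoad, nodesInService);
    }

    public static void main(String[] args) {
        // Three nodes, each with a load of 10, so the total load is 30.
        System.out.println(isTooBusy(10, 30, 3));  // false: 10 > 2.0 * 10.0 does not hold
        // Same cluster, but the in-service count is wrongly reported as 10.
        System.out.println(isTooBusy(10, 30, 10)); // true: 10 > 2.0 * 3.0
    }
}
```

Under these assumptions, an in-service count that is too high lowers the computed average, so every node looks overloaded; a zero or negative count has the same effect for any node with nonzero load.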
      

      It turns out that the HeartbeatManager's nodesInService count was computed incorrectly when admins decommissioned (decomm) or recommissioned (recomm) dead nodes. Here are two scenarios.

      • Decomm dead nodes. It turns out HDFS-7374 has fixed this; not sure if that was intentional. cc zhz, andrew.wang, atm. Here is the sequence of events without HDFS-7374.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decomm the node. nodesInService == -1
      • However, HDFS-7374 introduces another inconsistency when recomm is involved.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decomm the node. nodesInService == 0
        • Recomm the node. nodesInService == 1, even though the node is still dead
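
Both sequences above can be replayed with a small model. This is a sketch under stated assumptions, not the HeartbeatManager code: the Node and Counter classes, the trace helper, and the two guard flags are hypothetical names introduced for illustration. Each flag controls whether decomm or recomm checks that the node is alive before adjusting the counter. The last configuration, with both guards on, is one possible way to keep the count consistent and is not necessarily what the attached patches do.

```java
import java.util.Arrays;

// Hypothetical model of an incrementally maintained nodesInService counter.
final class NodesInServiceSketch {
    // Stand-in for a DataNode's liveness and admin state.
    static final class Node {
        boolean alive = true;
        boolean decommissioned = false;
    }

    static final class Counter {
        private final boolean guardDecomm; // check liveness before decrementing on decomm
        private final boolean guardRecomm; // check liveness before incrementing on recomm
        int nodesInService = 0;

        Counter(boolean guardDecomm, boolean guardRecomm) {
            this.guardDecomm = guardDecomm;
            this.guardRecomm = guardRecomm;
        }

        void register(Node n) {
            nodesInService++; // a new node starts alive and in service
        }

        void markDead(Node n) {
            if (n.alive && !n.decommissioned) {
                nodesInService--;
            }
            n.alive = false;
        }

        void decommission(Node n) {
            if (!n.decommissioned && (n.alive || !guardDecomm)) {
                nodesInService--;
            }
            n.decommissioned = true;
        }

        void recommission(Node n) {
            if (n.decommissioned && (n.alive || !guardRecomm)) {
                nodesInService++;
            }
            n.decommissioned = false;
        }
    }

    // Replays: register, node dies, decomm, recomm. Records the count after each step.
    static int[] trace(boolean guardDecomm, boolean guardRecomm) {
        Counter c = new Counter(guardDecomm, guardRecomm);
        Node n = new Node();
        int[] counts = new int[4];
        c.register(n);
        counts[0] = c.nodesInService;
        c.markDead(n);
        counts[1] = c.nodesInService;
        c.decommission(n);
        counts[2] = c.nodesInService;
        c.recommission(n);
        counts[3] = c.nodesInService;
        return counts;
    }

    public static void main(String[] args) {
        // No guards: the first scenario, the count goes negative on decomm.
        System.out.println(Arrays.toString(trace(false, false))); // [1, 0, -1, 0]
        // Decomm guarded only: the second scenario, recomm counts a dead node.
        System.out.println(Arrays.toString(trace(true, false)));  // [1, 0, 0, 1]
        // Both guarded: the count stays at 0 while the node is dead.
        System.out.println(Arrays.toString(trace(true, true)));   // [1, 0, 0, 0]
    }
}
```

The first three values of each trace correspond to the steps listed in the scenarios; the fourth value in the unguarded trace is an extra recomm step added by the model.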

      Attachments

        1. HDFS-7725-3.patch
          7 kB
          Ming Ma
        2. HDFS-7725-2.patch
          3 kB
          Ming Ma
        3. HDFS-7725.patch
          3 kB
          Ming Ma

            People

              mingma Ming Ma