  1. Hadoop HDFS
  2. HDFS-9107

Prevent NN's unrecoverable death spiral after full GC

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 2.8.0, 2.9.0, 3.0.0-alpha1, 2.7.5
    • Component/s: namenode
    • Labels: None

    Description

      A full GC pause in the NameNode (NN) that exceeds the dead node interval can lead to an infinite cycle of full GCs. The most common situation that precipitates an unrecoverable state is a network issue that temporarily cuts off multiple racks.

      After the pause, the NN wakes up and falsely starts marking nodes dead, because their last heartbeats now look older than the dead node interval. This bloats the replication queues, which increases memory pressure. The resulting replications create a flurry of incremental block reports and a glut of over-replicated blocks.
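
      To make the false-dead condition concrete, here is a minimal sketch. It assumes the dead node interval is computed as 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which with the usual defaults (300000 ms and 3 s) gives 10.5 minutes. The class and method names are hypothetical and are not taken from the NameNode source or the attached patch.

```java
public class DeadNodeIntervalSketch {
    // Assumed expiry formula: 2 * recheck interval + 10 * heartbeat interval.
    public static long expireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    }

    // A node looks dead when its last recorded heartbeat is older than the expiry window.
    public static boolean looksDead(long lastHeartbeatMs, long nowMs, long expireMs) {
        return nowMs - lastHeartbeatMs > expireMs;
    }

    public static void main(String[] args) {
        long expireMs = expireIntervalMs(300_000L, 3L);

        // Hypothetical last-heartbeat timestamps recorded just before the pause.
        long[] lastHeartbeats = {0L, 1_000L, 2_000L, 3_000L};

        // The NN then stalls in a full GC for 11 minutes. The nodes kept
        // heartbeating, but the NN could not record any of it.
        long nowMs = 3_000L + 11 * 60 * 1000L;

        int falselyDead = 0;
        for (long last : lastHeartbeats) {
            if (looksDead(last, nowMs, expireMs)) {
                falselyDead++;
            }
        }
        System.out.println("expire interval (ms): " + expireMs);
        System.out.println("healthy nodes that look dead: " + falselyDead
                + " of " + lastHeartbeats.length);
    }
}
```

      Every node in the sketch is healthy, yet all of them exceed the expiry window, because the staleness comes from the NN's own pause rather than from the nodes.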

      The "dead" nodes heartbeat within seconds. The NN forces a re-registration, which requires a full block report and adds more memory pressure. The NN now has to invalidate all the over-replicated blocks. The extra blocks are added to invalidation queues, tracked in an excess blocks map, etc., which adds much more memory pressure.

      All the memory pressure can push the NN into another full GC, which repeats the entire cycle.
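
      One way to break such a cycle is to have the periodic heartbeat scan notice that it was itself delayed and skip that round, giving the nodes time to heartbeat again before anything is declared dead. The sketch below illustrates that idea only. The class, the method, and the 10 second threshold are assumptions for illustration and should not be read as a description of what the attached patch does.

```java
public class HeartbeatScanGuardSketch {
    private final long maxGapMs;   // longest gap between scans still treated as normal
    private long lastScanMs;       // timestamp of the previous scan attempt

    public HeartbeatScanGuardSketch(long maxGapMs, long startMs) {
        this.maxGapMs = maxGapMs;
        this.lastScanMs = startMs;
    }

    // Returns true if the dead-node scan should run now, or false if the gap
    // since the previous attempt suggests the process itself was paused.
    public boolean shouldScan(long nowMs) {
        long gapMs = nowMs - lastScanMs;
        lastScanMs = nowMs;
        return gapMs <= maxGapMs;
    }

    public static void main(String[] args) {
        // Hypothetical setup: scans are scheduled every 5 s, and gaps up to 10 s are tolerated.
        HeartbeatScanGuardSketch guard = new HeartbeatScanGuardSketch(10_000L, 0L);

        System.out.println("t=5s     scan? " + guard.shouldScan(5_000L));
        // A long full GC delays the next attempt by more than eleven minutes.
        System.out.println("t=700s   scan? " + guard.shouldScan(700_000L));
        // The following attempt arrives on schedule, so scanning resumes.
        System.out.println("t=705s   scan? " + guard.shouldScan(705_000L));
    }
}
```

      A real implementation would read a monotonic clock such as System.nanoTime() instead of taking timestamps from the caller. The sketch passes them in so that the behavior is easy to follow.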

      Attachments

        1. HDFS-9107.patch
          6 kB
          Daryn Sharp
        2. HDFS-9107.patch
          1 kB
          Daryn Sharp

          People

            Assignee: Daryn Sharp (daryn)
            Reporter: Daryn Sharp (daryn)
            Votes: 0
            Watchers: 24
