  1. Hadoop HDFS
  2. HDFS-15761

Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved

    Description

      To decommission a dead DN, the complete state sequence should be:
      Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

      Current logic:

      If a DN is already dead when decommissioning starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.

      This logic was introduced by HDFS-7374, which was made in response to HDFS-6791.

      HDFS-6791 keeps a node in the DECOMMISSION_INPROGRESS state if it becomes dead during decommissioning. As a result, a dead DN could stay in DECOMMISSION_INPROGRESS forever if it never comes back alive.

      However, moving a dead DN directly to DECOMMISSIONED is not safe. For example, suppose the 3 DNs holding the replicas of the same block are dead at the same time, and the administrator then decommissions them unintentionally. The NameNode should check replication first before transitioning them to DECOMMISSIONED. Otherwise, data would be lost.

      In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to intervene manually, either repairing the dead machine or service or recovering the data, before taking action on them.
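
      The scenario above can be illustrated with a small sketch. It is not HDFS code: the class name, the method, and the block and DN identifiers (blk_1, dn1, ...) are all hypothetical, and replica locations are modeled as a plain map. It only shows why a block whose replicas all sit on dead DNs has to stop those DNs from reaching DECOMMISSIONED.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of HDFS.
final class ReplicaLossCheckSketch {

    /** Returns the blocks whose replicas are all located on dead DNs. */
    static Set<String> blocksWithoutLiveReplica(Map<String, List<String>> replicasByBlock,
                                                Set<String> deadNodes) {
        Set<String> atRisk = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : replicasByBlock.entrySet()) {
            boolean hasLiveReplica = false;
            for (String dn : entry.getValue()) {
                if (!deadNodes.contains(dn)) {
                    hasLiveReplica = true;
                    break;
                }
            }
            if (!hasLiveReplica) {
                atRisk.add(entry.getKey());
            }
        }
        return atRisk;
    }

    public static void main(String[] args) {
        // blk_1 has all 3 replicas on the dead DNs; blk_2 still has live replicas.
        Map<String, List<String>> replicas = Map.of(
            "blk_1", List.of("dn1", "dn2", "dn3"),
            "blk_2", List.of("dn1", "dn4", "dn5"));
        Set<String> dead = Set.of("dn1", "dn2", "dn3");
        System.out.println(blocksWithoutLiveReplica(replicas, dead)); // prints [blk_1]
    }
}
```

      If dn1, dn2 and dn3 were marked DECOMMISSIONED right away, blk_1 would have no replica left to copy from, so the decommission must not complete for them.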

      This change adds the Dead, DECOMMISSION_INPROGRESS state back:
      1. A dead NORMAL DN first enters DECOMMISSION_INPROGRESS.
      2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0.
      3. The NN transitions the dead DN to DECOMMISSIONED.

      Step 2 is implemented by HDFS-7409, which adds a check that allows dead nodes in DECOMMISSION_INPROGRESS to progress to the DECOMMISSIONED state if all files on the filesystem are fully replicated.
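
      The three steps can be sketched as a tiny state machine. This is a minimal illustration under assumptions, not the patch itself: the class, the enum and both method names are hypothetical, and the two counters are passed in as plain numbers rather than read from the NameNode.

```java
// Hypothetical sketch of the proposed transitions for a dead DN; not the HDFS implementation.
final class DeadNodeDecommissionSketch {

    /** Admin states named after the ones used in the description. */
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /** Step 1: a dead NORMAL DN enters DECOMMISSION_INPROGRESS instead of skipping it. */
    static AdminState startDecommission(AdminState current) {
        return current == AdminState.NORMAL ? AdminState.DECOMMISSION_INPROGRESS : current;
    }

    /** Steps 2 and 3: the dead DN becomes DECOMMISSIONED only when both counters are 0. */
    static AdminState checkDeadNode(AdminState current,
                                    long pendingReplicationBlocksCount,
                                    long underReplicatedBlocksCount) {
        boolean fullyReplicated =
            pendingReplicationBlocksCount == 0 && underReplicatedBlocksCount == 0;
        if (current == AdminState.DECOMMISSION_INPROGRESS && fullyReplicated) {
            return AdminState.DECOMMISSIONED;
        }
        return current;
    }

    public static void main(String[] args) {
        AdminState state = startDecommission(AdminState.NORMAL);
        System.out.println(state); // prints DECOMMISSION_INPROGRESS

        // Blocks are still under-replicated, so the DN stays in progress.
        state = checkDeadNode(state, 0, 3);
        System.out.println(state); // prints DECOMMISSION_INPROGRESS

        // Both counters are 0, so the DN can finish decommissioning.
        state = checkDeadNode(state, 0, 0);
        System.out.println(state); // prints DECOMMISSIONED
    }
}
```

      With the behavior described under "Current logic", the first call would return DECOMMISSIONED directly for a dead DN, which is the shortcut this issue proposes to remove.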

          Activity

            NickyYe Ye Ni added a comment - edited

            cc mingma, andrew.wang, zhz, elgoiri

            elgoiri Íñigo Goiri added a comment -

            Do we have a test that makes sure that we go to DECOMMISSION_INPROGRESS and then into DECOMMISSIONED?

            NickyYe Ye Ni added a comment -

            elgoiri Yes, TestDecommissioningStatus.java lines 434, 443, 496 and 497.


            People

              NickyYe Ye Ni
              NickyYe Ye Ni
              Votes: 0
              Watchers: 4

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 40m