Hadoop HDFS / HDFS-15451

Restarting name node stuck in safe mode when using provided storage

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.2, 3.3.1, 3.4.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When HDFS provided storage is used (dfs.namenode.provided.enabled=true), restarting the name node sometimes leaves it stuck in safe mode.

      The data node sends its block report to the name node successfully, but the name node does not process the report properly, so HDFS remains in safe mode because blocks appear to be missing.

      In the name node log, this is the sequence of entries for one affected data node:

      2020-07-01 19:46:41,997 INFO blockmanagement.BlockReportLeaseManager: Registered DN af19d9e0-7b9b-45e0-9aa6-b2f404098084 (10.244.6.131:9866).
      2020-07-01 19:46:42,012 DEBUG blockmanagement.BlockReportLeaseManager: Created a new BR lease 0x476aaae689ebbc01 for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084.  numPending = 4
      2020-07-01 19:46:42,340 INFO BlockStateChange: BLOCK* processReport 0xcc610f42d0218cd9: discarded non-initial block report from DatanodeRegistration(10.244.6.131:9866, datanodeUuid=af19d9e0-7b9b-45e0-9aa6-b2f404098084, infoPort=0, infoSecurePort=9865, ipcPort=9867, 
      storageInfo=lv=-57;cid=CID-f49d3421-e04f-40b9-89ef-cf4fee73ad6a;nsid=497894240;c=1572548424451) because namenode still in startup phase
      2020-07-01 19:46:42,648 WARN blockmanagement.BlockReportLeaseManager: BR lease 0x476aaae689ebbc01 is not valid for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084, because the DN is not in the pending set.
      

      The root cause is that when BlockManager processes a report during the startup phase, it skips the report if storageInfo.getBlockReportCount() > 0 and removes the data node's lease:

      blockReportLeaseManager.removeLease(node)
      

      This happens because every data node reports a DS-PROVIDED storage along with its other storages (such as DISK storage). All DS-PROVIDED storages actually point to the same storageInfo, so the second data node that sends a block report containing DS-PROVIDED sees blockReportCount > 0. Its lease is then removed, and any later block report from that node fails at checkLease() with the message "BR lease is not valid".
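
      The interaction described above can be reproduced in a toy model. The sketch below is not the HDFS code: the classes StorageInfo, LeaseManager and NameNode, the storage name "DS-DISK", and the data node names "dn1" and "dn2" are hypothetical simplifications chosen for illustration. It only assumes what the description states: one shared storageInfo for DS-PROVIDED, a per-storage report counter, and a lease that is removed when a report is discarded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the failure. All names here are hypothetical simplifications,
// not the real HDFS classes or signatures.
final class ProvidedStorageLeaseSketch {

  /** Per-storage state; only the report counter matters for this sketch. */
  static class StorageInfo {
    int blockReportCount = 0;
  }

  /** Tracks which data nodes currently hold a block report lease. */
  static class LeaseManager {
    private final Set<String> pending = new HashSet<>();
    void grantLease(String dn) { pending.add(dn); }
    void removeLease(String dn) { pending.remove(dn); }
    boolean checkLease(String dn) { return pending.contains(dn); }
  }

  /** Name node in its startup phase, reduced to the two relevant checks. */
  static class NameNode {
    final LeaseManager leases = new LeaseManager();
    final Map<String, StorageInfo> perNodeStorages = new HashMap<>();
    // Assumption from the description: every DS-PROVIDED storage resolves
    // to this single shared object, regardless of the reporting data node.
    final StorageInfo sharedProvided = new StorageInfo();

    StorageInfo lookup(String dn, String storageId) {
      if (storageId.equals("DS-PROVIDED")) {
        return sharedProvided;
      }
      return perNodeStorages.computeIfAbsent(
          dn + "/" + storageId, k -> new StorageInfo());
    }

    String processReport(String dn, String storageId) {
      if (!leases.checkLease(dn)) {
        return "rejected: BR lease is not valid";
      }
      StorageInfo info = lookup(dn, storageId);
      if (info.blockReportCount > 0) {
        // Treated as a non-initial report: skipped, and the lease is dropped.
        leases.removeLease(dn);
        return "discarded non-initial block report";
      }
      info.blockReportCount++;
      return "processed";
    }
  }

  /** Two data nodes each report DS-PROVIDED first, then a local storage. */
  static List<String> simulate() {
    NameNode nn = new NameNode();
    nn.leases.grantLease("dn1");
    nn.leases.grantLease("dn2");
    List<String> log = new ArrayList<>();
    for (String dn : new String[] {"dn1", "dn2"}) {
      for (String storage : new String[] {"DS-PROVIDED", "DS-DISK"}) {
        log.add(dn + " " + storage + ": " + nn.processReport(dn, storage));
      }
    }
    return log;
  }

  public static void main(String[] args) {
    simulate().forEach(System.out::println);
  }
}
```

      In this model the first data node's reports are both processed. The second data node's DS-PROVIDED report is discarded because the shared counter is already 1, and its following DS-DISK report is rejected because the lease is gone. This mirrors the "discarded non-initial block report" and "BR lease ... is not valid" entries in the log above.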

          People

            Assignee: shanyu zhao
            Reporter: shanyu zhao
            Votes: 0
            Watchers: 6