Hadoop HDFS
HDFS-6148

LeaseManager crashes while initiating block recovery

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      While running branch-2.4, the LeaseManager crashed with an NPE. The crash does not occur on every block recovery.

      Exception in thread "org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@5d66b728" java.lang.NullPointerException
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction$ReplicaUnderConstruction.isAlive(BlockInfoUnderConstruction.java:121)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.initializeBlockRecovery(BlockInfoUnderConstruction.java:286)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3746)
      at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:474)
      at org.apache.hadoop.hdfs.server.namenode.LeaseManager.access$900(LeaseManager.java:68)
      at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:411)
      at java.lang.Thread.run(Thread.java:722)

          Activity

          Kihwal Lee added a comment -

          replicas.size() was non-zero and there was a corresponding ReplicaUnderConstruction, but its expectedLocation appeared to be null. This can happen if setExpectedStorageLocations() is called with an array of nulls, which might occur if a last block with null locations is converted into a BlockInfoUnderConstruction. There may be other ways to reach this state, though.
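
As a rough illustration of the hypothesis above, the following Java sketch models how a replica with a null expected location could trigger an NPE when liveness is checked. SketchStorage, SketchReplica and SketchBlockUC are hypothetical stand-ins invented for this example, not the actual Hadoop classes, and the real code paths may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a datanode storage; not the real Hadoop class.
class SketchStorage {
    final boolean alive;
    SketchStorage(boolean alive) { this.alive = alive; }
}

// Hypothetical stand-in for ReplicaUnderConstruction.
class SketchReplica {
    final SketchStorage expectedLocation; // null in the failing case
    SketchReplica(SketchStorage expectedLocation) {
        this.expectedLocation = expectedLocation;
    }
    boolean isAlive() {
        // Dereferences expectedLocation without a null check.
        return expectedLocation.alive;
    }
}

// Hypothetical stand-in for BlockInfoUnderConstruction.
class SketchBlockUC {
    final List<SketchReplica> replicas = new ArrayList<>();

    // Creates one replica per array slot, null slots included.
    void setExpectedLocations(SketchStorage[] targets) {
        replicas.clear();
        for (SketchStorage target : targets) {
            replicas.add(new SketchReplica(target));
        }
    }

    // Counts live replicas; throws NullPointerException on a null location.
    int countAliveForRecovery() {
        int alive = 0;
        for (SketchReplica replica : replicas) {
            if (replica.isAlive()) {
                alive++;
            }
        }
        return alive;
    }
}
```

In this model, passing an array containing a single null yields a non-empty replica list, and the liveness scan then fails on the first element, which is consistent with the observation that replicas.size() was non-zero.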

          Kihwal Lee added a comment -

          It may have something to do with loading fsimage + edits and processing under-construction files. LeaseManager crashes one hour after NN start-up.

          Kihwal Lee added a comment -

          Sorry, it was seen on a 2.3 cluster. I will verify whether this bug still exists in 2.4.

          Kihwal Lee added a comment -

          This happens when only the NN restarts and an incremental block report is received after node registration but before the storage is added, i.e., a queued incremental block report arrives before the first heartbeat. In this case, BlockInfoUnderConstruction#addReplicaIfNotPresent() is called from addStoredBlockUnderConstruction(), but the StorageInfo is null: because the storage has not been added yet, node.getStorageInfo(storageID) returns null. As a result, the BlockInfoUnderConstruction ends up with one ReplicaUnderConstruction whose expectedLocation is null. This is apparent in the log message produced while processing such an incremental block report:

          WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for 
          blk_1089713407_xxxxx{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
          replicas=[ReplicaUnderConstruction[null|FINALIZED]]} on 1.2.3.4:1004 size 0
          

          After this, fsck fails with an NPE and the LeaseManager also crashes with an NPE.
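
The ordering problem described above can be sketched in Java roughly as follows. RaceNode, RaceBlockUC and RaceNameNode are hypothetical names made up for this illustration, and storages are modelled as plain strings; only the method names getStorageInfo and addReplicaIfNotPresent echo the ones mentioned in the comment, and their real signatures are not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a registered datanode and the storages the NN knows about.
class RaceNode {
    private final Map<String, String> storages = new HashMap<>();

    // Models the heartbeat that makes a storage known to the NN.
    void addStorage(String storageId) {
        storages.put(storageId, "storage:" + storageId);
    }

    // Returns null when the storage has not been added yet.
    String getStorageInfo(String storageId) {
        return storages.get(storageId);
    }
}

// Hypothetical model of an under-construction block and its expected locations.
class RaceBlockUC {
    final List<String> expectedLocations = new ArrayList<>();

    // Adds the location if absent; a null location is accepted silently.
    void addReplicaIfNotPresent(String storage) {
        if (!expectedLocations.contains(storage)) {
            expectedLocations.add(storage);
        }
    }
}

class RaceNameNode {
    // Models incremental block report processing with no null check on the lookup.
    static void processIncrementalReport(RaceNode node, String storageId, RaceBlockUC block) {
        block.addReplicaIfNotPresent(node.getStorageInfo(storageId));
    }
}
```

If processIncrementalReport runs before addStorage for the same storage ID, the block records a null location; if the heartbeat is modelled first, the location is non-null. Any later code that dereferences the recorded location would then fail only in the first ordering.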

          Kihwal Lee added a comment -

          Looks like HDFS-6094 fixed this in 2.4.0 by making NN learn StorageInfo from incremental block reports.
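
The general idea of learning a storage from an incremental block report can be sketched in Java as below. This is not the HDFS-6094 patch; LearnNode and learnStorage are hypothetical names chosen for this example, and storages are again modelled as plain strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a datanode whose storages can be learned lazily.
class LearnNode {
    private final Map<String, String> storages = new HashMap<>();

    // Plain lookup: returns null for a storage that has never been seen.
    String getStorageInfo(String storageId) {
        return storages.get(storageId);
    }

    // Registers the storage on first sight and returns it; never returns null.
    String learnStorage(String storageId) {
        return storages.computeIfAbsent(storageId, id -> "storage:" + id);
    }
}
```

With this approach, a report handler that calls learnStorage instead of getStorageInfo always obtains a non-null storage, so no replica is recorded with a null location regardless of whether the first heartbeat has been processed.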

          Kihwal Lee added a comment -

          Marking it as a duplicate of HDFS-6094.

          Tsz Wo Nicholas Sze added a comment -

          It is really good that this has already been fixed. Thanks for checking it!


            People

            • Assignee:
              Kihwal Lee
              Reporter:
              Kihwal Lee
            • Votes:
              0
              Watchers:
              7