Hadoop Common / HADOOP-923

DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

    Description

      The datanode sends a heartbeat to the namenode every 3 seconds. The namenode processes the heartbeat and sends back a list of blocks-to-be-replicated and blocks-to-be-deleted as part of the heartbeat response.

      When a couple of datanodes fail, heartbeat processing on the namenode becomes heavyweight. It acquires the global FSNamesystem lock, traverses the neededReplication structure, generates a list of blocks to be replicated, and responds to the heartbeat message. Determining the list of blocks-to-be-replicated takes plenty of CPU and, because it holds the global FSNamesystem lock, blocks processing of other heartbeats. Datanodes whose heartbeats are delayed this way can themselves time out, which is the cascading failure named in the title.

      It would improve scalability a lot if heartbeat processing did not require the FSNamesystem lock. In fact, a separate "heartbeat" lock already exists for this purpose.
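
      To illustrate the locking idea, here is a minimal Java sketch. It is not taken from the Hadoop source or from the attached patch: the class NameNodeSketch, its fields, and all method names are hypothetical, and the "namesystemLock" object merely stands in for the global FSNamesystem lock. The point is only that recording a heartbeat touches the small heartbeat lock, while the expensive scan of pending replication work takes the global lock on a separate code path.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
class NameNodeSketch {
    // Small lock that guards only the liveness table.
    private final Object heartbeatLock = new Object();
    // Stand-in for the global FSNamesystem lock.
    private final Object namesystemLock = new Object();

    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Queue<String> neededReplication = new ArrayDeque<>();

    /** Cheap path: record liveness without taking the global lock. */
    void heartbeat(String datanodeId, long nowMillis) {
        synchronized (heartbeatLock) {
            lastSeen.put(datanodeId, nowMillis);
        }
    }

    /** Liveness check, also under the heartbeat lock only. */
    boolean isAlive(String datanodeId, long nowMillis, long timeoutMillis) {
        synchronized (heartbeatLock) {
            Long seen = lastSeen.get(datanodeId);
            return seen != null && nowMillis - seen <= timeoutMillis;
        }
    }

    /** Queue a block that needs another replica (global lock). */
    void addNeededReplication(String blockId) {
        synchronized (namesystemLock) {
            neededReplication.add(blockId);
        }
    }

    /** Expensive path: hand out up to maxBlocks of pending work (global lock). */
    List<String> computeBlocksToReplicate(int maxBlocks) {
        synchronized (namesystemLock) {
            List<String> work = new ArrayList<>();
            while (work.size() < maxBlocks && !neededReplication.isEmpty()) {
                work.add(neededReplication.poll());
            }
            return work;
        }
    }
}
```

      In this sketch a thread stuck inside computeBlocksToReplicate never delays a call to heartbeat, because the two methods synchronize on different objects.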

      I propose that the Heartbeat message be separated from the "retrieve blocks-to-replicate and blocks-to-delete" messages. The datanode can continue to heartbeat once every 3 seconds, while it can afford to "retrieve blocks-to-replicate" at a much coarser interval. Heartbeat processing on the namenode will be fast because it does not require the global FSNamesystem lock. Moreover, a datanode failure will not aggravate the heartbeat processing time on the namenode.
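
      The datanode side of this proposal can be sketched as two independent timers. The following Java fragment is illustrative only: the class DataNodeScheduleSketch, the message names, and the 30-second block-work interval are assumptions (the description only says "much coarser"); the 3-second heartbeat interval is the one stated above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a datanode deciding which messages are due. */
class DataNodeScheduleSketch {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;    // from the description
    static final long BLOCK_WORK_INTERVAL_MS = 30_000;  // assumed value, not from the issue

    private long lastHeartbeat;
    private long lastBlockWork;

    DataNodeScheduleSketch(long startMillis) {
        this.lastHeartbeat = startMillis;
        this.lastBlockWork = startMillis;
    }

    /** Returns the messages that should be sent at time nowMillis. */
    List<String> messagesDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        // Lightweight liveness message, sent frequently.
        if (nowMillis - lastHeartbeat >= HEARTBEAT_INTERVAL_MS) {
            due.add("HEARTBEAT");
            lastHeartbeat = nowMillis;
        }
        // Heavier request for replicate/delete work, sent rarely.
        if (nowMillis - lastBlockWork >= BLOCK_WORK_INTERVAL_MS) {
            due.add("RETRIEVE_BLOCK_WORK");
            lastBlockWork = nowMillis;
        }
        return due;
    }
}
```

      With these assumed intervals, a datanode polled once per second sends ten heartbeats for every one block-work request, so only a tenth of its calls would need the expensive namenode path.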

      Attachments

        1. pendingTransferThread2.patch
          17 kB
          Dhruba Borthakur


          People

            Assignee: Dhruba Borthakur (dhruba)
            Reporter: Dhruba Borthakur (dhruba)
