
HADOOP-923: DFS Scalability: datanode heartbeat timeouts cause cascading timeouts of other datanodes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

      Description

      The datanode sends a heartbeat to the namenode every 3 seconds. The namenode processes the heartbeat and sends a list of blocks-to-be-replicated and blocks-to-be-deleted as part of the heartbeat response.

      When even a couple of datanodes fail, heartbeat processing on the namenode becomes quite heavyweight: it acquires the global FSNamesystem lock, traverses the neededReplication structure, generates a list of blocks to be replicated, and responds to the heartbeat message. Determining the list of blocks-to-be-replicated is CPU-intensive and, because of the global FSNamesystem lock, blocks the processing of other heartbeats.

      Scalability would improve considerably if heartbeat processing did not require the FSNamesystem lock; the pre-existing "heartbeat" lock already serves this purpose.

      I propose that the heartbeat message be separated from the "retrieve blocks-to-replicate and blocks-to-delete" messages. The datanode can continue to heartbeat once every 3 seconds while retrieving blocks-to-replicate at a much coarser interval. Heartbeat processing on the namenode will then be fast because it does not require the global FSNamesystem lock. Moreover, a datanode failure will not aggravate the heartbeat processing time on the namenode.
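      To make the proposed split concrete, here is a minimal Java sketch. The names (SplitDatanodeProtocol, BlockWork, getBlockWork) are illustrative stand-ins for discussion, not the actual Hadoop API:

      /** Illustrative container for the namenode's response. */
      class BlockWork {
        long[] blocksToReplicate;   // block ids this datanode should copy elsewhere
        long[] blocksToDelete;      // block ids this datanode should remove
      }

      interface SplitDatanodeProtocol {
        // Lightweight liveness report, sent every 3 seconds; processing it
        // should need only the namenode's internal "heartbeat" lock.
        void sendHeartbeat(String datanodeId, long capacityBytes, long remainingBytes);

        // Heavier call issued at a much coarser interval; the namenode may
        // take the global FSNamesystem lock to compute the pending work.
        BlockWork getBlockWork(String datanodeId);
      }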

        Activity

        dhruba borthakur created issue -
        dhruba borthakur added a comment -

        I gathered simulated measurements on the time to process a single heartbeat on the namenode versus the number of blocks to replicate. Here is the data:

        Pending blocks to replicate    Time to process one heartbeat (ms)
        100,000                        2
        500,000                        5
        600,000                        6

        100,000 blocks typically corresponds to about 12 TB, roughly the capacity of three typical datanodes. Thus, if three datanodes go down at the same time, the namenode spends about 2 ms of CPU time to process a single incoming heartbeat, and the global FSNamesystem lock is held for that entire 2 ms.

        In the current implementation, heartbeats are sent by each datanode once every 3 seconds. A 1500-node cluster will therefore cause the namenode to spend about 3 seconds (2 ms * 1500) of every heartbeat interval processing heartbeat requests alone, so the current DFS's scalability could be limited to about 1500 datanodes.

        The above results could vary depending on the type of hardware and communication link in use.
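        As a back-of-envelope check of the 1500-node figure, here is a quick calculation (a hypothetical helper, using the 2 ms measurement from the table above):

        /** Rough estimate of namenode CPU time spent on heartbeats. */
        public class HeartbeatLoadEstimate {
          public static void main(String[] args) {
            double msPerHeartbeat = 2.0;   // measured at ~100,000 pending blocks
            int datanodes = 1500;          // cluster size
            double intervalSec = 3.0;      // per-datanode heartbeat period

            double busyMsPerInterval = msPerHeartbeat * datanodes;           // 3000 ms
            double utilization = (busyMsPerInterval / 1000.0) / intervalSec; // 1.0

            // Prints: 3000 ms per 3 s interval (100% of one core)
            System.out.printf("%.0f ms per %.0f s interval (%.0f%% of one core)%n",
                busyMsPerInterval, intervalSec, utilization * 100);
          }
        }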

        dhruba borthakur added a comment -

        Introduce a new DatanodeProtocol call named sendBlockModifications(). The namenode returns the blocks to be replicated or deleted as part of this call. The existing method sendHeartbeat() just updates the heartbeat array in the namenode; it does not send back the list of blocks pending replication or the blocks to be deleted.

        The datanode invokes the sendHeartbeat RPC once every 3 seconds and the sendBlockModifications RPC once every 10 heartbeats.

        The namenode acquires only the heartbeat lock while processing the sendHeartbeat call, and the global FSNamesystem lock while processing the sendBlockModifications call.

        The above change ensures that heartbeat processing time does not depend on the number of blocks pending replication.
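        A rough sketch of the datanode's send loop under this design, reusing the illustrative SplitDatanodeProtocol interface sketched in the description (getBlockWork standing in for sendBlockModifications); this is not the committed code:

        class DatanodeOfferService implements Runnable {
          static final long HEARTBEAT_INTERVAL_MS = 3_000;  // one heartbeat every 3 s
          static final int BLOCK_WORK_PERIOD = 10;          // fetch block work every 10 heartbeats

          private final SplitDatanodeProtocol namenode;     // RPC proxy to the namenode
          private final String datanodeId;

          DatanodeOfferService(SplitDatanodeProtocol namenode, String datanodeId) {
            this.namenode = namenode;
            this.datanodeId = datanodeId;
          }

          @Override
          public void run() {
            long heartbeats = 0;
            while (!Thread.currentThread().isInterrupted()) {
              try {
                // Cheap call: the namenode only touches its heartbeat structures.
                namenode.sendHeartbeat(datanodeId, capacity(), remaining());

                // Expensive call (global FSNamesystem lock) at a coarser interval.
                if (++heartbeats % BLOCK_WORK_PERIOD == 0) {
                  process(namenode.getBlockWork(datanodeId));
                }
                Thread.sleep(HEARTBEAT_INTERVAL_MS);
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            }
          }

          private long capacity()  { return 0; }   // placeholder for disk capacity
          private long remaining() { return 0; }   // placeholder for free space
          private void process(BlockWork work) { /* schedule transfers and deletes */ }
        }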

        Doug Cutting added a comment -

        Overall this sounds like a great direction.

        > new DatanodeProtocol call named sendBlockModifications().

        This seems more get-like than send-like. So maybe it should be called 'getBlockModifications()' or 'getBlockInstructions()'?

        dhruba borthakur added a comment -

        A first version of the patch is attached. Review comments welcome.

        I compared two implementations:
        Approach 1: keep a single heartbeat RPC and have the namenode transparently return the list-of-blocks-to-be-transferred once every nth heartbeat call.
        Approach 2: introduce a new RPC call that the datanode invokes to retrieve the list-of-blocks-to-be-transferred (default every 20 seconds).

        The current patch implements Approach 2. This approach also allows for a dedicated namenode thread to process heartbeats, should the need arise.

        dhruba borthakur made changes -
        Field Original Value New Value
        Attachment scalableheartbeats1.patch [ 12349710 ]
        dhruba borthakur made changes -
        Attachment scalableheartbeats1.patch [ 12349710 ]
        dhruba borthakur added a comment -

        A background thread computes pendingTransfers and chooseTargets, and stores the computed work in the relevant DatanodeDescriptor. A heartbeat from a datanode then retrieves this pre-computed work from its DatanodeDescriptor.
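        A hedged sketch of this scheme; the class and method names below are illustrative stand-ins for the FSNamesystem internals, not the actual patch:

        import java.util.ArrayDeque;
        import java.util.List;
        import java.util.Queue;

        /** Per-datanode holder for precomputed replication work. */
        class NodeDescriptor {
          private final Queue<Long> pendingWork = new ArrayDeque<>();

          // Filled in by the background thread (under the FSNamesystem lock).
          synchronized void addWork(long blockId) { pendingWork.add(blockId); }

          // Drained from the heartbeat path; cheap, no global lock required.
          synchronized Long pollWork() { return pendingWork.poll(); }
        }

        /** Background thread standing in for pendingTransfers + chooseTargets. */
        class ReplicationWorkComputer implements Runnable {
          private final List<NodeDescriptor> datanodes;

          ReplicationWorkComputer(List<NodeDescriptor> datanodes) {
            this.datanodes = datanodes;
          }

          @Override
          public void run() {
            while (!Thread.currentThread().isInterrupted()) {
              for (NodeDescriptor node : datanodes) {
                // Decide which under-replicated block, if any, this node
                // should transfer next, and stash it on its descriptor.
                long blockId = nextUnderReplicatedBlockFor(node);
                if (blockId >= 0) node.addWork(blockId);
              }
              try {
                Thread.sleep(3_000);   // recompute periodically
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            }
          }

          // Placeholder for the real neededReplications scan.
          private long nextUnderReplicatedBlockFor(NodeDescriptor node) { return -1; }
        }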

        dhruba borthakur made changes -
        Attachment pendingTransferThread.patch [ 12350598 ]
        Hairong Kuang added a comment -

        Two comments:
        1. I feel that it is not necessary to balance the number of transfers when the heartbeat thread picks up the replication work. First, the background thread that computes pendingTransfers has already balanced the load. Second, block replication work needs to be done as soon as possible to avoid data loss. Since the datanode has been assigned the block replication work, no other datanode is able to pick it up; if the work is not sent to the datanode in the current heartbeat, it has to wait for at least another heartbeat interval.

        2. The background thread that computes pendingTransfers scans only 100 datanodes per iteration and then sleeps for 3 seconds. I feel that this approach does not scale well. For example, when the cluster size reaches 2000, a datanode's work gets computed only every 2000/100 * 3 = 60 seconds (ignoring computation overhead), which is far less frequent than what we do now (every 3 seconds). Another minor flaw is that the thread uses an index to record the next node to be checked; if the heartbeat queue gets updated between two consecutive iterations, the index may not point to the right node.
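        To make the second concern concrete, here is a simplified, hypothetical version of such a scan loop (not the actual patch code). With 2000 datanodes and a batch of 100 per 3-second iteration, a full sweep takes (2000 / 100) * 3 = 60 seconds:

        import java.util.List;

        /** Simplified scan loop, only for illustrating the review concern. */
        class ScanSketch {
          static final int BATCH = 100;   // datanodes examined per iteration
          private int nextIndex = 0;      // saved position in the heartbeat list

          void scanOnce(List<String> heartbeatList) {
            if (heartbeatList.isEmpty()) return;
            for (int i = 0; i < BATCH; i++) {
              // If heartbeatList was reordered or resized since the previous
              // iteration, the saved index may now refer to a different node,
              // so a node can be skipped or scanned twice.
              String node = heartbeatList.get(nextIndex % heartbeatList.size());
              computeWorkFor(node);
              nextIndex = (nextIndex + 1) % heartbeatList.size();
            }
          }

          private void computeWorkFor(String node) { /* chooseTargets etc. */ }
        }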

        dhruba borthakur made changes -
        Attachment pendingTransferThread.patch [ 12350598 ]
        dhruba borthakur added a comment -

        Incorporated review comments. The change from the previous version is in the method FSNamesystem.computeDatanodeWork().

        dhruba borthakur made changes -
        Attachment pendingTransferThread2.patch [ 12350970 ]
        dhruba borthakur added a comment -

        Code has been reviewed by Hairong.

        dhruba borthakur made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Doug Cutting added a comment -

        I just committed this. Thanks, Dhruba!

        Doug Cutting made changes -
        Resolution Fixed [ 1 ]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.12.0 [ 12312293 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Owen O'Malley made changes -
        Component/s dfs [ 12310710 ]
        Haryadi Gunawi made changes -
        Description [edited]
        Transition                     Time In Source Status   Execution Times   Last Executer      Last Execution Date
        Open -> Patch Available        19d 22h                 1                 dhruba borthakur   12/Feb/07 22:59
        Patch Available -> Resolved    1h 52m                  1                 Doug Cutting       13/Feb/07 00:51
        Resolved -> Closed             17d 22h 10m             1                 Doug Cutting       02/Mar/07 23:02

          People

          • Assignee: dhruba borthakur
          • Reporter: dhruba borthakur
          • Votes: 0
          • Watchers: 0
