Hadoop Common
HADOOP-994

DFS Scalability: a block report that returns a large number of blocks to be deleted causes the datanode to lose connectivity to the namenode


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

    Description

      The Datanode periodically sends a block report RPC to the Namenode. The RPC response lists the blocks that the Datanode should invalidate, and the Datanode then deletes the corresponding files. This deletion is done by the heartbeat thread in the Datanode, so if the number of files to be deleted is large, the Datanode stops sending heartbeats for the entire duration of the deletion. The Namenode then declares the Datanode "dead" and starts re-replicating its blocks.
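      As a minimal sketch of one possible remedy (illustrative Java, not the actual Hadoop DataNode code; the class and method names are assumptions), the deletions could be handed to a dedicated background thread so the heartbeat thread returns immediately:

{code:java}
// Minimal sketch, not the actual DataNode code: invalidated block files
// are queued and deleted on a dedicated thread, so the heartbeat thread
// never blocks on disk I/O.
import java.io.File;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncBlockDeleter {
    private final BlockingQueue<File> toDelete = new LinkedBlockingQueue<>();

    public AsyncBlockDeleter() {
        Thread deleter = new Thread(() -> {
            try {
                while (true) {
                    // All file deletion happens here, off the heartbeat
                    // thread, so a large invalidation list cannot stall
                    // heartbeats and trigger a false "dead node" verdict.
                    File f = toDelete.take();
                    if (!f.delete()) {
                        System.err.println("could not delete " + f);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "block-deleter");
        deleter.setDaemon(true);
        deleter.start();
    }

    // Called from the heartbeat thread after a block report; returns
    // immediately instead of deleting the files inline.
    public void scheduleDeletion(File blockFile) {
        toDelete.add(blockFile);
    }
}
{code}

      An alternative along the same lines would be for the Namenode to cap how many blocks it returns for invalidation per block report, spreading the work across several heartbeat intervals.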

      In my observed case, the block report returned 1669 blocks to be invalidated. The Datanode was running on a RAID5 ext3 filesystem with 4 active tasks on it. Deleting these 1669 files took about 30 minutes, wow! The average disk service time during this period was less than 10 ms, and the Datanode was using about 30% CPU.
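      A back-of-the-envelope check of these numbers: 30 minutes is 1800 s, so 1800 s / 1669 files ≈ 1.08 s per file. That is roughly two orders of magnitude above the sub-10 ms disk service time, which suggests the bottleneck is per-file deletion overhead rather than raw disk latency.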

      Attachments

        1. blockReportInvalidateBlock.patch (1 kB), Dhruba Borthakur


          People

            Assignee: Dhruba Borthakur (dhruba)
            Reporter: Dhruba Borthakur (dhruba)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved:
