Hadoop HDFS / HDFS-14186

Blockreport storm slows down namenode restart seriously in large cluster


Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

    Description

      In the current implementation, each datanode sends its blockreport immediately after it successfully registers with the namenode, so when the namenode restarts, the blockreport storm puts it under heavy load. One consequence is that some received RPCs have to be dropped because their queue time exceeds the timeout. If a datanode's heartbeat RPCs keep being dropped for long enough (the default heartbeatExpireInterval is 630s), the datanode is marked DEAD and has to re-register and send its blockreport again, which aggravates the blockreport storm and traps the cluster in a vicious circle. This slows down namenode startup seriously (by an hour or more), especially in a large (several thousand datanodes) and busy cluster. Although much work has been done to optimize namenode startup, the issue still exists.
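      For reference, here is a quick sketch of where the 630s figure comes from: with the default dfs.heartbeat.interval of 3s and dfs.namenode.heartbeat.recheck-interval of 300000ms, the expiry works out as below. The class name is illustrative, not HDFS code.
      {code:java}
      // Derivation of the default 630s heartbeat expiry from hdfs-default.xml values.
      public class HeartbeatExpiry {
        public static void main(String[] args) {
          long heartbeatIntervalSeconds = 3;         // dfs.heartbeat.interval (seconds)
          long heartbeatRecheckIntervalMs = 300_000; // dfs.namenode.heartbeat.recheck-interval (ms)

          // expire interval = 2 * recheck interval + 10 * heartbeat interval
          long heartbeatExpireIntervalMs =
              2 * heartbeatRecheckIntervalMs + 10 * 1000 * heartbeatIntervalSeconds;

          System.out.println(heartbeatExpireIntervalMs + " ms"); // 630000 ms = 630 s
        }
      }
      {code}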
      I propose to postpone the dead datanode check until the namenode has finished startup.
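      A minimal sketch of the idea (illustrative only, not the attached patch; the interfaces below are hypothetical stand-ins for the namenode internals): the heartbeat monitor simply skips the dead-node check while startup is still in progress, so heartbeats delayed behind the blockreport storm cannot get datanodes declared DEAD and forced to re-register.
      {code:java}
      import java.util.List;
      import java.util.concurrent.TimeUnit;

      // Illustrative sketch: postpone marking datanodes DEAD until startup has finished.
      class HeartbeatMonitorSketch implements Runnable {
        private static final long HEARTBEAT_EXPIRE_INTERVAL_MS = 630_000; // see derivation above
        private final NamesystemView namesystem;    // hypothetical narrow view of the namesystem
        private final List<DatanodeInfo> datanodes; // hypothetical registered-datanode list

        HeartbeatMonitorSketch(NamesystemView namesystem, List<DatanodeInfo> datanodes) {
          this.namesystem = namesystem;
          this.datanodes = datanodes;
        }

        @Override
        public void run() {
          while (!Thread.currentThread().isInterrupted()) {
            // Proposed change: do not run the dead-node check until startup is complete,
            // so slow heartbeats during the blockreport storm cannot expire a datanode.
            if (!namesystem.isStartupInProgress()) {
              long now = System.currentTimeMillis();
              for (DatanodeInfo dn : datanodes) {
                if (now - dn.lastHeartbeatMs() > HEARTBEAT_EXPIRE_INTERVAL_MS) {
                  dn.markDead();
                }
              }
            }
            try {
              TimeUnit.SECONDS.sleep(5); // recheck period
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        }

        /** Hypothetical interfaces standing in for the real namenode data structures. */
        interface NamesystemView { boolean isStartupInProgress(); }
        interface DatanodeInfo { long lastHeartbeatMs(); void markDead(); }
      }
      {code}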
      Any comments and suggestions are welcome.

      Attachments

        1. HDFS-14186.001.patch (11 kB, Xiaoqiao He)


          People

            Assignee: Xiaoqiao He (hexiaoqiao)
            Reporter: Xiaoqiao He (hexiaoqiao)

            Dates

              Created:
              Updated:
