Hadoop Common / HADOOP-2606

Namenode unstable when replicating 500k blocks at once


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.3
    • Fix Version/s: 0.17.0
    • None
    • None

    Description

      We tried to decommission about 40 nodes at once, each containing 12k blocks (about 500k blocks in total).
      (This also happened when we first tried to decommission 2 million blocks.)

      Clients started seeing "java.lang.RuntimeException: java.net.SocketTimeoutException: timed out waiting for rpc
      response" errors, and the namenode was at 100% CPU.

      It was spending most of its time on one thread,

      "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800 nid=0x6718 runnable [0x0000000041a42000..0x0000000041a42a30]
         java.lang.Thread.State: RUNNABLE
              at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
              at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
              - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
              - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
              at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
              at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
              at java.lang.Thread.run(Thread.java:619)

      We confirmed that the namenode was not in a full GC state when these problems happened.

      Also, dfsadmin -metasave showed that "Blocks waiting for replication" was decreasing very slowly.

      I believe this is not specific to decommissioning; the same problem would happen if we lost one rack.
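
      One plausible reading of the thread dump above is that the ReplicationMonitor thread holds the FSNamesystem lock for the whole scan, so anything else that needs that lock has to wait, which would fit the RPC timeouts clients saw. The sketch below is only an illustration of that pattern, not Hadoop code: the class and method names are hypothetical, and a ReentrantLock stands in for the FSNamesystem monitor (an assumption made so that the waiting side can time out, which an intrinsic lock cannot do).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class LockStarvationSketch {

    /**
     * Starts a "monitor" thread that takes the shared lock and keeps it,
     * then lets the calling thread (standing in for an RPC handler) try to
     * take the same lock with a timeout. Returns true if the handler got
     * the lock before the timeout expired.
     */
    public static boolean handlerAcquiresWhileMonitorScans(long timeoutMs)
            throws InterruptedException {
        // Stand-in for the FSNamesystem monitor seen in the thread dump.
        ReentrantLock namesystemLock = new ReentrantLock();
        CountDownLatch monitorHoldsLock = new CountDownLatch(1);
        CountDownLatch handlerDone = new CountDownLatch(1);

        Thread monitor = new Thread(() -> {
            namesystemLock.lock();
            try {
                monitorHoldsLock.countDown();
                // Simulates a long scan: keep the lock until the handler
                // has finished its attempt.
                handlerDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                namesystemLock.unlock();
            }
        });
        monitor.setDaemon(true);
        monitor.start();

        // Wait until the monitor really owns the lock, then try to take it.
        monitorHoldsLock.await();
        boolean acquired = namesystemLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        if (acquired) {
            namesystemLock.unlock();
        }
        handlerDone.countDown();
        monitor.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("handler acquired lock: "
                + handlerAcquiresWhileMonitorScans(100));
    }
}
```

      In this sketch the handler always gives up, because the monitor does not release the lock until the handler has stopped waiting. Whether the real namenode behaves this way depends on how long each ReplicationMonitor pass holds the lock, which the report does not state.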

      Attachments

        1. ReplicatorTestOld.patch
          38 kB
          Konstantin Shvachko
        2. ReplicatorNew2.patch
          47 kB
          Konstantin Shvachko
        3. ReplicatorNew1.patch
          47 kB
          Konstantin Shvachko
        4. ReplicatorNew.patch
          43 kB
          Konstantin Shvachko

          People

            Assignee: Konstantin Shvachko (shv)
            Reporter: Koji Noguchi (knoguchi)
            Votes: 0
            Watchers: 3
