  1. Hadoop Common
  2. HADOOP-2606

Namenode unstable when replicating 500k blocks at once

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.3
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None

    Description

      We tried to decommission about 40 nodes at once, each containing about 12k blocks (about 500k blocks in total).
      The same problem occurred when we first tried to decommission 2 million blocks.

      Clients started seeing "java.lang.RuntimeException: java.net.SocketTimeoutException: timed out waiting for rpc
      response", and the namenode was stuck at 100% CPU.

      The namenode was spending most of its time in a single thread:

      "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800 nid=0x6718 runnable [0x0000000041a42000..0x0000000041a42a30]
         java.lang.Thread.State: RUNNABLE
              at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
              at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
              - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
              - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
              at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
              at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
              at java.lang.Thread.run(Thread.java:619)
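
      The trace shows the ReplicationMonitor thread holding both the UnderReplicatedBlocks and FSNamesystem monitors while it scans. If client RPC handlers need the same FSNamesystem monitor (an assumption; the report does not confirm it), a long scan under that lock would be consistent with the RPC timeouts. The Java sketch below illustrates one generic way to bound lock hold time: process a fixed-size batch of blocks per lock acquisition. It is not the patch attached to this issue, and every class, field, and method name in it is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical illustration only: none of these names come from Hadoop.
 * Models a monitor that drains a queue of under-replicated block IDs
 * while limiting how many blocks it handles per lock acquisition.
 */
final class BoundedReplicationScan {
    // Stands in for a coarse, namesystem-wide lock (assumption).
    private final ReentrantLock namesystemLock = new ReentrantLock();
    private final Queue<Long> underReplicated = new ArrayDeque<>();
    private int lockAcquisitions = 0;
    private int maxBlocksPerHold = 0;

    BoundedReplicationScan(int blockCount) {
        for (long id = 0; id < blockCount; id++) {
            underReplicated.add(id);
        }
    }

    /** Drains the queue, handling at most batchSize blocks per lock hold. */
    void drain(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        boolean more = true;
        while (more) {
            namesystemLock.lock();
            try {
                lockAcquisitions++;
                int handled = 0;
                // Stop at the batch limit or when the queue runs dry.
                while (handled < batchSize && underReplicated.poll() != null) {
                    handled++; // real code would schedule a replication here
                }
                maxBlocksPerHold = Math.max(maxBlocksPerHold, handled);
                more = !underReplicated.isEmpty();
            } finally {
                // Releasing between batches gives other threads a chance to run.
                namesystemLock.unlock();
            }
        }
    }

    int getLockAcquisitions() { return lockAcquisitions; }
    int getMaxBlocksPerHold() { return maxBlocksPerHold; }

    public static void main(String[] args) {
        BoundedReplicationScan scan = new BoundedReplicationScan(500_000);
        scan.drain(1_000);
        System.out.println("lock holds: " + scan.getLockAcquisitions()
                + ", max blocks per hold: " + scan.getMaxBlocksPerHold());
    }
}
```

      With 500,000 queued blocks and a batch size of 1,000, the longest single lock hold in this sketch covers 1,000 blocks instead of all 500,000. The trade-off is more lock acquisitions (500 rather than 1) in exchange for shorter stalls for any other thread waiting on the lock.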

      We confirmed that the namenode was not in a full GC when these problems happened.

      Also, dfsadmin -metasave showed that the "Blocks waiting for replication" count was decreasing very slowly.

      I believe this is not specific to decommissioning; the same problem would happen if we lost a rack.

      Attachments

        1. ReplicatorTestOld.patch
          38 kB
          Konstantin Shvachko
        2. ReplicatorNew.patch
          43 kB
          Konstantin Shvachko
        3. ReplicatorNew1.patch
          47 kB
          Konstantin Shvachko
        4. ReplicatorNew2.patch
          47 kB
          Konstantin Shvachko

            People

              Assignee: Konstantin Shvachko (shv)
              Reporter: Koji Noguchi (knoguchi)
              Votes: 0
              Watchers: 3
