  1. Hadoop HDFS
  2. HDFS-9293

FSEditLog's thread-local 'OpInstanceCache' holds a dirty 'rpcId', which may leave the standby NN too busy to communicate

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0, 2.7.1
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None

      Description

      In our cluster (Hadoop 2.2.0 with HA, 700+ DNs), we found that the standby NN tails the edit log slowly and holds the FSNamesystem write lock while doing so, which blocks the DNs' heartbeat and block report IPC requests. As a result, the active NN removes stale DNs that cannot send heartbeats because they are blocked while processing the standby NN's register command (fixed in 2.7.1).

      Below is the standby NN stack:

      "Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable [0x00007f0dd1d76000]
      java.lang.Thread.State: RUNNABLE
      at java.util.PriorityQueue.remove(PriorityQueue.java:360)
      at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
      at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
      - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
      at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
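
The top frames of the stack show a cache put ending in java.util.PriorityQueue.remove, which the JDK documents as a linear-time operation. The following is a minimal sketch of that cost pattern, not Hadoop's LightWeightCache: the class RetryCacheSketch, its fields, and the removal counter are hypothetical names introduced only for illustration, and the sketch keys entries by rpcId alone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a retry cache: a lookup map plus an expiry-ordered queue. */
class RetryCacheSketch {
    /** Cache entry keyed by rpcId only (an assumption made to keep the sketch small). */
    static final class Entry implements Comparable<Entry> {
        final int rpcId;
        final long expiry;

        Entry(int rpcId, long expiry) {
            this.rpcId = rpcId;
            this.expiry = expiry;
        }

        @Override
        public int compareTo(Entry other) {
            return Long.compare(expiry, other.expiry);
        }
    }

    private final Map<Integer, Entry> byRpcId = new HashMap<>();
    private final PriorityQueue<Entry> byExpiry = new PriorityQueue<>();
    private int queueRemovals = 0;

    /** Adds an entry; a repeated rpcId first evicts the previous entry from the queue. */
    void put(int rpcId, long expiry) {
        Entry fresh = new Entry(rpcId, expiry);
        Entry previous = byRpcId.put(rpcId, fresh);
        if (previous != null) {
            // PriorityQueue.remove(Object) searches the backing array: O(N) per call.
            byExpiry.remove(previous);
            queueRemovals++;
        }
        byExpiry.offer(fresh);
    }

    int size() {
        return byExpiry.size();
    }

    int queueRemovals() {
        return queueRemovals;
    }
}
```

In this sketch, puts with distinct rpcIds never touch the linear-time path, while every put that repeats an rpcId pays for a scan of the whole queue. With a large cache and many repeated IDs, that scan would dominate the time spent under the lock.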

      When an edit log op is applied and a matching entry is already present in the IPC retry cache, the previous entry must be removed from the priority queue, which is O(N). An update-blocks operation does not need to record an rpcId in the edit log, except for a client's 'updatePipeline' request, but we found many 'UpdateBlocksOp' entries carrying a repeated rpcId.
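
One way a repeated rpcId can end up on operations that never set one is reuse of a thread-local op instance without clearing its fields. The sketch below illustrates that general hazard under stated assumptions; it is not the FSEditLog or OpInstanceCache code. The class OpReuseSketch, the Op type, the sentinel INVALID_RPC_ID, and all method names are hypothetical.

```java
/** Hypothetical illustration of a stale field surviving in a thread-local reused object. */
class OpReuseSketch {
    /** Hypothetical sentinel meaning "no rpcId recorded". */
    static final int INVALID_RPC_ID = -2;

    /** Minimal reusable op with the two fields the sketch needs. */
    static final class Op {
        int rpcId = INVALID_RPC_ID;
        String payload;

        void reset() {
            rpcId = INVALID_RPC_ID;
            payload = null;
        }
    }

    // One cached Op per thread, so no allocation is needed per logged operation.
    private static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    /** Client-initiated op: records the caller's rpcId on the cached instance. */
    static int logClientOp(int rpcId) {
        Op op = CACHE.get();
        op.rpcId = rpcId;
        op.payload = "client updatePipeline";
        return op.rpcId;
    }

    /** Internal op: sets no rpcId, so whatever the cached instance holds is logged. */
    static int logInternalOp(boolean resetFirst) {
        Op op = CACHE.get();
        if (resetFirst) {
            op.reset(); // clears the rpcId left behind by the previous user
        }
        op.payload = "internal block update";
        return op.rpcId;
    }
}
```

Calling logClientOp(42) followed by logInternalOp(false) on the same thread returns 42, so the internal op inherits the earlier client's rpcId. With resetFirst set to true it returns INVALID_RPC_ID instead. Resetting on every fetch is one possible design choice; another is to have each caller assign every field explicitly.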

              People

              • Assignee: Deng FEI
              • Reporter: Deng FEI
              • Votes: 0
              • Watchers: 2
