  1. Hadoop HDFS
  2. HDFS-9143

updateCountForQuota during EditLogTailer loadEdits can make the SNN time out very often

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0, 2.6.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None

      Description

      I have seen many logs from DataNodes in our cluster reporting socket timeouts when sending heartbeat or blockReceivedAndDeleted to the Standby NameNode (SNN), but this never happens with the Active NameNode.
      At first I thought it might be caused by the EditLogTailer fetching too many edits and triggering a full GC, but the GC log showed that this was not the case. So I investigated the code path and the logs, and found that it takes only a few seconds for the SNN to fetch the journal and merge it. However, if you open the SNN web page while the merge is in progress, it does not respond, much like a stop-the-world pause during a full GC, even though no GC is running at that time. I then took jstack dumps of the SNN over a period of time and found that all of the time was being consumed by the updateCountForQuota method in FSImage.
      updateCountForQuota is called every time loadEdits runs. It updates the count of each directory with a quota in the namespace, starting from ROOT, and it does so while holding the write lock of FSImage. As a result, every time the SNN merges edits from the JournalNode (JN), it effectively stops the world.
      I don't think the SNN needs to call updateCountForQuota every time it tails the edits. When it transitions to Active, it calls updateCountForQuota anyway, so no quota data is ever missed.
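
To make the cost pattern described above concrete, here is a minimal Java sketch. It is not Hadoop code: the names QuotaRecountSketch, Dir, applyEdits and transitionToActive are hypothetical, and the "namespace" is just a flat tree of directories. It only assumes what the description states, namely that a full recount from the root runs under the write lock after each batch of edits. The visited counter shows how much work is done under that lock when recounting after every batch compared with recounting once at the transition to Active.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch, not Hadoop code: cost of a full recount under a write lock. */
final class QuotaRecountSketch {
    /** Simplified stand-in for a directory in the namespace. */
    static final class Dir {
        final List<Dir> children = new ArrayList<>();
        long cachedCount; // directories in this subtree, including this one
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Dir root = new Dir();

    /** Total number of directories touched by all recounts so far. */
    public long visited;

    /** Full recount of a subtree; the cost is linear in the subtree size. */
    private long recount(Dir dir) {
        visited++;
        long count = 1;
        for (Dir child : dir.children) {
            count += recount(child);
        }
        dir.cachedCount = count;
        return count;
    }

    /** Applies one batch of edits (here: new directories under the root). */
    public void applyEdits(int newDirs, boolean recountAfter) {
        lock.writeLock().lock();
        try {
            for (int i = 0; i < newDirs; i++) {
                root.children.add(new Dir());
            }
            if (recountAfter) {
                recount(root); // readers are blocked for the whole traversal
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Single recount, standing in for the one done when becoming Active. */
    public void transitionToActive() {
        lock.writeLock().lock();
        try {
            recount(root);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reader path, standing in for requests that need the read lock. */
    public long readCount() {
        lock.readLock().lock();
        try {
            return root.cachedCount;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        // Recount after every tailed batch, as the description reports.
        QuotaRecountSketch eager = new QuotaRecountSketch();
        eager.applyEdits(1000, false); // build an initial namespace
        for (int i = 0; i < 3; i++) {
            eager.applyEdits(1, true);
        }
        System.out.println("recount per batch, dirs visited: " + eager.visited);

        // Recount only once, at the transition to Active.
        QuotaRecountSketch deferred = new QuotaRecountSketch();
        deferred.applyEdits(1000, false);
        for (int i = 0; i < 3; i++) {
            deferred.applyEdits(1, false);
        }
        deferred.transitionToActive();
        System.out.println("single recount, dirs visited: " + deferred.visited);
    }
}
```

In this sketch each one-directory batch still triggers a walk over the entire tree while the write lock is held, so the blocked time grows with the size of the namespace rather than with the size of the batch. Deferring the recount leaves cachedCount stale until the transition, which is acceptable only if nothing on the standby depends on up-to-date quota counts; that is an assumption of the sketch and would need to be checked against the real code.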

              People

              • Assignee:
                Unassigned
                Reporter:
                jiangyu1211 jiangyu
              • Votes:
                0
                Watchers:
                8