  1. Hadoop HDFS
  2. HDFS-7609

Avoid retry cache collision when Standby NameNode loading edits

Details

    • Reviewed

    Description

      One day my NameNode crashed because two JournalNodes timed out at the same time under very high load, leaving about 100 million transactions behind in the edit log. (I still have no idea why they were not rolled into the fsimage.)
      I tried to restart the NameNode, but it showed that almost 20 hours would be needed to finish, and it spent most of that time loading edits. I also tried restarting the NameNode in recovery mode, but the loading speed was no different.
      I looked at the stack trace and judged that the slowdown was caused by the retry cache. So I set dfs.namenode.enable.retrycache to false, and the restart finished in half an hour.
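
      For reference, the workaround described above can be sketched as a configuration fragment. The property name comes from this report; placing it in the NameNode's hdfs-site.xml is an assumption, since the report does not say where the value was set. Note that the retry cache exists so that retried non-idempotent RPCs are answered consistently, so turning it off is probably best treated as a temporary measure for getting through a long edit log replay.

      ```xml
      <!-- Assumed location: hdfs-site.xml on the NameNode being restarted. -->
      <property>
        <name>dfs.namenode.enable.retrycache</name>
        <!-- Disables the NameNode retry cache; consider restoring the default once startup completes. -->
        <value>false</value>
      </property>
      ```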

      I think the retry cache is useless during startup, or at least during the recovery process.

      Attachments

        1. HDFS-7609.patch
          21 kB
          Ming Ma
        2. HDFS-7609-2.patch
          27 kB
          Ming Ma
        3. HDFS-7609-3.patch
          22 kB
          Ming Ma
        4. HDFS-7609-branch-2.7.2.txt
          23 kB
          Vinod Kumar Vavilapalli
        5. HDFS-7609-CreateEditsLogWithRPCIDs.patch
          2 kB
          Chris Nauroth
        6. recovery_do_not_use_retrycache.patch
          2 kB
          Carrey Zhan

            People

              mingma Ming Ma
              CarreyZhan Carrey Zhan
              Votes: 0
              Watchers: 18
