  1. Hadoop HDFS
  2. HDFS-4811

Race condition between two standby namenodes checkpointing with one another can delete or corrupt a good fsimage

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0-beta, 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: ha
    • Labels: None

    Description

      The problem occurs when a namenode runs its own checkpoint in StandbyCheckpointer (thread 1) while concurrently receiving a checkpoint from a different namenode in GetImageServlet (thread 2). Thread 2 can finish writing the checkpoint to the directory and then be suspended before it renames the file to its final destination as an fsimage file. Thread 1 then wakes up and starts writing its own data to the same checkpoint file. When thread 2 resumes, it tries to rename the file that thread 1 still holds open for writing. Depending on the OS, this either moves thread 1's incomplete checkpoint to fsimage, or outright deletes the existing good fsimage, leaving none in place until thread 1 finishes writing and renames its file.
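
      The interleaving above is possible only because the write and the rename are two separate steps on a shared temporary file. Below is a minimal sketch of one way to close that window, assuming both writers live in the same JVM: hold a single lock across the write and the rename. This is not the change made in Hadoop (this issue was resolved as a duplicate). The class, the lock, the method names, and the file names are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; none of these names come from Hadoop.
final class CheckpointSketch {
    // Assumption: one in-process lock shared by every checkpoint writer.
    private static final Object CHECKPOINT_LOCK = new Object();

    /** Writes data to a temporary file and renames it into place under one lock. */
    static Path saveCheckpoint(Path dir, long txid, byte[] data) {
        Path tmp = dir.resolve("fsimage.ckpt_" + txid); // illustrative temp name
        Path dst = dir.resolve("fsimage_" + txid);      // illustrative final name
        synchronized (CHECKPOINT_LOCK) {
            try {
                // No other writer can reopen tmp between this write and the rename.
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    out.write(data);
                }
                Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
                return dst;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Runs two concurrent writers for the same txid and returns the final image text. */
    static String runDemo() {
        try {
            Path dir = Files.createTempDirectory("ckpt-sketch");
            byte[] local = "image-from-local-checkpointer".getBytes(StandardCharsets.UTF_8);
            byte[] remote = "image-from-remote-namenode".getBytes(StandardCharsets.UTF_8);
            // t1 stands in for the local checkpointer, t2 for the image upload.
            Thread t1 = new Thread(() -> saveCheckpoint(dir, 42L, local));
            Thread t2 = new Thread(() -> saveCheckpoint(dir, 42L, remote));
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            return new String(Files.readAllBytes(dir.resolve("fsimage_42")),
                    StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

      Whichever thread takes the lock last determines the final contents, but the renamed file is always one writer's complete image, never a partially written one. An alternative with the same effect would be to give each writer its own uniquely named temporary file, so that a rename can never pick up a file another thread still has open.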

            People

              Assignee: Unassigned
              Reporter: Chris Nauroth (cnauroth)
              Votes: 0
              Watchers: 6