Hadoop HDFS / HDFS-3519

Checkpoint upload may interfere with a concurrent saveNamespace



    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.7.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed


      TestStandbyCheckpoints failed in precommit build 2620 due to the following issue:

      • Both nodes were in Standby state and configured to checkpoint "as fast as possible".
      • NN1 starts to save its own namespace.
      • NN2 starts to upload a checkpoint for the same txid. Both threads are now writing to the same file, fsimage.ckpt_12, but the actual file contents correspond to the uploading thread's data.
      • NN1 finishes its saveNamespace operation while NN2 is still uploading, so it renames the ckpt file. However, the file is still empty, since NN2 hasn't sent any bytes yet.
      • NN2 finishes the upload, and its rename() call fails, which causes the directory to be marked failed, etc.

      The result is a file, fsimage_12, that appears to be a finalized image but is in fact incompletely transferred. When the transfer completes, the problem "heals itself", so there wouldn't be persistent corruption unless the machine crashes at the same time. And even then, we'd still have the earlier checkpoint to restore from.
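
The interleaving above can be replayed with ordinary local files. The sketch below is not NameNode code: the class CheckpointRaceSketch and its replay method are hypothetical names, only the uploader's stream is modelled (NN1's saveNamespace is reduced to its rename), and it assumes a POSIX-style filesystem where a file can be renamed while it is still open for writing.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Replays the reported interleaving with plain local files (hypothetical, not HDFS code). */
final class CheckpointRaceSketch {

    /** What an observer would have seen at each step of the replay. */
    static final class Outcome {
        final long sizeAfterSaverRename;    // size of fsimage_12 right after NN1's rename
        final boolean uploaderRenameFailed; // whether the uploader's own rename failed
        final long sizeAfterUpload;         // size of fsimage_12 once the upload finished

        Outcome(long sizeAfterSaverRename, boolean uploaderRenameFailed, long sizeAfterUpload) {
            this.sizeAfterSaverRename = sizeAfterSaverRename;
            this.uploaderRenameFailed = uploaderRenameFailed;
            this.sizeAfterUpload = sizeAfterUpload;
        }
    }

    static Outcome replay(Path dir, byte[] image) throws IOException {
        Path ckpt = dir.resolve("fsimage.ckpt_12");
        Path finalized = dir.resolve("fsimage_12");
        long sizeAfterSaverRename;
        boolean uploaderRenameFailed = false;

        // NN2's upload opens the shared temp file; no bytes have arrived yet.
        try (OutputStream upload = Files.newOutputStream(ckpt)) {
            // NN1's saveNamespace finishes first and renames the shared temp file.
            Files.move(ckpt, finalized);
            // fsimage_12 now looks finalized but is still empty.
            sizeAfterSaverRename = Files.size(finalized);
            // The upload keeps writing through its open stream, into the renamed file.
            upload.write(image);
        }

        // NN2's upload finishes and tries its own rename; the temp file is gone.
        try {
            Files.move(ckpt, finalized);
        } catch (IOException e) {
            uploaderRenameFailed = true; // in the report, this marks the directory failed
        }
        return new Outcome(sizeAfterSaverRename, uploaderRenameFailed, Files.size(finalized));
    }
}
```

Calling replay on an empty temporary directory shows the window in which fsimage_12 exists with zero bytes, followed by the failed second rename and the final file holding the full upload, which is the "heals itself" behaviour described above.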

      I believe this same race could also occur in a non-HA setup if a user puts the NN in safe mode and issues saveNamespace operations while a 2NN is checkpointing.


        1. test-output.txt
          144 kB
          Todd Lipcon
        2. HDFS-3519-branch-2.patch
          7 kB
          Ming Ma
        3. HDFS-3519-3.patch
          7 kB
          Ming Ma
        4. HDFS-3519-2.patch
          6 kB
          Ming Ma
        5. HDFS-3519.patch
          6 kB
          Ming Ma



            Assignee: Ming Ma (mingma)
            Reporter: Todd Lipcon (tlipcon)