  1. Hadoop HDFS
  2. HDFS-4596

Shutting down namenode during checkpointing can lead to md5sum error

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.4-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.1.0-beta
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This is a rare error that can occur if the NameNode (NN) is shut down while a checkpoint is in progress.

      Checkpointing and restarting normally looks like this:

      1. FSImage is written to a tmp file and then renamed
      2. MD5 file is written to a tmp file and then renamed
      3. NN is killed and restarted
      4. NN scans storage directories and picks up the renamed image file
      5. NN validates that the image file matches its md5 file

      If the NN is killed before step 2 completes, this is what happens:

      1. FSImage is written to a tmp file and then renamed
      2. NN is killed and restarted (no MD5 file!)
      3. NN scans storage directories and picks up the renamed image file
      4. Since there's no matching MD5 file, NN errors out with a checksum error
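
      The validation step, and the way it fails when the MD5 file is missing, can be sketched roughly as follows. This is a simplified illustration and not the actual NameNode code: the class name ImageMd5Check, the ".md5" sidecar naming, and the bare hex-digest file format are all assumptions made for the example.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of "validate the image against its MD5 file".
final class ImageMd5Check {

    /** Returns the 32-character lowercase hex MD5 digest of a file's contents. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        return String.format("%032x", new BigInteger(1, digest));
    }

    /** Assumed naming convention: the digest lives next to the image as "<image>.md5". */
    static Path md5FileFor(Path image) {
        return image.resolveSibling(image.getFileName() + ".md5");
    }

    /** Throws IOException if the MD5 file is missing or does not match the image. */
    static void verify(Path image) throws IOException, NoSuchAlgorithmException {
        Path md5File = md5FileFor(image);
        if (!Files.exists(md5File)) {
            // This is the situation after a kill between the two renames.
            throw new IOException("No MD5 file found for " + image);
        }
        String expected =
            new String(Files.readAllBytes(md5File), StandardCharsets.UTF_8).trim();
        String actual = md5Hex(image);
        if (!expected.equals(actual)) {
            throw new IOException("MD5 mismatch for " + image
                + ": expected " + expected + ", computed " + actual);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt-demo");
        Path image = dir.resolve("fsimage_demo");
        Files.write(image, "image contents".getBytes(StandardCharsets.UTF_8));
        try {
            verify(image); // the image was renamed into place, but no MD5 file exists yet
        } catch (IOException e) {
            System.out.println("Startup would fail: " + e.getMessage());
        }
    }
}
```

      In this sketch a missing MD5 file is indistinguishable from a corrupt image, which mirrors the failure described above.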

      I think we can fix this in one of two ways: invert the write order, so that the MD5 file is in place before the image is renamed, or invert the read order, so that the NN looks for the MD5 file before picking up the image.
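
      One way to picture the inverted write order is the sketch below. It is only an illustration under stated assumptions, not the attached patch: the class Md5FirstCheckpoint, the ".ckpt" and ".md5.tmp" temporary names, and the bare hex-digest file format are all hypothetical. The point is that the image only becomes visible under its final name after its MD5 file already exists.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of writing the MD5 file before publishing the image.
final class Md5FirstCheckpoint {

    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        return String.format("%032x", new BigInteger(1, digest));
    }

    /**
     * Writes the image to a temporary file, then writes the MD5 file and renames
     * it into place. The image itself is NOT yet visible under its final name.
     * Returns the path of the temporary image.
     */
    static Path stage(Path dir, String name, byte[] image)
            throws IOException, NoSuchAlgorithmException {
        Path tmpImage = dir.resolve(name + ".ckpt");
        Files.write(tmpImage, image);

        Path tmpMd5 = dir.resolve(name + ".md5.tmp");
        Files.write(tmpMd5, md5Hex(image).getBytes(StandardCharsets.UTF_8));
        // Assumes no earlier MD5 file with this name exists in the directory.
        Files.move(tmpMd5, dir.resolve(name + ".md5"), StandardCopyOption.ATOMIC_MOVE);
        return tmpImage;
    }

    /** Renames the staged image to its final name; this is the last step. */
    static Path publish(Path tmpImage, Path dir, String name) throws IOException {
        return Files.move(tmpImage, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt-demo");
        byte[] image = "image contents".getBytes(StandardCharsets.UTF_8);
        Path staged = stage(dir, "fsimage_demo", image);
        // A kill at this point leaves an MD5 file with no image under the final
        // name, so a scan for image files finds nothing new to validate.
        Path published = publish(staged, dir, "fsimage_demo");
        System.out.println("Published " + published.getFileName());
    }
}
```

      With this ordering, the worst case after a kill is a stray MD5 file or temporary image, which a directory scan that keys on final image names can ignore.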

        Attachments

        1. hdfs-4596-1.patch
          5 kB
          Andrew Wang

              People

              • Assignee:
                Andrew Wang (andrew.wang)
              • Reporter:
                Andrew Wang (andrew.wang)
              • Votes:
                0
              • Watchers:
                8
