  1. Hadoop HDFS
  2. HDFS-4596

Shutting down namenode during checkpointing can lead to md5sum error

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.4-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.1.0-beta
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      This is a rare error that can occur if the NN is shut down while a checkpoint is being written.

      Checkpointing and restarting nominally looks like this:

      1. FSImage is written to a tmp file and then renamed
      2. MD5 file is written to a tmp file and then renamed
      3. NN is killed and restarted
      4. NN scans storage directories and picks up the renamed image file
      5. NN validates that the image file matches its md5 file

      If the NN is killed after step 1 but before step 2 completes, this is what happens:

      1. FSImage is written to a tmp file and then renamed
      2. NN is killed and restarted (no MD5 file!)
      3. NN scans storage directories and picks up the renamed image file
      4. Since there's no matching MD5 file, NN errors out with a checksum error
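The two sequences above can be reproduced with a small standalone sketch. It does not use any Hadoop classes. The class name, method names, the `killBeforeMd5` flag, and the file names (`fsimage_1`, `fsimage_1.md5`) are illustrative assumptions, not the NameNode's actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the checkpoint save and startup validation paths.
class CheckpointRaceSketch {

    /** Hex-encoded MD5 of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Write to a tmp file next to the target, then rename it into place. */
    static void writeThenRename(Path finalPath, byte[] data) throws IOException {
        Path tmp = finalPath.resolveSibling(finalPath.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, finalPath, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Image first, MD5 second. killBeforeMd5 simulates the NN dying in between. */
    static Path saveCheckpoint(Path dir, byte[] image, boolean killBeforeMd5)
            throws IOException {
        Path imageFile = dir.resolve("fsimage_1");
        writeThenRename(imageFile, image);            // step 1
        if (killBeforeMd5) {
            return imageFile;                         // NN killed: no MD5 file written
        }
        Path md5File = dir.resolve("fsimage_1.md5");
        writeThenRename(md5File, md5Hex(image).getBytes(StandardCharsets.UTF_8)); // step 2
        return imageFile;
    }

    /** Startup check: the image must have an MD5 file and must match it. */
    static void validateOnStartup(Path imageFile) throws IOException {
        Path md5File = imageFile.resolveSibling(imageFile.getFileName() + ".md5");
        if (!Files.exists(md5File)) {
            throw new IOException("checksum error: no MD5 file for " + imageFile.getFileName());
        }
        String expected = new String(Files.readAllBytes(md5File), StandardCharsets.UTF_8).trim();
        String actual = md5Hex(Files.readAllBytes(imageFile));
        if (!expected.equals(actual)) {
            throw new IOException("checksum error: MD5 mismatch for " + imageFile.getFileName());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ckpt-sketch");
        Path image = saveCheckpoint(dir, new byte[] {1, 2, 3}, true);
        try {
            validateOnStartup(image);
            System.out.println("image validated");
        } catch (IOException e) {
            System.out.println("startup failed: " + e.getMessage());
        }
    }
}
```

With `killBeforeMd5` set to `false` the image validates. With `true` the renamed image is visible to the startup scan but has no MD5 file, so validation fails.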

      I think we can fix this by inverting either the write order (write the MD5 file before renaming the image into place) or the read order (look for the MD5 file first, then load the matching image).
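A minimal sketch of the first option, writing the MD5 file before the image is renamed into place. It illustrates the ordering only and is not necessarily what the attached patch does. The class name, method names, `killBeforeImageRename` flag, and file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: the MD5 file becomes visible before the image does.
class Md5FirstCheckpointSketch {

    /** Hex-encoded MD5 of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Write the tmp image, publish its MD5 file, and only then rename the image. */
    static void saveCheckpoint(Path dir, byte[] image, boolean killBeforeImageRename)
            throws IOException {
        Path tmpImage = dir.resolve("fsimage.ckpt_1");
        Files.write(tmpImage, image);

        Path md5Tmp = dir.resolve("fsimage_1.md5.tmp");
        Path md5File = dir.resolve("fsimage_1.md5");
        Files.write(md5Tmp, md5Hex(image).getBytes(StandardCharsets.UTF_8));
        Files.move(md5Tmp, md5File, StandardCopyOption.ATOMIC_MOVE);

        if (killBeforeImageRename) {
            return; // NN killed: an orphan MD5 file exists, but no renamed image
        }
        Files.move(tmpImage, dir.resolve("fsimage_1"), StandardCopyOption.ATOMIC_MOVE);
    }

    /** Startup scan: only fully renamed image files are candidates. */
    static List<Path> scanImages(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(p -> p.getFileName().toString().matches("fsimage_\\d+"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** True if the image has an MD5 file whose contents match the image. */
    static boolean isValid(Path imageFile) throws IOException {
        Path md5File = imageFile.resolveSibling(imageFile.getFileName() + ".md5");
        if (!Files.exists(md5File)) {
            return false;
        }
        String expected = new String(Files.readAllBytes(md5File), StandardCharsets.UTF_8).trim();
        return expected.equals(md5Hex(Files.readAllBytes(imageFile)));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ckpt-md5-first");
        saveCheckpoint(dir, new byte[] {1, 2, 3}, true);
        System.out.println("images found after simulated kill: " + scanImages(dir).size());
    }
}
```

With this ordering, a kill between the two renames leaves an MD5 file with no image. The scan never sees an image without its checksum, so it can fall back to an older checkpoint instead of failing.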

      Attachments

        1. hdfs-4596-1.patch
          5 kB
          Andrew Wang

            People

              Assignee: Andrew Wang (andrew.wang)
              Reporter: Andrew Wang (andrew.wang)
              Votes: 0
              Watchers: 8
