Hadoop HDFS / HDFS-3652

1.x: FSEditLog failure removes the wrong edit stream when storage dirs have same name

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.0.3, 1.1.0
    • Fix Version/s: 1.0.4
    • Component/s: namenode
    • Labels: None
    • Target Version/s: 1.0.4
    • Hadoop Flags: Reviewed

      Description

      In FSEditLog.removeEditsForStorageDir, we iterate over the edits streams trying to find the stream corresponding to a given dir. To check equality, we currently use the following condition:

            File parentDir = getStorageDirForStream(idx);
            if (parentDir.getName().equals(sd.getRoot().getName())) {
      

      ... which is horribly incorrect. If two or more storage dirs happen to have the same terminal path component (e.g. /data/1/nn and /data/2/nn) then it will pick the wrong stream(s) to remove.
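
      To see the collision concretely, here is a minimal standalone sketch (a hypothetical demo class, not the attached patch; it relies only on java.io.File):

            import java.io.File;

            public class SameNameDemo {
                public static void main(String[] args) {
                    File dir1 = new File("/data/1/nn");
                    File dir2 = new File("/data/2/nn");

                    // getName() returns only the terminal path component ("nn"),
                    // so two distinct storage dirs compare as equal -- the bug:
                    System.out.println(dir1.getName().equals(dir2.getName())); // true

                    // Comparing the full paths tells them apart -- the shape of the fix:
                    System.out.println(dir1.getPath().equals(dir2.getPath())); // false
                }
            }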

      Attachments:
      1. hdfs-3652.txt (2 kB, Todd Lipcon)

        Activity

        Matt Foley made changes -
        Status: Resolved → Closed
        Matt Foley added a comment -

        Closed upon release of Hadoop-1.0.4.

        Matt Foley made changes -
        Target Version/s: 1.0.4, 1.1.0 → 1.0.4
        Matt Foley made changes -
        Fix Version/s: 1.1.0 (removed)
        Matt Foley made changes -
        Target Version/s: 1.0.4, 1.1.0, 1.2.0 → 1.0.4, 1.1.0
        Fix Version/s: 1.2.0 (removed)
        Affects Version/s: 1.2.0 (removed)
        Matt Foley added a comment -

        Since 1.2.0 is unreleased, it is sufficient to state that it is fixed in 1.1.0.

        Todd Lipcon made changes -
        Status: Open → Resolved
        Hadoop Flags: Reviewed
        Fix Version/s: 1.0.4, 1.1.0, 1.2.0 (added)
        Resolution: Fixed
        Todd Lipcon added a comment -

        Committed to branch-1, branch-1.0, and branch-1.1.

        Todd Lipcon made changes -
        Target Version/s: 1.0.3, 1.1.0 → 1.0.4, 1.1.0, 1.2.0
        Eli Collins added a comment -

        +1 lgtm as well. Nice find.

        Aaron T. Myers added a comment -

        +1, the patch looks good to me. Great find/fix, Todd.

        Todd Lipcon made changes -
        Attachment: hdfs-3652.txt (added)
        Todd Lipcon added a comment -

        Attached patch is for branch-1.

        I modified the existing storage dir failure test so that all of the name dirs have the same name, and it started to fail. After fixing the bug, it passes.

        Todd Lipcon added a comment -

        This has data-loss implications as well. I am able to reproduce the following:

        • NN is writing to three dirs: /data/1/nn, /data/2/nn, and /data/3/nn
        • I modified the NN to inject an IOException when creating "edits.new" in /data/3/nn, which causes "removeEditsForStorageDir" to get called inside rollEditLog
        • Upon triggering a checkpoint:
          • all three logs are closed successfully
          • /data/1/nn and /data/2/nn are successfully opened for "edits.new"
          • /data/3/nn throws an IOE which gets caught. This calls removeEditsForStorageDir, which removes the wrong stream (augmented logging):
            12/07/12 16:23:54 INFO namenode.FSNamesystem: Roll Edit Log from 127.0.0.1
            12/07/12 16:23:54 INFO namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 0 0 
            12/07/12 16:23:54 WARN namenode.FSNamesystem: Removing edits stream /tmp/name1/nn/current/edits.new
            12/07/12 16:23:54 WARN common.Storage: Removing storage dir /tmp/name3/nn
            java.io.IOException: Injected fault for /tmp/name3/nn/current/edits.new
                    at org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.<init>(FSEditLog.java:146)
            
        • The NN is now only writing to /tmp/name2/nn/current/edits.new, but considers both name1 and name2 to be good from a storage-directory standpoint. However, /tmp/name1/nn/current/edits.new exists as an empty edit log file (just the header and preallocated region of 0xffs)
        • When rollFSImage is called, it successfully calls close only on the name2 log - which truncates it to the correct transaction boundary. Then it renames both name2/.../edits.new and name1/.../edits.new to edits, and opens them both for append (assuming they've been truncated to a transaction boundary).
        • The NN is now writing to name1 and name2, but name1's log looks like this:
        <valid header> <preallocated bytes of 0xffffffffffff.....> <transactions>
        
        • Upon the next checkpoint, the 2NN will likely download this log, since it's listed first in the name directory list. Upon doing so, it will see the 0xff at the head of the log and not read any of the edits, which come after all of the 0xffs (see the sketch after this list)
        • The 2NN then uploads the "merged" image back to the NN, which blows away the "edits" file. Thus, its in-memory data has gotten out of sync with the disk data, and the next time a checkpoint occurs or the NN restarts, it will fail.
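
        To make those last two bullets concrete, here is a standalone sketch (hypothetical demo code, not HDFS internals; the placeholder bytes after the 0xff run stand in for real transactions, and the only assumed fact is that the 1.x loader treats opcode 0xff, i.e. OP_INVALID = -1, as end-of-log):

            import java.io.ByteArrayInputStream;
            import java.io.DataInputStream;
            import java.io.IOException;

            public class PreallocatedLogSketch {
                // In the 1.x edit log format, opcode -1 (byte 0xff) means "no more
                // edits"; preallocated space is filled with 0xff bytes.
                static final byte OP_INVALID = -1;

                public static void main(String[] args) throws IOException {
                    // The corrupt layout from the bullets above, header already consumed:
                    // a run of preallocated 0xff bytes, then the real transactions.
                    byte[] body = {(byte) 0xff, (byte) 0xff, (byte) 0xff,
                                   /* placeholder for real opcodes */ 0x00, 0x02, 0x09};

                    DataInputStream in = new DataInputStream(new ByteArrayInputStream(body));
                    int opsRead = 0;
                    while (in.available() > 0) {
                        if (in.readByte() == OP_INVALID) {
                            break; // the loader stops here and never sees the real edits
                        }
                        opsRead++; // (a real loader would decode the op's payload here)
                    }
                    System.out.println("transactions read: " + opsRead); // prints 0
                }
            }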

        This is not an issue in trunk since the code was largely rewritten by HDFS-1073.

        The workaround for existing users is simple: rename the directories to e.g. /data/1/nn1 and /data/2/nn2. The fix is also simple. I will upload the fix this afternoon.
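
        In a 1.x hdfs-site.xml that workaround would look like the following (dfs.name.dir takes a comma-separated list of directories; the unique nn1/nn2 names are just the example paths from above):

            <!-- hdfs-site.xml: give each name dir a unique terminal component -->
            <property>
              <name>dfs.name.dir</name>
              <value>/data/1/nn1,/data/2/nn2</value>
            </property>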

        Matt Foley added a comment -

        Urk! Quite a catch. When patch available, please commit to branch-1.0 as well as branch-1.1 and branch-1.

        Todd Lipcon created issue -

          People

          • Assignee: Todd Lipcon
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 13
