Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623)
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha, namenode
    • Labels:
      None

      Description

      Currently, the NN logs its edits to each of its edits directories in sequence. This can produce the following bad sequence:

      • NN accumulates 100 edits (tx 1-100) in the buffer. Writes and syncs to local drive, then crashes
      • Failover occurs. SBN takes over at txid=1, since txid 1 never got writen.
      • First NN restarts. It reads up to txid 100 from its local directories. It is now "ahead" of the active NN with inconsistent state.
        The solution is to write to the shared edits dir, and sync that, before writing to any local drives.
      1. hdfs-2874.txt
        24 kB
        Todd Lipcon
      2. hdfs-2874.txt
        24 kB
        Todd Lipcon
      3. hdfs-2874.txt
        9 kB
        Todd Lipcon

        Issue Links

          Activity

          Hide
          Todd Lipcon added a comment -

          Attached patch fixes this issue so long as the shared dirs are marked "required". The shared/non-shared distinction doesn't make it down into JournalSet, so went with the required/non-required distinction, which I think makes sense anyway. Once we implement HDFS-2769, we'll ensure that shared dirs are always required.

          Show
          Todd Lipcon added a comment - Attached patch fixes this issue so long as the shared dirs are marked "required". The shared/non-shared distinction doesn't make it down into JournalSet, so went with the required/non-required distinction, which I think makes sense anyway. Once we implement HDFS-2769 , we'll ensure that shared dirs are always required.
          Hide
          Aaron T. Myers added a comment -

          It occurs to me that using the "required" dirs as a proxy for the shared dirs isn't necessarily correct. It's true that the shared dir should always be marked required, but an admin might also configure some local or other dirs to be required. Were they to do this, the sync order would again be undefined, and the shared dir might not get logged first, and thus we might hit this case.

          So, it seems like we either need to push down the shared/non-shared distinction into the JournalSet, or we need to somehow push down a generic "sync order" concept into the journal set. Or, perhaps, guarantee that the journalset syncs to journals in the order given at initialization.

          Show
          Aaron T. Myers added a comment - It occurs to me that using the "required" dirs as a proxy for the shared dirs isn't necessarily correct. It's true that the shared dir should always be marked required, but an admin might also configure some local or other dirs to be required. Were they to do this, the sync order would again be undefined, and the shared dir might not get logged first, and thus we might hit this case. So, it seems like we either need to push down the shared/non-shared distinction into the JournalSet, or we need to somehow push down a generic "sync order" concept into the journal set. Or, perhaps, guarantee that the journalset syncs to journals in the order given at initialization.
          Hide
          Uma Maheswara Rao G added a comment -

          It occurs to me that using the "required" dirs as a proxy for the shared dirs isn't necessarily correct. It's true that the shared dir should always be marked required, but an admin might also configure some local or other dirs to be required.

          I too agree with Aaron on this point.

          I am just thinking, can't we maintain priority level based lists, while updating jouranal itself we can update in that order. Also we can provide iterator based on priority level. Always shared jouranals should have high priority. Journals which are required and non shared can be the next priority..etc?

          Show
          Uma Maheswara Rao G added a comment - It occurs to me that using the "required" dirs as a proxy for the shared dirs isn't necessarily correct. It's true that the shared dir should always be marked required, but an admin might also configure some local or other dirs to be required. I too agree with Aaron on this point. I am just thinking, can't we maintain priority level based lists, while updating jouranal itself we can update in that order. Also we can provide iterator based on priority level. Always shared jouranals should have high priority. Journals which are required and non shared can be the next priority..etc?
          Hide
          Todd Lipcon added a comment -

          Yes, we could certainly attach a priority level to each journal. But I'm trying to think of a solution that won't add complexity to configuration, which is already getting reasonably hairy here... I'll sleep on it

          Show
          Todd Lipcon added a comment - Yes, we could certainly attach a priority level to each journal. But I'm trying to think of a solution that won't add complexity to configuration, which is already getting reasonably hairy here... I'll sleep on it
          Hide
          Hari Mankude added a comment -

          Adding any type of write ordering to edit log updates will increase commit latency. For example, writing to shared edit log first, waiting for this to complete and then writing to local dirs will double the write latency.

          A better design is to write all the locations in parallel and then make a policy decision during recovery as to go with higher txid or the lower one.

          Show
          Hari Mankude added a comment - Adding any type of write ordering to edit log updates will increase commit latency. For example, writing to shared edit log first, waiting for this to complete and then writing to local dirs will double the write latency. A better design is to write all the locations in parallel and then make a policy decision during recovery as to go with higher txid or the lower one.
          Hide
          Todd Lipcon added a comment -

          Adding any type of write ordering to edit log updates will increase commit latency. For example, writing to shared edit log first, waiting for this to complete and then writing to local dirs will double the write latency.

          We already do write to each volume serially, so the latency is no different. I agree there is a potential optimization to do them in parallel, but since we've been living fine without for years, I don't think it's on the critical path for most installations.

          A better design is to write all the locations in parallel and then make a policy decision during recovery as to go with higher txid or the lower one.

          Except that the remote NN can't see the local edit dirs, so no such policy exists.

          Show
          Todd Lipcon added a comment - Adding any type of write ordering to edit log updates will increase commit latency. For example, writing to shared edit log first, waiting for this to complete and then writing to local dirs will double the write latency. We already do write to each volume serially, so the latency is no different. I agree there is a potential optimization to do them in parallel, but since we've been living fine without for years, I don't think it's on the critical path for most installations. A better design is to write all the locations in parallel and then make a policy decision during recovery as to go with higher txid or the lower one. Except that the remote NN can't see the local edit dirs, so no such policy exists.
          Hide
          Todd Lipcon added a comment -

          Attached patch fixes the above identified issue:

          • changes getNamespaceEditsDirs to return a List rather than a Collection, where the ordering is prescribed to be shared dirs followed by non-shared dirs, and maintaining order to be the same as specified in the config.
          • tracks through that type change - even though we could pass around the list as a Collection, I thought it better to be explicit that the ordering here is important, so changed to List throughout.
          • adds a unit test for this new ordering guarantee
          Show
          Todd Lipcon added a comment - Attached patch fixes the above identified issue: changes getNamespaceEditsDirs to return a List rather than a Collection, where the ordering is prescribed to be shared dirs followed by non-shared dirs, and maintaining order to be the same as specified in the config. tracks through that type change - even though we could pass around the list as a Collection, I thought it better to be explicit that the ordering here is important, so changed to List throughout. adds a unit test for this new ordering guarantee
          Hide
          Todd Lipcon added a comment -

          I ran all the tests which reference EditLog and they passed:
          TestOfflineEditsViewer,TestHDFSConcat,TestEditLogRace,TestNameEditsConfigs,TestSaveNamespace,TestEditLogFileOutputStream,TestFileJournalManager,TestCheckpoint,TestEditLog,TestFSEditLogLoader,TestFsLimits,TestSecurityTokenEditLog,TestStorageRestore,TestBackupNode,TestEditLogJournalFailures,TestEditLogTailer,TestEditLogsDuringFailover,TestFailureToReadEdits,TestHASafeMode,TestHAStateTransitions,TestFailureOfSharedDir,TestDNFencing,TestStandbyIsHot,TestGenericJournalConf,TestCheckPointForSecurityTokens,TestNNStorageRetentionManager,TestPersistBlocks,TestPBHelper,TestNNLeaseRecovery

          Show
          Todd Lipcon added a comment - I ran all the tests which reference EditLog and they passed: TestOfflineEditsViewer,TestHDFSConcat,TestEditLogRace,TestNameEditsConfigs,TestSaveNamespace,TestEditLogFileOutputStream,TestFileJournalManager,TestCheckpoint,TestEditLog,TestFSEditLogLoader,TestFsLimits,TestSecurityTokenEditLog,TestStorageRestore,TestBackupNode,TestEditLogJournalFailures,TestEditLogTailer,TestEditLogsDuringFailover,TestFailureToReadEdits,TestHASafeMode,TestHAStateTransitions,TestFailureOfSharedDir,TestDNFencing,TestStandbyIsHot,TestGenericJournalConf,TestCheckPointForSecurityTokens,TestNNStorageRetentionManager,TestPersistBlocks,TestPBHelper,TestNNLeaseRecovery
          Hide
          Hari Mankude added a comment -

          Can you run the test where this happens?

          1. NN is writing to shared edits and local edits. (With this fix, shared edits is written first)
          2. Lets say, transaction with txid 100 was written to shared edits and then NN died. txid 100 was not written to local edit dir. So 99 is the last txid recorded in the local edit dir.
          3. Same NN is restarted.
          4. Does it restart the next txid from 101 or 100? If it starts from 101, local edits dir will have a gap. If it starts from 100, then last transaction in shared edits dir will be invalid.

          Show
          Hari Mankude added a comment - Can you run the test where this happens? 1. NN is writing to shared edits and local edits. (With this fix, shared edits is written first) 2. Lets say, transaction with txid 100 was written to shared edits and then NN died. txid 100 was not written to local edit dir. So 99 is the last txid recorded in the local edit dir. 3. Same NN is restarted. 4. Does it restart the next txid from 101 or 100? If it starts from 101, local edits dir will have a gap. If it starts from 100, then last transaction in shared edits dir will be invalid.
          Hide
          Todd Lipcon added a comment -

          It should restart from 101 since, when it starts up, it reads the highest txids available. Having a gap in the local edits is fine/expected - you can always have gaps in individual edits dirs so long as the union of edits dirs makes a complete sequence.

          I will try to run such a manual test using a mdadm faulty mount.

          Show
          Todd Lipcon added a comment - It should restart from 101 since, when it starts up, it reads the highest txids available. Having a gap in the local edits is fine/expected - you can always have gaps in individual edits dirs so long as the union of edits dirs makes a complete sequence. I will try to run such a manual test using a mdadm faulty mount.
          Hide
          Aaron T. Myers added a comment -

          Patch largely looks good, Todd. A few comments/questions:

          1. Maybe add a comment mentioning that we use LinkedHashSet since it provides a predictable-order implementation of the Set interface?
          2. "we need to make sure all edits are on place" - s/on/in/g
          3. Why the call to Lists.newArrayList in getNamespaceEditsDirs?
          4. Looks like you retained the functionality wherein required journals are operated on first, which should no longer be necessary, right? It should be OK as you have it, though, since the shared edits dir is automatically marked required, and therefore will necessarily be operated on before all others (required or non-required.)
          5. I don't follow why the changes in GenericTestUtils were necessary.
          Show
          Aaron T. Myers added a comment - Patch largely looks good, Todd. A few comments/questions: Maybe add a comment mentioning that we use LinkedHashSet since it provides a predictable-order implementation of the Set interface? "we need to make sure all edits are on place" - s/on/in/g Why the call to Lists.newArrayList in getNamespaceEditsDirs? Looks like you retained the functionality wherein required journals are operated on first, which should no longer be necessary, right? It should be OK as you have it, though, since the shared edits dir is automatically marked required, and therefore will necessarily be operated on before all others (required or non-required.) I don't follow why the changes in GenericTestUtils were necessary.
          Hide
          Todd Lipcon added a comment -

          Maybe add a comment mentioning that we use LinkedHashSet since it provides a predictable-order implementation of the Set interface?

          "we need to make sure all edits are on place" - s/on/in/g

          Fixed

          Why the call to Lists.newArrayList in getNamespaceEditsDirs?

          The function returns a List, so we have to copy out of the LinkedHashSet to a new List object there

          Looks like you retained the functionality wherein required journals are operated on first, which should no longer be necessary, right? It should be OK as you have it, though, since the shared edits dir is automatically marked required, and therefore will necessarily be operated on before all others (required or non-required.)

          Fixed in the new revision.

          I don't follow why the changes in GenericTestUtils were necessary.

          A bug fix that I noticed while working on the unit tests - but you're right, they're unrelated to this JIRA. I removed from this patch.

          Show
          Todd Lipcon added a comment - Maybe add a comment mentioning that we use LinkedHashSet since it provides a predictable-order implementation of the Set interface? "we need to make sure all edits are on place" - s/on/in/g Fixed Why the call to Lists.newArrayList in getNamespaceEditsDirs? The function returns a List, so we have to copy out of the LinkedHashSet to a new List object there Looks like you retained the functionality wherein required journals are operated on first, which should no longer be necessary, right? It should be OK as you have it, though, since the shared edits dir is automatically marked required, and therefore will necessarily be operated on before all others (required or non-required.) Fixed in the new revision. I don't follow why the changes in GenericTestUtils were necessary. A bug fix that I noticed while working on the unit tests - but you're right, they're unrelated to this JIRA. I removed from this patch.
          Hide
          Aaron T. Myers added a comment -

          Thanks for the explanations, Todd.

          +1, the latest patch looks good to me.

          Show
          Aaron T. Myers added a comment - Thanks for the explanations, Todd. +1, the latest patch looks good to me.
          Hide
          Todd Lipcon added a comment -

          I ran two manual tests using mdadm fault injection:

          Test 1:

          • Set up the shared edits dir in an HA setup on the faulty mount
          • started both NNs, put NN1 in active
          • turned the mount to "write-all" fault mode (block further writes)
          • issued a "touchz" command to the FS. The NN crashed with the following:
            12/02/03 18:12:51 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=FileJournalManager(root=/mnt/todd/name-shared), stream=org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream@2ad1918a))
            java.io.IOException: Input/output error
            ...
            12/02/03 18:12:51 FATAL namenode.FSEditLog: Could not sync enough journals to persistent storage. Unsynced transactions: 2
            
          • Verified using xxd that the transaction was not written to the local storage directories either.

          Test 2:

          • Set up a non-HA NN with multiple directories, one of which was on the faulty storage
          • Set the mount to block all writes
          • Issued a touchz command
          • NN correctly disabled the bad mount:
            12/02/03 18:20:11 ERROR namenode.FSEditLog: Error: flush failed for (journal JournalAndStream(mgr=FileJournalManager(root=/mnt/todd/name), stream=org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream@7b0b23cf))
            

            ... and kept running as expected

          • Shut down the NN
          • Cleared the fault and remounted
          • Verified with xxd that the edit was persisted to the non-faulty directories
          • Restarted NN. Verified existence of the file that I had touched. Startup messages included:
            12/02/03 18:27:18 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/name1-name/current
            12/02/03 18:27:18 INFO namenode.FileJournalManager: Finalizing edits file /tmp/name1-name/current/edits_inprogress_0000000000000000001 -> /tmp/name1-name/current/edits_0000000000000000001-0000000000000000006
            12/02/03 18:27:18 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/name1-name2/current
            12/02/03 18:27:18 INFO namenode.FileJournalManager: Finalizing edits file /tmp/name1-name2/current/edits_inprogress_0000000000000000001 -> /tmp/name1-name2/current/edits_0000000000000000001-0000000000000000006
            12/02/03 18:27:18 INFO namenode.FileJournalManager: Recovering unfinalized segments in /mnt/todd/name/current
            12/02/03 18:27:18 INFO namenode.FileJournalManager: Finalizing edits file /mnt/todd/name/current/edits_inprogress_0000000000000000001 -> /mnt/todd/name/current/edits_0000000000000000001-0000000000000000003
            

            (/mnt/todd was the mount which had died - hence it had fewer edits)

          Show
          Todd Lipcon added a comment - I ran two manual tests using mdadm fault injection: Test 1: Set up the shared edits dir in an HA setup on the faulty mount started both NNs, put NN1 in active turned the mount to "write-all" fault mode (block further writes) issued a "touchz" command to the FS. The NN crashed with the following: 12/02/03 18:12:51 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=FileJournalManager(root=/mnt/todd/name-shared), stream=org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream@2ad1918a)) java.io.IOException: Input/output error ... 12/02/03 18:12:51 FATAL namenode.FSEditLog: Could not sync enough journals to persistent storage. Unsynced transactions: 2 Verified using xxd that the transaction was not written to the local storage directories either. Test 2: Set up a non-HA NN with multiple directories, one of which was on the faulty storage Set the mount to block all writes Issued a touchz command NN correctly disabled the bad mount: 12/02/03 18:20:11 ERROR namenode.FSEditLog: Error: flush failed for (journal JournalAndStream(mgr=FileJournalManager(root=/mnt/todd/name), stream=org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream@7b0b23cf)) ... and kept running as expected Shut down the NN Cleared the fault and remounted Verified with xxd that the edit was persisted to the non-faulty directories Restarted NN. Verified existence of the file that I had touched. Startup messages included: 12/02/03 18:27:18 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/name1-name/current 12/02/03 18:27:18 INFO namenode.FileJournalManager: Finalizing edits file /tmp/name1-name/current/edits_inprogress_0000000000000000001 -> /tmp/name1-name/current/edits_0000000000000000001-0000000000000000006 12/02/03 18:27:18 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/name1-name2/current 12/02/03 18:27:18 INFO namenode.FileJournalManager: Finalizing edits file /tmp/name1-name2/current/edits_inprogress_0000000000000000001 -> /tmp/name1-name2/current/edits_0000000000000000001-0000000000000000006 12/02/03 18:27:18 INFO namenode.FileJournalManager: Recovering unfinalized segments in /mnt/todd/name/current 12/02/03 18:27:18 INFO namenode.FileJournalManager: Finalizing edits file /mnt/todd/name/current/edits_inprogress_0000000000000000001 -> /mnt/todd/name/current/edits_0000000000000000001-0000000000000000003 (/mnt/todd was the mount which had died - hence it had fewer edits)
          Hide
          Todd Lipcon added a comment -

          Committed to HA branch, thanks for the reviews, all.

          Show
          Todd Lipcon added a comment - Committed to HA branch, thanks for the reviews, all.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #68 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/68/)
          HDFS-2874. Edit log should log to shared dirs before local dirs. Contributed by Todd Lipcon.

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1240445
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Util.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestClusterId.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogJournalFailures.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestFailureOfSharedDir.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #68 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/68/ ) HDFS-2874 . Edit log should log to shared dirs before local dirs. Contributed by Todd Lipcon. todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1240445 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Util.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestClusterId.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogJournalFailures.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestFailureOfSharedDir.java

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development