Hadoop HDFS / HDFS-3077

Quorum-based protocol for reading and writing edit logs

    Details

      Description

      Currently, one of the weak points of the HA design is that it relies on shared storage such as an NFS filer for the shared edit log. One alternative that has been proposed is to depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated edit log on commodity hardware. This JIRA is to implement another alternative, based on a quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only by HDFS's needs rather than more generic use cases. More details to follow.
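
      The core of the quorum approach is that the active NameNode sends each batch of edits to an odd number of journal daemons in parallel and treats the batch as committed once a majority of them acknowledge it, so a minority of slow or failed daemons cannot block the log. The sketch below is a minimal illustration of that idea only; JournalDaemonClient, QuorumCommitSketch, and commit() are hypothetical names made up for this example, not the QuorumJournalManager API described in the attached design documents.

      import java.util.List;
      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicInteger;

      /**
       * Toy illustration of the quorum-commit idea: a batch of edits is sent to an
       * odd number of journal daemons and is considered durable once a majority of
       * them acknowledge it. JournalDaemonClient is a hypothetical interface, not
       * the HDFS QJournalProtocol.
       */
      public class QuorumCommitSketch {

        /** Hypothetical stand-in for an RPC channel to one journal daemon. */
        public interface JournalDaemonClient {
          /** Persist the given edits; return normally on success, throw on failure. */
          void journal(long firstTxId, byte[] edits) throws Exception;
        }

        private final List<JournalDaemonClient> daemons;
        private final ExecutorService executor;

        public QuorumCommitSketch(List<JournalDaemonClient> daemons) {
          this.daemons = daemons;
          this.executor = Executors.newFixedThreadPool(daemons.size());
        }

        /**
         * Send the batch to every daemon in parallel and wait until a majority
         * (for 2f+1 daemons, f+1 of them) have acknowledged, or the timeout expires.
         */
        public boolean commit(long firstTxId, byte[] edits, long timeoutMs)
            throws InterruptedException {
          int majority = daemons.size() / 2 + 1;
          CountDownLatch acks = new CountDownLatch(majority);
          AtomicInteger failures = new AtomicInteger();

          for (JournalDaemonClient daemon : daemons) {
            executor.submit(() -> {
              try {
                daemon.journal(firstTxId, edits);
                acks.countDown();                 // one more successful ack
              } catch (Exception e) {
                failures.incrementAndGet();       // this daemon is lagging or down
              }
            });
          }
          // Durable once a majority have acknowledged; slow daemons can catch up later.
          return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
      }

      With three journal daemons, for example, the write succeeds as long as any two of them acknowledge it, so a single slow or failed daemon never stalls the NameNode.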

      Attachments

      1. qjournal-design.tex
        43 kB
        Todd Lipcon
      2. qjournal-design.tex
        48 kB
        Todd Lipcon
      3. qjournal-design.pdf
        229 kB
        Todd Lipcon
      4. qjournal-design.pdf
        251 kB
        Todd Lipcon
      5. qjournal-design.pdf
        275 kB
        Todd Lipcon
      6. qjournal-design.pdf
        285 kB
        Todd Lipcon
      7. qjournal-design.pdf
        287 kB
        Todd Lipcon
      8. qjournal-design.pdf
        293 kB
        Todd Lipcon
      9. hdfs-3077-test-merge.txt
        525 kB
        Todd Lipcon
      10. hdfs-3077-partial.txt
        110 kB
        Todd Lipcon
      11. hdfs-3077-branch-2.txt
        527 kB
        Todd Lipcon
      12. hdfs-3077.txt
        209 kB
        Todd Lipcon
      13. hdfs-3077.txt
        214 kB
        Todd Lipcon
      14. hdfs-3077.txt
        231 kB
        Todd Lipcon
      15. hdfs-3077.txt
        239 kB
        Todd Lipcon
      16. hdfs-3077.txt
        239 kB
        Todd Lipcon
      17. hdfs-3077.txt
        239 kB
        Todd Lipcon
      18. hdfs-3077.txt
        239 kB
        Todd Lipcon

        Issue Links

        1.
        Upgrade guava to 11.0.2 Sub-task Resolved Todd Lipcon
         
        2.
        Add infrastructure for waiting for a quorum of ListenableFutures to respond (see the sketch after this list) Sub-task Resolved Todd Lipcon
         
        3.
        Add preliminary QJournalProtocol interface, translators Sub-task Resolved Todd Lipcon
         
        4.
        Simple refactors in existing NN code to assist QuorumJournalManager extension Sub-task Closed Todd Lipcon
         
        5.
        Allow EditLogFileInputStream to read from a remote URL Sub-task Closed Todd Lipcon
         
        6.
        Supply NamespaceInfo when instantiating JournalManagers Sub-task Closed Todd Lipcon
         
        7.
        Active NN should exit when it cannot write to quorum number of Journal Daemons Sub-task Resolved Unassigned
         
        8.
        Add class to manage JournalList Sub-task Resolved Unassigned
         
        9.
        QJM: support purgeEditLogs() call to remotely purge logs Sub-task Resolved Todd Lipcon
         
        10.
        QJM: JNStorage should read its storage info even before a writer becomes active Sub-task Resolved Todd Lipcon
         
        11.
        QJM: Fix getEditLogManifest to fetch httpPort if necessary Sub-task Resolved Todd Lipcon
         
        12.
        Genericize format() to non-file JournalManagers Sub-task Closed Todd Lipcon
         
        13.
        Fix QJM startup when individual JNs have gaps Sub-task Resolved Todd Lipcon
         
        14.
        QJM: if a logger misses an RPC, don't retry that logger until next segment Sub-task Resolved Todd Lipcon
         
        15.
        QJM: exhaustive failure injection test for skipped RPCs Sub-task Resolved Todd Lipcon
         
        16. QJM: improve formatting behavior for JNs Sub-task Open Todd Lipcon
         
        17.
        JournalManager#format() should be able to throw IOException Sub-task Closed Ivan Kelly
         
        18.
        Implement genericized format() in QJM Sub-task Resolved Todd Lipcon
         
        19.
        QJM: validate journal dir at startup Sub-task Resolved Todd Lipcon
         
        20.
        QJM: add segment txid as a parameter to journal() RPC Sub-task Resolved Todd Lipcon
         
        21.
        Avoid throwing NPE when finalizeSegment() is called on invalid segment Sub-task Resolved Todd Lipcon
         
        22.
        QJM: handle empty log segments during recovery Sub-task Resolved Todd Lipcon
         
        23.
        QJM: improvements to QJM fault testing Sub-task Resolved Todd Lipcon
         
        24.
        QJM: hadoop-daemon.sh should be updated to accept "journalnode" Sub-task Resolved Eli Collins
         
        25.
        Fixes for edge cases in QJM recovery protocol Sub-task Resolved Todd Lipcon
         
        26. QJM: implement md5sum verification Sub-task Open Todd Lipcon
         
        27. QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics Sub-task Open Unassigned
         
        28.
        QJM: track last "committed" txid Sub-task Resolved Todd Lipcon
         
        29. QJM: Support rolling restart of JNs Sub-task Open Todd Lipcon
         
        30.
        QJM: expose non-file journal manager details in web UI Sub-task Resolved Todd Lipcon
         
        31.
        QJM: add metrics to JournalNode Sub-task Resolved Todd Lipcon
         
        32.
        QJM: Provide defaults for dfs.journalnode.*address Sub-task Resolved Eli Collins
         
        33.
        QJM: Journal format() should reset cached values Sub-task Resolved Todd Lipcon
         
        34.
        QJM: optimize log sync when JN is lagging behind Sub-task Resolved Todd Lipcon
         
        35.
        QJM: SBN fails if selectInputStreams throws RTE Sub-task Resolved Todd Lipcon
         
        36.
        QJM: Make QJM work with security enabled Sub-task Resolved Aaron T. Myers
         
        37.
        QJM: testRecoverAfterDoubleFailures can be flaky due to IPC client caching Sub-task Resolved Todd Lipcon
         
        38.
        QJM: enable TCP_NODELAY for IPC Sub-task Resolved Todd Lipcon
         
        39.
        QJM: Writer-side metrics Sub-task Resolved Todd Lipcon
         
        40.
        QJM: avoid validating log segments on log rolls Sub-task Resolved Todd Lipcon
         
        41.
        QJM: send 'heartbeat' messages to JNs even when they are out-of-sync Sub-task Resolved Todd Lipcon
         
        42.
        QJM: journalnode does not die/log ERROR when keytab is not found in secure mode Sub-task Resolved Unassigned
         
        43.
        QJM: quorum timeout on failover with large log segment Sub-task Resolved Todd Lipcon
         
        44.
        QJM: acceptRecovery should abort current segment Sub-task Resolved Todd Lipcon
         
        45.
        QJM: Failover fails with auth error in secure cluster Sub-task Resolved Todd Lipcon
         
        46.
        JournalNodes log JournalNotFormattedException backtrace error before being formatted Sub-task Resolved Todd Lipcon
         
        47.
        QJM: Add user documentation for QJM Sub-task Resolved Aaron T. Myers
         
        48.
        QJM: Add JournalNode to the start / stop scripts Sub-task Closed Andy Isaacson
         
        49.
        QJM: remove currently unused "md5sum" field. Sub-task Resolved Todd Lipcon
         
        50.
        QJM: misc TODO cleanup, improved log messages, etc Sub-task Resolved Todd Lipcon
         
        51.
        QJM: Make acceptRecovery() atomic Sub-task Resolved Todd Lipcon
         
        52.
        QJM: purge temporary files when no longer within retention period Sub-task Resolved Todd Lipcon
         
        53.
        TestJournalNode#testJournal fails because of test case execution order Sub-task Resolved Chao Shi
         
        54.
        Unclosed FileInputStream in GetJournalEditServlet Sub-task Resolved Chao Shi
         
        55. QJM: Synchronize past log segments to JNs that missed them Sub-task Open Todd Lipcon
         
        56. QJM: Merge newEpoch and prepareRecovery Sub-task Open Suresh Srinivas
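
        Sub-task 2 above adds infrastructure for waiting on a quorum of ListenableFutures. A minimal sketch of that pattern is shown below, assuming the writer fires one asynchronous call per journal daemon and then blocks until a majority succeed or success becomes impossible; QuorumWait and awaitQuorum are hypothetical names, and plain java.util.concurrent.CompletableFuture stands in for Guava's ListenableFuture to keep the example dependency-free.

        import java.util.List;
        import java.util.concurrent.CompletableFuture;
        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicInteger;

        /**
         * Hypothetical helper that waits until a quorum of asynchronous responses
         * has arrived. CompletableFuture stands in for Guava's ListenableFuture.
         */
        public final class QuorumWait {

          /**
           * Blocks until at least {@code quorum} of the futures complete successfully,
           * or until success becomes impossible or the timeout expires.
           * Returns true only if the quorum was actually reached.
           */
          public static <T> boolean awaitQuorum(List<CompletableFuture<T>> calls,
                                                int quorum, long timeoutMs)
              throws InterruptedException {
            int total = calls.size();
            AtomicInteger successes = new AtomicInteger();
            AtomicInteger failures = new AtomicInteger();
            CountDownLatch done = new CountDownLatch(1);

            for (CompletableFuture<T> call : calls) {
              call.whenComplete((result, error) -> {
                if (error == null) {
                  if (successes.incrementAndGet() >= quorum) {
                    done.countDown();               // quorum reached
                  }
                } else if (failures.incrementAndGet() > total - quorum) {
                  done.countDown();                 // quorum can no longer be reached
                }
              });
            }

            boolean finished = done.await(timeoutMs, TimeUnit.MILLISECONDS);
            return finished && successes.get() >= quorum;
          }
        }

        With five journal daemons, for instance, the writer would pass quorum = 3 and treat the batch as committed only when awaitQuorum returns true.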
         

          Activity

          Harsh J added a comment -

          Hey Fengdong - The Fix Version indicates this has also made it to the 2.0.3-alpha release (our next release, coming soon). A date for a stable 2.x release has not been set yet. I'd also like to encourage you to use the mailing lists for this type of question instead of the JIRA in the future.

          Fengdong Yu added a comment -

          This is a great feature for HDFS HA, but the Fix Version is 3.0.0? I hope it can be released with the first stable YARN release.

          Todd Lipcon added a comment -

          Committed backport to branch-2. Thanks for looking at the backport patch, Andrew and Aaron.

          Andrew Purtell added a comment -

          +1

          I've been maintaining a backport of this to branch-2 and the attached patch is for all intents and purposes identical. The necessary proto and pom changes for successful compilation are included. I tried applying this patch on current branch-2 and all QJM tests pass.

          Aaron T. Myers added a comment -

          +1, the branch-2 patch looks good to me.

          I applied the branch-2 patches posted on HDFS-3049 and HDFS-3571, applied the branch-2 patch posted here, and then ran all of the tests that were changed or added in those patches. Everything passed as expected.

          Todd Lipcon added a comment -

          This patch consists of the full merge for branch-2. There were a few conflicts here and there but nothing too major. A few notes I made while doing the backport:

          • Incorporates the change from HDFS-4121 in trunk to add a proto namespace declaration to QJournalProtocol.proto
          • Changes the pom.xml to include QJournalProtocol (part of HDFS-4041 from trunk)

          This backport depends on HDFS-3049 and HDFS-3571 backports - I posted patches for those on the respective JIRAs. I ran the tests in the qjournal package and they passed.

          Todd Lipcon added a comment -

          Reopening for merge to branch-2

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1224 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1224/)
          Merge CHANGES for HDFS-3077 into the main CHANGES.txt file (Revision 1397352)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1397352
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1193 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1193/)
          Merge CHANGES for HDFS-3077 into the main CHANGES.txt file (Revision 1397352)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1397352
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2874 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2874/)
          Merge CHANGES for HDFS-3077 into the main CHANGES.txt file (Revision 1397352)

          Result = FAILURE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1397352
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2849 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2849/)
          Merge CHANGES for HDFS-3077 into the main CHANGES.txt file (Revision 1397352)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1397352
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2911 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2911/)
          Merge CHANGES for HDFS-3077 into the main CHANGES.txt file (Revision 1397352)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1397352
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1223 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1223/)
          Merge HDFS-3077 into trunk (Revision 1396943)

          Result = FAILURE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1396943
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/packages/templates/conf/hadoop-policy.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HDFSPolicyProvider.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/PBHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannelMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/LoggerTooFarBehindException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/SegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalNotFormattedException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalOutOfSyncException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/RequestInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/GetJournalEditServlet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeHttpServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditsDoubleBuffer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourcePolicy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLogManifest.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/BinaryEditsVisitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/BestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/PersistentLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/QJournalProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/hdfs.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs/dfshealth.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/index.html
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/journalstatus.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/proto-journal-web.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/MiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/QJMTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestMiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestEpochsAreUnique.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestIPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQJMWithFaults.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManagerUnit.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestSegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/security/token/block/TestBlockToken.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestInitializeSharedEdits.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestBestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/c++
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/block_forensics
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build-contrib.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/data_join
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/eclipse-plugin
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/index
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/vaidya
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/examples
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/fs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/hdfs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/ipc
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/webapps/job
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1192 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1192/)
          Merge HDFS-3077 into trunk (Revision 1396943)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1396943
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/packages/templates/conf/hadoop-policy.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HDFSPolicyProvider.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/PBHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannelMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/LoggerTooFarBehindException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/SegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalNotFormattedException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalOutOfSyncException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/RequestInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/GetJournalEditServlet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeHttpServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditsDoubleBuffer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourcePolicy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLogManifest.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/BinaryEditsVisitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/BestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/PersistentLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/QJournalProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/hdfs.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs/dfshealth.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/index.html
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/journalstatus.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/proto-journal-web.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/MiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/QJMTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestMiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestEpochsAreUnique.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestIPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQJMWithFaults.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManagerUnit.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestSegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/security/token/block/TestBlockToken.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestInitializeSharedEdits.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestBestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/c++
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/block_forensics
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build-contrib.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/data_join
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/eclipse-plugin
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/index
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/vaidya
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/examples
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/fs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/hdfs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/ipc
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/webapps/job
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2845 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2845/)
          Merge HDFS-3077 into trunk (Revision 1396943)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1396943
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/packages/templates/conf/hadoop-policy.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HDFSPolicyProvider.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/PBHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannelMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/LoggerTooFarBehindException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/SegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalNotFormattedException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalOutOfSyncException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/RequestInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/GetJournalEditServlet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeHttpServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditsDoubleBuffer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourcePolicy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLogManifest.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/BinaryEditsVisitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/BestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/PersistentLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/QJournalProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/hdfs.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs/dfshealth.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/index.html
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/journalstatus.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/proto-journal-web.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/MiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/QJMTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestMiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestEpochsAreUnique.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestIPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQJMWithFaults.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManagerUnit.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestSegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/security/token/block/TestBlockToken.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestInitializeSharedEdits.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestBestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/c++
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/block_forensics
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build-contrib.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/data_join
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/eclipse-plugin
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/index
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/vaidya
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/examples
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/fs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/hdfs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/ipc
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/webapps/job
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2907 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2907/)
          Merge HDFS-3077 into trunk (Revision 1396943)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1396943
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/packages/templates/conf/hadoop-policy.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HDFSPolicyProvider.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/PBHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannelMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/LoggerTooFarBehindException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/SegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalNotFormattedException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalOutOfSyncException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/RequestInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/GetJournalEditServlet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeHttpServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditsDoubleBuffer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourcePolicy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLogManifest.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/BinaryEditsVisitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/BestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/PersistentLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/QJournalProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/hdfs.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs/dfshealth.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/index.html
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/journalstatus.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/proto-journal-web.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/MiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/QJMTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestMiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestEpochsAreUnique.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestIPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQJMWithFaults.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManagerUnit.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestSegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/security/token/block/TestBlockToken.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestInitializeSharedEdits.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestBestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/c++
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/block_forensics
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build-contrib.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/data_join
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/eclipse-plugin
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/index
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/vaidya
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/examples
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/fs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/hdfs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/ipc
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/webapps/job
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2869 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2869/)
          Merge HDFS-3077 into trunk (Revision 1396943)

          Result = FAILURE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1396943
          Files :

          • /hadoop/common/trunk
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/docs
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/packages/templates/conf/hadoop-policy.xml
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/core
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/dev-support/findbugsExcludeFile.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HDFSPolicyProvider.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/PBHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannelMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/LoggerTooFarBehindException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/SegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalNotFormattedException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/JournalOutOfSyncException.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/RequestInfo.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/GetJournalEditServlet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalMetrics.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeHttpServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditsDoubleBuffer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourcePolicy.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RemoteEditLogManifest.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/BinaryEditsVisitor.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/BestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/PersistentLongFile.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/QJournalProtocol.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/proto/hdfs.proto
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/datanode
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/hdfs/dfshealth.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/index.html
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/journal/journalstatus.jsp
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/proto-journal-web.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/secondary
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/hdfs
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/MiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/QJMTestUtil.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestMiniJournalCluster.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/TestNNWithQJM.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestEpochsAreUnique.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestIPCLoggerChannel.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQJMWithFaults.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumCall.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumJournalManagerUnit.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestSegmentRecoveryComparator.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournal.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/server/TestJournalNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/security/token/block/TestBlockToken.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestInitializeSharedEdits.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestBestEffortLongFile.java
          • /hadoop/common/trunk/hadoop-mapreduce-project
          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/conf
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/c++
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/block_forensics
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build-contrib.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/build.xml
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/data_join
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/eclipse-plugin
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/index
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/vaidya
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/examples
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/java
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/fs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/hdfs
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/ipc
          • /hadoop/common/trunk/hadoop-mapreduce-project/src/webapps/job
          • /hadoop/common/trunk/hadoop-project/src/site/site.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm
          Liang Xie added a comment -

          God bless us; really glad to see that QJM will be merged to trunk soon.

          Sanjay Radia added a comment -

          If a JN crashes and is reformatted, then this would imply that the JN has to copy multiple GB worth of data from another JN before it can actively start participating as a destination for new logs. This will take quite some time.

          • Correctness: unless I am missing something, as soon as one JN's disk is reformatted you have lost the property that each segment is replicated on at least Q JNs. Hence it is best to recover all segments on this reformatted JN.
          • Performance: the QJM client does parallel writes, so a single slow JN will not be a problem. Further, the NN batches journal writes, and the batching will increase if QJM slows down.
          • Comparison with local disk: local disks fail much less often than connections to JNs in a busy network, so you are likely to see more segments and more gaps with QJM than with a local-disk-based journal.

          But we can discuss these further, and in the worst case we can make full-vs-partial recovery configurable.
          Please proceed with the merge. Thanks for the work you put into this JIRA.
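
          (For reference on the replication property mentioned above: Q here is a majority quorum. A minimal sketch of that quorum size, assuming the standard floor(N/2) + 1 definition, follows; the method name is illustrative and not taken from the HDFS code base.)

              // Illustrative only: majority-quorum size for N JournalNodes,
              // assuming the usual floor(N/2) + 1 definition.
              static int majoritySize(int numJournalNodes) {
                return numJournalNodes / 2 + 1;   // e.g. 3 JNs -> 2, 5 JNs -> 3
              }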

          Sanjay Radia added a comment -

          Let's merge the newEpoch and prepareRecovery. Given that this works for ZAB, I still fail to see why it cannot work for us. I think because of (1), merging the two steps will no longer be an issue.

          I still don't understand why this is better and not just different. If you and Suresh want to make the change, it's OK by me, but I expect that you will re-run the same validation before committing (e.g., run the randomized fault test a few hundred thousand times). This testing found a bunch of errors in the design before, so any change to the design should go through the same test regimen to make sure we aren't missing some subtlety.

          Okay, I assume that the tests are also in the branch.
          Have filed HDFS-4028 for this.

          Hide
          Todd Lipcon added a comment -

          I filed HDFS-4025 to synchronize segment history.

          Aaron T. Myers added a comment -

          I still don't understand why this is better and not just different.

          +1. I personally find the protocol as it exists now easier to understand than this proposed change would be. It makes more sense to me to separate these two operations, which are in fact different semantic operations with different purposes.

          Todd Lipcon added a comment -

          I did not understand this well. Why are we retrying any request to JournalNodes? Given that most of the requests are not idempotent and cannot be retried, why is this an advantage?

          Currently we don't retry most requests, but it would actually be easy to support retries in most cases, because the client always sends a unique <epoch, ipc serial number> in each call. If the server receives the same epoch and serial number twice in a row, it can safely re-respond with the previous response. This is not true for the newEpoch calls, because this is where we enforce the unique epoch ID.

          As for why we'd want to retry, it seems useful to be able to do so after a small network blip between NN and JNs, for example.
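
          A minimal sketch of the duplicate-detection idea described above, assuming hypothetical class and field names (this is not the actual JournalNode code):

          // Illustrative only: re-respond with the cached response if the same
          // <epoch, ipc serial number> is seen twice in a row.
          class JournalRpcDedup {
            private long lastEpoch = -1;
            private long lastSerial = -1;
            private Object lastResponse;

            synchronized Object handle(long epoch, long serial,
                java.util.concurrent.Callable<Object> op) throws Exception {
              if (epoch == lastEpoch && serial == lastSerial) {
                // The client is retrying: return the previous response instead of
                // re-executing the (possibly non-idempotent) journal operation.
                return lastResponse;
              }
              Object result = op.call();   // execute the journal operation once
              lastEpoch = epoch;
              lastSerial = serial;
              lastResponse = result;
              return result;
            }
          }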

          Recover all transactions. We do this in the same fashion as ZAB rather than the way you suggested "the NN can run the recovery process for each of these earlier segments individually". Note this requires two changes:

          • The protocol message contents change - the response to phase 1 is highest txid and not highest segment's txid. The JN recovers all previous transactions from another JN.
          • When a JN joins an existing writer it first syncs previous segments.

          I'll agree to work on the improvement where we recover previous segments, but I disagree that it should be done as part of the recovery phase. Here are my reasons:

          • Currently with local journals, we don't do this. When an edits directory crashes and becomes available again, we start writing to it without first re-copying all previous segments back into it. This has worked fine for us. So I don't think we need a stronger guarantee on the JN.
          • The NN may maintain a few GB worth of edits due to various retention policies. If a JN crashes and is reformatted, then this would imply that the JN has to copy multiple GB worth of data from another JN before it can actively start participating as a destination for new logs. This will take quite some time.
          • Furthermore, because the JN will be syncing its logs from another JN, we need to make sure the copying is throttled. Otherwise, the restart of a JN will suck up disk and network bandwidth from the other JN which is trying to provide low latency logging for the active namenode. If we didn't throttle it, the transfer and disk IO would balloon the latency for the namespace significantly, which I think it's best to avoid. If we do throttle it (say to 10MB/sec), then syncing several GB of logs will take several minutes, during which time the fault tolerance of the cluster is compromised.
          • Similar to the above, if there are several GB of logs to synchronize, this will impact NN startup (or failover) time a lot.

          I think, instead, the synchronization should be done as a background thread (a rough sketch follows the list below):

          • The thread periodically wakes up and scans for any in-progress segments or gaps in the local directory
          • If it finds one (and it is not the highest numbered segment), then it starts the synchronization process.
            • We reuse existing code to RPC to the other nodes and find one which has the finalized segment, and copy it using the transfer throttler to avoid impacting latency of either the sending or receiving node.
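
          A rough sketch of such a background sync thread, assuming hypothetical helper methods (findMissingOrIncompleteSegments, isHighestSegment, findNodeWithFinalizedSegment, throttledCopy) standing in for the real RPC and transfer code:

          // Illustrative only: the helper methods below are assumptions, not existing QJM APIs.
          class SegmentSyncThread extends Thread {
            private static final long SCAN_INTERVAL_MS = 60_000;
            private static final long THROTTLE_BYTES_PER_SEC = 10L * 1024 * 1024; // e.g. ~10MB/sec

            @Override
            public void run() {
              while (!isInterrupted()) {
                try {
                  // Scan the local journal directory for gaps or stale in-progress segments.
                  for (long segmentTxId : findMissingOrIncompleteSegments()) {
                    if (isHighestSegment(segmentTxId)) {
                      continue; // the highest segment may still be actively written; skip it
                    }
                    // Find a peer JN with the finalized copy and pull it with a throttled
                    // transfer so we don't hurt latency of the active writer.
                    String peer = findNodeWithFinalizedSegment(segmentTxId);
                    if (peer != null) {
                      throttledCopy(peer, segmentTxId, THROTTLE_BYTES_PER_SEC);
                    }
                  }
                  Thread.sleep(SCAN_INTERVAL_MS);
                } catch (InterruptedException ie) {
                  return;
                } catch (Exception e) {
                  // log and try again on the next scan interval
                }
              }
            }

            // Stubs standing in for real implementations:
            private java.util.List<Long> findMissingOrIncompleteSegments() { return java.util.Collections.emptyList(); }
            private boolean isHighestSegment(long txId) { return false; }
            private String findNodeWithFinalizedSegment(long txId) { return null; }
            private void throttledCopy(String peer, long txId, long bytesPerSec) { }
          }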

          Let's merge the newEpoch and prepareRecovery. Given that this works for ZAB, I still fail to see why it cannot work for us. I think because of (1), merging the two steps will no longer be an issue.

          I still don't understand why this is better and not just different. If you and Suresh want to make the change, it's OK by me, but I expect that you will re-run the same validation before committing (e.g. run the randomized fault test a few hundred thousand times). This testing found a bunch of errors in the design before, so any change to the design should go through the same test regimen to make sure we aren't missing some subtlety.

          If the above sounds good to you, let's file the follow-up JIRAs and merge? Thanks.

          Sanjay Radia added a comment -

          I propose we do the following

          1. Recover all transactions. We do this in the same fashion as ZAB rather than the way you suggested "the NN can run the recovery process for each of these earlier segments individually". Note this requires two changes:
            • The protocol message contents change - the response to phase 1 is highest txid and not highest segment's txid. The JN recovers all previous transactions from another JN.
            • When a JN joins an existing writer it first syncs previous segments.
          2. Let's merge the newEpoch and prepareRecovery. Given that this works for ZAB, I still fail to see why it cannot work for us. I think because of (1), merging the two steps will no longer be an issue.

          I am fine doing this in 2 separate jiras after the merge if you prefer. You have already volunteered to work on (1). Suresh or I can work on (2).

          Suresh Srinivas added a comment -

          I wanted to avoid two threads of discussion going on at the same time...

          But, I'm not sure it's simpler or more robust. My reasoning is that starting a new epoch (thus fencing the prior writer) is semantically different than beginning recovery for a particular segment. So I think it's clearer to put them in different pieces of code, even if they could be piggy-backed one on top of the other for future round trips.

          I think it is more robust because it uses fewer messages. Let's say it is not more robust - at least now the protocol starts looking more relatable to ZAB/Paxos. NEWEPOCH + ACK in ZAB, or Prepare + Promise in Paxos, indeed fences/prevents the writer with an older epoch. So I am not sure the separation of fencing makes the design clearer. In my case it was the opposite.

          Another reason is that the current separation allows correct behavior in the face of IPC retries on PrepareRecovery, since PrepareRecovery is idempotent. NewEpoch is necessarily not idempotent, because it is the one IPC that requires a strictly greater epoch id (in order to preserve uniqueness of epochs). This means that, if there's a timeout in the prepare phase, we can safely retry a few times to get past it, while such a policy doesn't work on NewEpoch.

          I did not understand this well. Why are we retrying any request to JournalNodes? Given that most of the requests are not idempotent and cannot be retried, why is this an advantage?

          Todd Lipcon added a comment -

          This wasn't obvious from the HDFS-3077 document and is a limitation of HDFS-3077; don't you agree? Segment holes are operationally messy when manual recovery is necessary in the field.

          Wouldn't the same criticism hold for local storage? If you configure three local disk drives, you get exactly the same behavior: if any drive throws an IOException, that drive is dropped for further edits on that segment. If you have enabled the namedir restore configuration, it will be retried on the next log roll. Otherwise, it is not retried until the NN is entirely restarted.

          The exact same is true of QJM, except that we always retry each JN on a log roll.

          In both cases, on the read side, we look at all available directories (or JNs) and string together a contiguous set of edits from whatever pieces exist.

          Of course, in normal operation without failures, all directories (or JNs) will have a complete history of edits.

          If we want to improve this, let's treat it as an improvement after the merge. I'm happy to work on it.

          Sanjay Radia added a comment -

          The updated journal file isn't sufficient because it doesn't record information about whether it was an accepted recovery proposal or whether it was just left over at the last write. You need to ensure the property that, if the recovery coordinator thinks a value is accepted, then no different recovery will be accepted in the future (otherwise you risk having two different finalized lengths for the same log segment). In order to do so, you need to wait until a quorum of nodes are Finalized before you know that any future recovery will be able to rely only on the finalization state.

          I don't know enough about the details of the ZAB implementation to understand why they can get away without this, if in fact they can. My guess is that it's because the transaction IDs themselves have the epoch number as their high order bits, and hence you can't ever confuse the first txn of epoch N+1 with the last transaction of epoch N.

          Yes, ZAB avoids this because epoch and txid are combined.
          Let's please add the counter-example that you describe above to the doc (if it is already there, just add a comment that the example explains why the extra persistent info is needed).
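
          For the doc, the ZAB convention being referred to can be illustrated in a few lines (a sketch of the usual zxid packing, not QJM code):

          // ZAB-style zxid packing: epoch in the high 32 bits, per-epoch counter in the low 32 bits.
          // Because the epoch is part of every txid, the first txn of epoch N+1 can never be
          // confused with the last txn of epoch N.
          class ZxidExample {
            static long zxid(int epoch, int counter) {
              return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
            }
            static int epochOf(long zxid)   { return (int) (zxid >>> 32); }
            static int counterOf(long zxid) { return (int) (zxid & 0xFFFFFFFFL); }
          }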

          Sanjay Radia added a comment -

          Currently, we only run recovery on the highest txid segment at startup. This means that every segment is stored on at least a quorum of nodes. But it does not mean that previous segments get replicated to all available nodes.

          This wasn't obvious from the HDFS-3077 document and is a limitation of HDFS-3077; don't you agree? Segment holes are operationally messy when manual recovery is necessary in the field.
          Do the following two suggestions make sense?

          1. When a JN joins, it must have synced all previous segments before accepting new writes.
          2. At recovery, sync missing segments (due to 1, a JN may miss several segments, but the missing segments are all at the end; there cannot be holes.)

          If we wanted to improve this [deal with missing segments], however, ... If we merged NewEpoch and PrepareRecovery, that wouldn't be possible.

          Todd, the way segments are playing out in our protocol is scaring me; ZooKeeper's ZAB avoids all this - they recover all previous transactions. It seems that segments have complicated our protocol significantly.
          With the additional subtleties you have pointed out, I am worried that only a few people will be able to maintain this code.

          Todd Lipcon added a comment -

          The JN would need to respond additionally with the rest of the fields in PrepareRecoveryResponseProto (e.g. acceptedInEpoch), as if the client called PrepareRecovery on whatever the highest segment txid was. Then we could evaluate those responses, and only feed those that agreed on the max(segmentTxId) into the recovery comparator.

          But, I'm not sure it's simpler or more robust. My reasoning is that starting a new epoch (thus fencing the prior writer) is semantically different than beginning recovery for a particular segment. So I think it's clearer to put them in different pieces of code, even if they could be piggy-backed one on top of the other for future round trips. Here's one example of why I think it makes more sense to keep them separate:

          Currently, we only run recovery on the highest txid segment at startup. This means that every segment is stored on at least a quorum of nodes. But it does not mean that previous segments get replicated to all available nodes. If we wanted to improve this, however, you could have each of the JNs return a list of segment txids for which they have an incomplete segment. Then, the NN can run the recovery process for each of these earlier segments individually, all from the same epoch. If we merged NewEpoch and PrepareRecovery, that wouldn't be possible.

          Another reason is that the current separation allows correct behavior in the face of IPC retries on PrepareRecovery, since PrepareRecovery is idempotent. NewEpoch is necessarily not idempotent, because it is the one IPC that requires a strictly greater epoch id (in order to preserve uniqueness of epochs). This means that, if there's a timeout in the prepare phase, we can safely retry a few times to get past it, while such a policy doesn't work on NewEpoch.
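
          A minimal sketch of the asymmetry described above, with hypothetical names (the real checks live in the JN-side journal code):

          // Illustrative epoch checks on the JN side.
          class EpochChecks {
            private long lastPromisedEpoch;

            // newEpoch is NOT idempotent: it must see a strictly greater epoch, so a blind
            // retry with the same epoch would be rejected.
            synchronized void newEpoch(long epoch) throws java.io.IOException {
              if (epoch <= lastPromisedEpoch) {
                throw new java.io.IOException(
                    "Epoch " + epoch + " is not greater than last promised " + lastPromisedEpoch);
              }
              lastPromisedEpoch = epoch;
            }

            // prepareRecovery (and other per-segment calls) only require the caller to hold
            // the currently promised epoch, so retrying with the same epoch is safe.
            synchronized void checkRequest(long epoch) throws java.io.IOException {
              if (epoch != lastPromisedEpoch) {
                throw new java.io.IOException(
                    "Wrong epoch " + epoch + ", promised epoch is " + lastPromisedEpoch);
              }
            }
          }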

          Sanjay Radia added a comment -

          Then, depending on which JNs are available during recovery, the prepareRecovery() call is different, and thus, we'd need different responses. It's not really simple to piggy-back the segment info on NewEpoch, because we don't yet know which segment is the one to be recovered (it may be some segment that is only available on one live node)

          Here is how I think it will work in this scenario:

          • NN after restart sends combined NewEpoch + PrepareRecovery
            • JNs respond with lastPromisedEpoch, highestSegmentTxid and highestTxid
          • NN then chooses the segmentTxid and last txid to recover to and sends accept epoch, highestSegmentTxid, highestTxid, and the master to recover from
            • the rest of the steps are the same as in your document.

          This combined request works, right?
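
          A sketch of what the combined response might carry under this proposal (a hypothetical value class, not an existing protocol message):

          // Hypothetical combined NewEpoch + PrepareRecovery response, per the proposal above.
          class CombinedEpochRecoveryResponse {
            final long lastPromisedEpoch;   // promise made by the JN for the new epoch
            final long highestSegmentTxId;  // start txid of the JN's highest segment
            final long highestTxId;         // highest transaction the JN has on disk

            CombinedEpochRecoveryResponse(long lastPromisedEpoch, long highestSegmentTxId, long highestTxId) {
              this.lastPromisedEpoch = lastPromisedEpoch;
              this.highestSegmentTxId = highestSegmentTxId;
              this.highestTxId = highestTxId;
            }
          }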

          Sanjay Radia added a comment -

          Comments so far:

          • JournalNode States - I am a little confused about how the state of JN is captured in the code.
            • The "inWritingState" seems to be captured by (curSegment != null) - this is used fairly often, lets hide this behind a method isJournalSegmentOpen(...)
            • The journal state should be more concrete: Init, writing, recovering (perhaps more than one recovering states)
          • JournalNodes joining a pack
            Can you please explain the following two cases:
            • a JournalNode (previously down) that is joining a set of other JNs, especially when the others are in writing mode.
            • a new JournalNode joining the pack.
          • Exceptions (a rough sketch of the suggested exception shape appears at the end of this list)
            • journal operation (i.e. write)
              shouldn't this throw an EpochException/FencedException? – this exception is critical so that the client side does not retry the operation.
              (perhaps this can be an EpochException which is turned into a FencedException on the client side.)
            • Should there be some other more concrete exceptions that are subclasses of IOException?
          • AsyncLogger
            JavaDoc states "This is essentially a wrapper around {@link QJournalProtocol}

            with the key differences being ..."
            Should this be "This is essentially a wrapper around

            {@link JournalManager }

            with the key differences being ..."

          • Javadoc
            • QJM constructor - document at least the URI.
            • Qprotocol - some methods do not have the parameters documented. Referring to the doc is fine for method semantics in some cases.
            • RequestInfo - document parameters
            • Javadoc for class Journal:
              A JournalNode can manage journals for several independent NN namespaces.
              The Journal class implements a single journal, i.e. the part that stores the journal transactions persistently.
              Each such journal (identified by a journal id) is entirely independent despite being hosted by
              a single JournalNode daemon (i.e. the same JVM).
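
          Regarding the exception point above: a minimal sketch of the suggested exception shape, using the names from the comment (these are not the committed class names):

          import java.io.IOException;

          // Thrown by a JN when a request carries a stale epoch.
          class EpochException extends IOException {
            EpochException(long requestEpoch, long promisedEpoch) {
              super("Request epoch " + requestEpoch + " is older than promised epoch " + promisedEpoch);
            }
          }

          // On the client side an EpochException would surface as a FencedException so the
          // writer knows it has been superseded and must not retry the journal operation.
          class FencedException extends IOException {
            FencedException(Throwable cause) {
              super("Writer has been fenced", cause);
            }
          }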
          Todd Lipcon added a comment -

          Oops, you're correct. I should have said "JN2 crashes before receiving it". But the rest of the scenario stands.

          Ted Yu added a comment -

          JN2: highestSegmentTxId = 101, since it never got any transactions for segment 201

          JN3: highestSegmentTxId = 201, since it got one transaction for segment 201

          Are the TxIds for the above two JNs mixed up?
          Earlier you said 'JN3 crashes before receiving it.'

          Todd Lipcon added a comment -

          the QJM is not replaceable by the local disk journal if QJM is not available, because the local disk journal and QJM will not be consistent

          In a shared-edits setup, the shared edits are always synced ahead of the local disk journals. This means that anything that's committed locally will also be in the QJM. If the QJM fails to sync, then the NN aborts (since the shared edits are marked as "required"). So, they're not "consistent" but you can always take finalized edits from a JN and copy them to a local drive. In the case of some disaster, you can also freely intermingle the files - having that flexibility without having to hex-edit out a header seems prudent.

          More importantly, IMO, you can continue to run tools like the OfflineEditsViewer against edits stored on a JN.
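
          To make the deployment shape concrete, a hedged example of wiring QJM as the required shared-edits directory alongside local edit directories (the JN host names, port and the journal id "mycluster" are placeholders):

          import org.apache.hadoop.conf.Configuration;

          public class QjmSharedEditsConfExample {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Local FileJournalManager edit directories on the NN:
              conf.set("dfs.namenode.edits.dir", "/data/1/dfs/nn,/data/2/dfs/nn");
              // QuorumJournalManager as the shared edits directory:
              conf.set("dfs.namenode.shared.edits.dir",
                  "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
              // Marking the shared dir as required: if the NN cannot sync it, the NN aborts.
              conf.set("dfs.namenode.edits.dir.required",
                  "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
            }
          }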

          We may have to revise the journal abstraction a little to deal with the above situation (independent of storing the epoch in the first entry) since a QJM+localDisk journal is useful.

          This is in fact the way in which we've been doing all of our HA testing (and now some customer deploys). We use local FileJournalManagers on each NN, and the QuorumJournalManager as shared edits. Per above, this works fine and doesn't have any "lost edit" issues.

          Can you be specific about the consistency issue you're foreseeing here?

          Todd, the change is small and I am trying to help you here.

          Maybe I'm mis-understanding the change. More below on this...

          Recall in HDFS-1073 you did not want to use transaction ids or name the log files using a transaction id range, and you argued against this for quite a while. As I predicted, txids have become a cornerstone of HA and managing journals.

          To be clear, the 1073 design was always using transaction IDs; it was just a matter of the file naming that we argued about. But I don't think it's productive to argue about the past.

          You have argued that prepare-recovery, using the epoch number from previous newEpoch, is like multi-paxos - not sure if multi-paxos is warranted here.

          Can you explain what you mean by "not sure if multi-paxos is warranted here?" I just meant that, similar to multi-paxos, you can use an earlier promise to verify all future messages against that earlier epoch ID. Otherwise each operation would require its own new epoch, and that's clearly not what we want.

          The response of newEpoch() is the highest txId, while the response to PrepareRecovery is the state of the highest segment, and optionally additional info if there was a previous recovery in progress

          I think there is some confusion here. The response to newEpoch is the highest segment txid, but the highest segment txid may not match up across all of the JNs. On a 3-JN setup, you may have three different responses to NewEpoch. For example, the following scenario:

          1. NN writing segment starting at txid 1
          2. JN1 crashes after txid 50
          3. NN successfully rolls, starts txid 101
          4. NN successfully finalizes segment 101-200. Starts segment 201
          5. NN sends txid 201 to JN2 and JN3, but JN3 crashes before receiving it.
          6. Everyone shuts down and restarts.

          The current state is then:
          JN1: highestSegmentTxId = 1, since it had crashed during that segment
          JN2: highestSegmentTxId = 101, since it never got any transactions for segment 201
          JN3: highestSegmentTxId = 201, since it got one transaction for segment 201

          Then, depending on which JNs are available during recovery, the prepareRecovery() call is different, and thus, we'd need different responses. It's not really simple to piggy-back the segment info on NewEpoch, because we don't yet know which segment is the one to be recovered (it may be some segment that is only available on one live node)

          Am I misunderstanding the concrete change that you're proposing? Maybe you can post a patch?

          In step 3b you state that recovery metadata is created and then deleted in step 4. Isn't the updated journal file sufficient? In Paxos, the protocol has essentially completed when phase 2 is completed, i.e. when a quorum of journals have learned the new value. From what I understand, even in ZAB the journal is updated at that stage and no separate metadata is persisted.

          The updated journal file isn't sufficient because it doesn't record information about whether it was an accepted recovery proposal or whether it was just left over at the last write. You need to ensure the property that, if the recovery coordinator thinks a value is accepted, then no different recovery will be accepted in the future (otherwise you risk having two different finalized lengths for the same log segment). In order to do so, you need to wait until a quorum of nodes are Finalized before you know that any future recovery will be able to rely only on the finalization state.

          I don't know enough about the details of the ZAB implementation to understand why they can get away without this, if in fact they can. My guess is that it's because the transaction IDs themselves have the epoch number as their high order bits, and hence you can't ever confuse the first txn of epoch N+1 with the last transaction of epoch N.

          The final step (finalize-segment or ZAB's commit) is really to let all the JNs know that the new writer is the leader and that they can publish the data to other readers (the standby in our case).

          Agreed. At this point we delete the metadata for recovery, but we don't necessarily have to. It's just a convenient place to do the cleanup.
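
          A minimal sketch of the persistence ordering being argued for here, with hypothetical names (the real implementation stores this state alongside the journal on each JN):

          // Illustrative only: persist the accepted recovery proposal durably *before* acking,
          // so a future recovery coordinator can distinguish "accepted during recovery" from
          // "left over from the last ordinary write".
          class AcceptedRecoveryStore {
            private final java.io.File metaFile;

            AcceptedRecoveryStore(java.io.File dir, long segmentTxId) {
              this.metaFile = new java.io.File(dir, "accepted-recovery-" + segmentTxId);
            }

            synchronized void accept(long acceptedInEpoch, long endTxId) throws java.io.IOException {
              String data = acceptedInEpoch + " " + endTxId + "\n";
              java.io.File tmp = new java.io.File(metaFile.getParentFile(), metaFile.getName() + ".tmp");
              try (java.io.FileOutputStream out = new java.io.FileOutputStream(tmp)) {
                out.write(data.getBytes(java.nio.charset.StandardCharsets.UTF_8));
                out.getFD().sync();   // fsync before renaming into place
              }
              if (!tmp.renameTo(metaFile)) {
                throw new java.io.IOException("Could not persist accepted recovery to " + metaFile);
              }
              // Only after this durable write does the JN ack the acceptRecovery() call.
            }

            // Once a quorum of JNs has finalized the segment, the metadata can be cleaned up.
            synchronized void cleanupAfterFinalize() {
              metaFile.delete();
            }
          }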

          Sanjay Radia added a comment -

          Wanted to get some clarification on what is persisted beyond the journals themselves.

          • In step 3b you state that recovery metadata is created and then deleted in step 4. Isn't the updated journal file sufficient? In Paxos, the protocol has essentially completed when phase 2 is completed, i.e. when a quorum of journals have learned the new value. From what I understand, even in ZAB the journal is updated at that stage and no separate metadata is persisted.
          • The final step (finalize-segment or ZAB's commit) is really to let all the JNs know that the new writer is the leader and that they can publish the data to other readers (the standby in our case).
          Sanjay Radia added a comment -

          To be perfectly frank, I'm not interested in changing the design substantially ...

          Todd, the change is small and I am trying to help you here. Recall in HDFS-1073 you did not want to use transaction ids or name the log files using a transaction id range, and you argued against this for quite a while. As I predicted, txids have become a cornerstone of HA and managing journals.

          The only real change I have proposed to the protocol is to merge the first 2 operations.

          • You have argued that prepare-recovery, using the epoch number from previous newEpoch, is like multi-paxos - not sure if multi-paxos is warranted here.
          • The response of newEpoch() is the highest txId, while the response to PrepareRecovery is the state of the highest segment, and optionally additional info if there was a previous recovery in progress.
          • BTW I am okay with the document describing the protocol by mapping it to both paxos and zab for the reader.
          Sanjay Radia added a comment -

          Wrt the epoch number in the edits file: you raise the issue of consistency for a JM that writes both to a local disk journal and to QJM (a QJM+localDisk journal); this is a useful journal:

          • the QJM is not replaceable by the local disk journal if QJM is not available, because the local disk journal and QJM will not be consistent
          • during recovery, the journal abstraction doesn't provide any mechanisms for maintaining consistency across such "sister" journals.
            Hence if we want to allow a QJM+localDisk journal we will need to revise the journal manager abstraction a little. BTW one could create a QJM+localDisk journal whose implementation does exactly that with some hacks.

          We may have to revise the journal abstraction a little to deal with the above situation (independent of storing the epoch in the first entry) since a QJM+localDisk journal is useful. I suspect that the journal abstraction may need to have an operation called recover anyway. I am not sure what change is needed exactly in the abstraction, but this particular issue can be addressed in the next month or so.

          Suresh Srinivas added a comment -

          I'm not interested in changing the design substantially at this point without a good reason. I've put several weeks into testing this design...

          Todd, several people have put in a bunch of time on this as well, not just in this jira, but also in related HA jiras and HDFS-3092, which this jira incorporates.

          Just reviewing this design has required catching up with the subtleties of Paxos and ZAB. So the comments made here are not frivolous, but come with quite a bit of investment of time. The design started out saying it is ZAB-like, then added Paxos, and since then has gone back and forth. The interest in getting this right is to ensure a critical component of HDFS is right.

          The reason to do this is to make sure not only that the design is correct, but that it is simple and completely documented. This is more important given the complexity of this solution and the need for other HDFS contributors to understand it so it can be maintained. This has also been brought up by Konstantin on the mailing list.

          I know you have spent a lot of time on this. I am okay if you do not want to make further changes. But I think some of the changes that Sanjay is proposing are worth considering, if they simplify the design. In general, a reviewer's feedback on a jira is not just to show design bugs, but also to provide feedback on other things, such as maintainability, simplicity, etc.

          If you do not have cycles, it is understandable. Someone else can pick up the work that comes out of these discussions.

          Todd Lipcon added a comment -

          You raised the objection that this breaks the Journal abstraction. Think of this as an "info-field" of the special no-op transaction where the journal-impl-specific information is stored;

          This would be problematic for several reasons:
          1) "rollEdits" is not a JournalManager operation. The JournalManager treats edits as opaque things written by the higher level FSEditLog code. Thus it cannot inject/modify the operations.
          2) If the JournalManager is meant to modify the transaction content, this implies that two different JournalManagers would produce different values for the same transaction. Thus, the locally-stored edit log segment would differ in contents from a remotely stored edit log segment. This makes me really nervous: we should see multiple copies of a log as identical replicas of the same information, not adulterated with any storage-specific info.
          3) In order to address the above issues, we'd have to add QJM-specific code into the NameNode, and introduce the concept of epochs into the generic interfaces. This "bleed" of QJM concepts into the main source code is something we are explicitly trying to avoid by introducing the JournalManager API.

          I am also thinking back to our discussion last summer during the HDFS-1073 work (particularly HDFS-2018 and HDFS-1580), where you had argued that segments themselves should be considered an implementation detail of the JournalManager. So, adding information which is required for correctness into the START_LOG_SEGMENT written by the NameNode layer takes us farther away from that goal instead of closer to it.

          Suresh and I have been looking at the design and compared it to Paxos and Zab in detail and have concluded that the design is closer to ZAB than Paxos...

          Sure, it's very close to ZAB as well, which I mentioned above in the discussion. I honestly see ZAB and Paxos as basically the same thing – ZAB (and QJM) use something very close to Paxos when they switch epochs. The main difference between QJM and ZAB is that ZAB actually maintains full histories at each of the nodes, because it needs to implement a state machine (the database state). In contrast, QJM allows a journal node to get kicked out for one segment, then join again in the next segment even if it's missing some txns in between. This is OK because it is not trying to maintain state, just act as storage, and IMO it makes things simpler. This difference is enough that I don't think we should explicitly say that this is an implementation of ZAB.

          To be perfectly frank, I'm not interested in changing the design substantially at this point without a good reason. I've put several weeks into testing this design, and unless you can find a counter-example or a bug, I am against changing it. If you want to do the work and produce a patch which makes the code simpler, and it can pass 20,000 runs of the randomized fault test, I'd be happy to review your patch. Or if you can point a flaw out in the current design that's addressed by your proposed change, I'll do the work to address it. But as is, I am confident that the design is correct and don't have more time to allocate to shifting things around unless there's a bug or another real problem which would negatively affect its usage.

          Sanjay Radia added a comment -

          Suresh and I have been looking at the design and compared it to Paxos and Zab in detail and have concluded that the design is closer to ZAB than Paxos.

          • In both cases the recovery establishes a leader and syncs missing transactions across a number of journal-participants. At the end the leader writes future transactions to the journal-participants.
          • The txid is used in both cases (called zxid in ZAB) in similar ways except in ZAB the epoch is part of the transaction id.
          • The recovery process discovers the highest txid, and then arranges to sync the missing transactions across the participant journals.
          • The steps are very similar, except that the HDFS-3077 design has an extra initial step. If newEpoch and prepareRecovery are merged, then HDFS-3077 will become the same as ZAB.

          The proposal is to merge the first 2 steps and just model this after ZAB and use the ZAB terminology. We have discussed some of the implementation details with Mahadev of the ZK team and can benefit from insights into some of ZK's lower-level details and the corner cases they deal with. There are some details on what is persisted and when it is persisted that we would like to discuss further.

          Sanjay Radia added a comment -

          Storing Epoch number in first transaction
          Todd I think we should store the epoch number in the no-op transactions once consensus is reached.

          • You raised the objection that this breaks the Journal abstraction. Think of this as an "info-field" of the special no-op transaction where the journal-impl-specific information is stored; this is fine since the roll operation automatically adds the no-op transaction, and as part of that the journal impl that is doing the roll can record information in the info-field of the transaction. Indeed it should also store the writer's IP. Further, even the local disk journal impl should record the writer's IP.
          • With respect to your comment on copying the journal from a NN's local disk to QJM's journal: if you are doing this because the persistent state of the QJM is corrupt, then it does not work in 3077's current impl, because one also has to recreate the epoch file manually.
          • As I mentioned earlier - Zookeeper also stores the epoch - the zxid contains the epoch number. (BTW 3077's design is fairly close to ZAB and zookeeper's protocol - but more on this in another comment.)
          Sanjay Radia added a comment -

          If we open a log and find that it has no transactions (ie not even the "noop").. safely remove

          Thanks for the clarification – I thought you were removing the file even when it contained the no-op operation.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3262//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          New rev of PDF addresses the above comments:

          • Clarifies the mapping to Paxos, being more specific about the phase numbers
          • Clarifies example 2.10.6
          Todd Lipcon added a comment -

          You list 4 steps and have comments that steps 2 and 4 correspond with the two phases of Paxos. However you add step 3 in the middle which is not part of Paxos.

          Not sure I follow. Here's the mapping, using "Paxos Made Simple" as a reference (http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf)

          Step 2's request (PrepareRecovery) corresponds to Prepare (Phase 1a)
          Step 2's response corresponds to "Promise" (Phase 1b)
          Step 3's request (AcceptRecovery) corresponds to "Accept!" (Phase 2a)
          Step 3's response corresponds to "Accepted" (Phase 2b)

          After this point, Paxos itself is actually complete - the final consensus value is learned once a quorum of nodes have completed Phase 2 for a given proposal.

          Step 4's request (Finalize) corresponds to "Learned" or "Commit" which are generally sent to "Learners" after the consensus phase completes. This is described in Section 2.3 of the above-referenced paper:

          More generally, the acceptors could respond with their acceptances to
          some set of distinguished learners, each of which can then inform all the
          learners when a value has been chosen.

          In our case, the client doubles as the "distinguished learner", and the JNs all double as the other Learners.

          Another concern is that this design has more steps than Paxos, which is generally considered complicated to get right

          Per above, I don't think there are any "extra" steps. It is a fairly faithful implementation of Paxos except that we use a side-channel (HTTP from node-to-node) instead of the main message passing channel to actually communicate the value to be decided.
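
          To make the mapping concrete, here is a rough sketch of a recovery round from the writer's side. The class and method names are hypothetical, not the actual QJM code; quorum waiting, error handling, and the full source-selection rules of section 2.9 are reduced to plain loops and a placeholder comparison.

            import java.util.List;

            // Hypothetical sketch of the recovery round discussed above.
            // prepareRecovery/acceptRecovery/finalizeLogSegment line up with
            // Paxos Phase 1a/1b, Phase 2a/2b, and the post-consensus "learn".
            class RecoveryRoundSketch {

              static class SegmentState {
                final long firstTxId, lastTxId;
                SegmentState(long firstTxId, long lastTxId) {
                  this.firstTxId = firstTxId;
                  this.lastTxId = lastTxId;
                }
              }

              interface JournalChannel {
                // Phase 1a; the return value is the JN's Phase 1b "promise".
                SegmentState prepareRecovery(long segmentTxId) throws Exception;
                // Phase 2a; returning normally is the Phase 2b "accepted".
                void acceptRecovery(SegmentState proposal, String fromUrl) throws Exception;
                // Post-consensus "learn"/"commit" of the chosen value.
                void finalizeLogSegment(long firstTxId, long lastTxId) throws Exception;
              }

              void recover(long segmentTxId, List<JournalChannel> journals) throws Exception {
                // Phase 1: gather promises. (The real code waits for a quorum of
                // asynchronous responses; the selection rules are in section 2.9,
                // not just "longest segment wins" as simplified here.)
                SegmentState chosen = null;
                for (JournalChannel jn : journals) {
                  SegmentState s = jn.prepareRecovery(segmentTxId);
                  if (chosen == null || s.lastTxId > chosen.lastTxId) {
                    chosen = s;
                  }
                }
                // Phase 2: propose the chosen segment; consensus is reached once a
                // quorum has accepted it.
                for (JournalChannel jn : journals) {
                  jn.acceptRecovery(chosen, "http://chosen-jn.example.com/segment");  // placeholder URL
                }
                // Learn/commit: tell everyone the decided value is final.
                for (JournalChannel jn : journals) {
                  jn.finalizeLogSegment(chosen.firstTxId, chosen.lastTxId);
                }
              }
            }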

          I am also confident that the implementation is correct, after many CPU-years of randomized fault testing. If you can find any counter-example I would be really glad to hear about it, and will immediately drop everything to investigate.

          > This is the same behavior that a NameNode takes at startup today in branch-2 – if there is an entirely empty edit log file, it is removed at startup.

          Why did we do this in the normal local disk journal? Doesn't the no-op transaction handle the "empty" case? We had added the no-op transaction to deal with repeated restarts and also repeated rolls.

          The no-op transaction isn't added atomically when the file is created. The file is created empty, then the header is appended, then the noop (START_LOG_SEGMENT) txn is appended. The NN could crash between any of these steps. If we open a log and find that it has no transactions (ie not even the "noop"), then we know we crashed right after opening the log but before writing anything (maybe not even the header). So, we can safely remove it and pretend the log was never started.
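
          As a rough sketch of that cleanup check (a hypothetical class, not the real FileJournalManager code; only the zero-length case is shown, since detecting "header but no transactions" would require the edits parser):

            import java.io.File;
            import java.io.IOException;

            // Sketch of the startup cleanup described above.
            class EmptySegmentCleanup {
              /**
               * A crash between creating edits_inprogress_N and writing the header +
               * START_LOG_SEGMENT txn leaves an empty file. Nothing could have been
               * committed to it, so it is safe to delete it and pretend the segment
               * was never started.
               */
              static void maybeDiscard(File inProgressSegment) throws IOException {
                if (inProgressSegment.isFile() && inProgressSegment.length() == 0) {
                  if (!inProgressSegment.delete()) {
                    throw new IOException("Could not delete empty segment: " + inProgressSegment);
                  }
                }
              }
            }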

          Sanjay Radia added a comment -

          2.10.6 ... Is this explanation a little clearer?

          Yes, thanks.

          This is the same behavior that a NameNode takes at startup today in branch-2 – if there is an entirely empty edit log file, it is removed at startup.

          Why did we do this in the normal local disk journal? Doesn't the no-op transaction handle the "empty" case? We had added the no-op transaction to deal with repeated restarts and also repeated rolls.

          Sanjay Radia added a comment -

          Section 2.8 Recovery Algorithm
          You list 4 steps and have comments that steps 2 and 4 correspond with the two phases of Paxos. However you add step 3 in the middle which is not part of Paxos. Hence I am not sure if the informal arguments of correctness that you mention in the document apply. Another concern is that this design has more steps than Paxos, which is generally considered complicated to get right.

          Todd Lipcon added a comment -

          What I dislike is that in the example in 2.10.6, recovery completed successfully, the new writer starts writing normally and then fails - you delete a segment that was successfully finalized at a quorum of JNs - this is the only example of a successful transaction being deleted. In this situation won't the normal processing select 151 to be the common transaction and hence result in segment edits_151_151? Or is it that you can't distinguish between this case and the case where the quorum failed?

          I think maybe the description in 2.10.6 is unclear/confusing. It isn't deleting any segment which has been finalized on a quorum of JNs. Here's the full sequence of events:

          1. NN1 is writing, and successfully calls sendEdits(150) to write txid 150. All nodes ACK.
          2. NN1 sends finalizeSegment(1-150). All nodes ACK.
          3. NN1 sends startLogSegment(151). All nodes ACK and create files called edits_inprogress_151, which at this point are actually empty.
          4. NN1 sends sendEdits(151-153) with the first batch of edits for the new log file. It only reaches JN1 since NN1 crashes. Therefore transactions 151-153 were not "committed", and may either be recovered or not recovered.
          5. NN2 starts recovery. JN2 and JN3 at this point have the empty log file edits_inprogress_151. Because it's empty, they delete it. This is the same behavior that a NameNode takes at startup today in branch-2 – if there is an entirely empty edit log file, it is removed at startup.
          6. NN2 for some reason does not talk to JN1 here (most likely because JN1 was located on the same node as NN1 which just crashed). So, it sees that there were no valid transactions after txid 150, and does not need to perform any recovery.
          7. NN2 starts writing again at txid=151. It's able to successfully write because it can speak to JN2 and JN3 still. Imagine that it writes just one txn (151) and then crashes.
          8. At this point, we are in the state referred to by the second table in Section 2.10.6: all three nodes have edits_inprogress_151, and JN1 has more transactions than JN2 and JN3. Yet we should use JN2 or JN3 as the recovery source, since it is from a newer writer (ie the committed txns are on those nodes, not on JN1).

          Is this explanation a little clearer? If so I will amend the doc. If I'm misunderstanding your question, can you point out the situation where you think it might lose a committed transaction?
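
          A small sketch of the source-selection rule that makes step 8 come out right (hypothetical types; the epoch numbers 1 and 2 are illustrative, and the real rules are spelled out in section 2.9): the comparison goes by writer epoch first, and only then by segment length, so JN1's longer-but-stale segment loses.

            import java.util.Arrays;
            import java.util.Comparator;
            import java.util.List;

            // Sketch of "which JN do we recover from?" for the scenario above.
            class RecoverySourceChoice {
              static class Candidate {
                final String jn;
                final long maxSeenEpoch;  // max(lastWriterEpoch, epoch of any accepted recovery)
                final long lastTxId;
                Candidate(String jn, long maxSeenEpoch, long lastTxId) {
                  this.jn = jn;
                  this.maxSeenEpoch = maxSeenEpoch;
                  this.lastTxId = lastTxId;
                }
              }

              // Higher epoch always wins; segment length only breaks ties within an epoch.
              static Candidate choose(List<Candidate> responders) {
                return responders.stream()
                    .max(Comparator.<Candidate>comparingLong(c -> c.maxSeenEpoch)
                        .thenComparingLong(c -> c.lastTxId))
                    .orElseThrow(IllegalStateException::new);
              }

              public static void main(String[] args) {
                // Step 8: JN1 still has 151-153 from the crashed writer (epoch 1);
                // JN2 and JN3 have only txid 151, but from the newer writer (epoch 2).
                Candidate best = choose(Arrays.asList(
                    new Candidate("JN1", 1, 153),
                    new Candidate("JN2", 2, 151),
                    new Candidate("JN3", 2, 151)));
                System.out.println(best.jn);  // JN2 (JN3 would be an equally valid source)
              }
            }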

          Sanjay Radia added a comment -

          Section 2.10.6

          Call finalizeSegment(1-150) on all JNs, they all succeed
          Call startLogSegment(151) on all JNs, they all succeed
          Call logEdits(151-153), but it only goes to JN1 before crashing

          Based on your description, 2.10.6 should be labelled as "Inconsistency on first batch of log - prior quorum succeeded".

          I prefer to think of recovery as having one job: closing off the latest log segment. At that point, the writer continues on with writing the next segment using the usual APIs.

          I don't have a problem with the newSegment creating the no-op transaction and then the writer continuing using the normal APIs.
          What I dislike is that in the example in 2.10.6, recovery completed successfully, the new writer starts writing normally and then fails - you delete a segment that was successfully finalized at a quorum of JNs - this is the only example of a successful transaction being deleted. In this situation won't the normal processing select 151 to be the common transaction and hence result in segment edits_151_151? Or is it that you can't distinguish between this case and the case where the quorum failed?

          Todd Lipcon added a comment -

          BTW, regarding field debuggability, there are several decisions to help that:
          1) The protobuf-serialized files (the Paxos accepted-recovery files) include the protobuf in serialized binary form followed by the same data in text form, so you can read them with a standard tool like "strings" or "less"
          2) The epoch/promise files are text so you can read them with standard tools.

          Combined with the log messages that are output on any new segment, you can match that against file ctimes easily as well.

          Todd Lipcon added a comment -

          Persisting epoch number and recovery state.
          You store this on the local disk separate from the segment. The epoch number (and also the IP address of the writer) should be stored in the segment's first record. I believe ZK's txId contains the epoch number; I am not suggesting extending the txId, but putting the epoch number in the first transaction is roughly equivalent. This avoids the extra local disk data structures, and is useful for debugging in the field.

          This would involve a new edit log format for the JournalNode's edit storage, distinct from the edit log format used on the local disk. I wanted to avoid that, so that you could freely copy edit log files to/from JournalNodes to old-style shared storage, run existing tools like the offline edits viewer, etc. We can't simply put it in the first transaction, because that would imply that the JournalManager sets the transaction's contents, breaking the abstraction that this is just edits storage (per my earlier comment).

          Sanjay Radia added a comment -

          Persisting epoch number and recovery state.
          You store this on the local disk separate from the segment. The epoch number (and also the IP address of the writer) should be stored in the segment's first record. I believe ZK's txId contains the epoch number; I am not suggesting extending the txId, but putting the epoch number in the first transaction is roughly equivalent. This avoids the extra local disk data structures, and is useful for debugging in the field.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547258/qjournal-design.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3250//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Updated rev per above

          Todd Lipcon added a comment -

          The use of the term "master" can be confused with "recovery coordinator". The master JN is the source for the journal synchronization. The section title uses the word "recovery" - this word is also used in section 2.8.

          clarify this in the section

          use another term: "journal-sync-master(s)" or "journal-sync-source(s)" - I prefer the word "source"

          Good call. I changed "master" to "source" throughout.

          Q. Is synchronization needed before proceeding when you have a quorum of JNs that have all the transactions? That is, in that case does the "acceptRecovery" operation (section 2.8) force the last log segment in each of the JNs to be consistently finalized? Either way, please clarify this in section 2.9. (Clearly all unsync'ed JNs have to sync with one of the other JNs.) I think you are using the design and code of HDFS-3092 but at times I am not sure when I read this section.

          I added the following:

          Note that there may be multiple segments (and respective JournalNodes) that are determined
          to be equally good sources by the above rules. For example, if all JournalNodes committed
          the most recent transaction and no further transactions were partially proposed, all
          JournalNodes would have identical states.
          
          In this case, the current implementation chooses the recovery source arbitrarily between
          the equal options. When a JournalNode receives an {\tt acceptRecovery()} RPC for a segment
          and sees that it already has an identical segment stored on its disk, it does not waste
          any effort in downloading the log from the remote node. So, in such a case that all
          JournalNodes have equal segments, no log data need be transferred for recovery.
          

          The reason why the recovery protocol is still followed when all candidates are equal is that
          not all JNs may have responded. So, even if two JNs reply with equal segments, there may be
          a third JN (crashed) which has a different segment length. Using a consistent recovery protocol
          handles this case without any special-casing, so that a future recovery won't conflict.


          Section 2.10.6

          How can JN1 get new transactions (151, 152, 153) till finalization has been achieved on a quorum JNs?

          Or do you mean that finalize succeeded and all JNs created "edits-inprogress-151" and then "edits-inprogress-151" got deleted from JN2 and JN3 because they had no transactions in them as described in 2.10.5?

          Yep. The scenario is:

          • Call finalizeSegment(1-150) on all JNs, they all succeed
          • Call startLogSegment(151) on all JNs, they all succeed
          • Call logEdits(151-153), but it only goes to JN1 before crashing

          At the end of recovery, can we guarantee that a new open segment is created with one no-op transaction in it?

          I think this actually complicates things, because then we have more edge conditions to consider – all the possible failures of this additional write. I prefer to think of recovery as having one job: closing off the latest log segment. At that point, the writer continues on with writing the next segment using the usual APIs.

          If we had the recovery protocol actually insert a no-op segment on its own, that would break the abstraction here that the JournalManager is just in charge of storage. It never generates transactions itself.

          BTW I thought that with HDFS-1073 each segment has an initial no-op transaction (BTW did we have a similar close-segment transaction in HDFS-1073?); did this change as part of HDFS-3077?

          Yes, it does have an initial no-op transaction, but the API is such that there are two separate calls made on the JournalManager: startLogSegment() which opens the file, and logEdit(START_LOG_SEGMENT) which writes that no-op transaction. Really the "startLogSegment" has no semantic value itself, which is why I chose to just roll it back during recovery if the JournalNode has an entirely empty segment (ie it crashed between startLogSegment and the first no-op transaction).
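
          In code, that two-call shape looks roughly like the sketch below (hypothetical interfaces, not the actual JournalManager/EditLogOutputStream signatures): the storage layer only opens the file, and the no-op marker is then logged by the higher layer like any other transaction.

            import java.io.IOException;

            // Sketch of the two-step segment start described above.
            class SegmentStartSketch {
              interface JournalManagerLike {
                EditStreamLike startLogSegment(long firstTxId) throws IOException;
              }
              interface EditStreamLike {
                void writeOp(String op, long txId) throws IOException;
                void flushAndSync() throws IOException;
              }

              static void roll(JournalManagerLike jm, long nextTxId) throws IOException {
                // Step 1: storage-level open. A crash right after this leaves an empty
                // file, which recovery later treats as "never started" (see 2.10.5).
                EditStreamLike out = jm.startLogSegment(nextTxId);
                // Step 2: the higher layer logs the no-op marker as a normal transaction.
                out.writeOp("START_LOG_SEGMENT", nextTxId);
                out.flushAndSync();
              }
            }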

          Sanjay Radia added a comment -

          Section 2.10.6

          • How can JN1 get new transactions (151, 152, 153) till finalization has been achieved on a quorum JNs?
            Or do you mean that finalize succeeded and all JNs created "edits-inprogress-151" and then "edits-inprogress-151" got deleted from JN2 and JN3 because they had no transactions in them as described in 2.10.5?
          • At the end of recovery, can we guarantee that a new open segment is created with one no-op transaction in it?
            BTW I thought that with HDFS-1073 each segment has an initial no-op transaction (BTW did we have a similar close-segment transaction in HDFS-1073?); did this change as part of HDFS-3077?
          Sanjay Radia added a comment -

          Section 2.9
          The use of the term "master" can be confused with "recovery coordinator". The master JN is the source for the journal synchronization. The section title uses the word "recovery" - this word is also used in section 2.8.

          • clarify this in the section
          • the journal can be sync'ed from multiple sources - shouldn't this be a list of "Master(s)"?
          • use another term: "journal-sync-master(s)" or "journal-sync-source(s)" - I prefer the word "source"
          • Change title of the section to be "Journal Synchronization - Choosing the Source" (or Master?).

          Q. Is synchronization needed before proceeding when you have a quorum of JNs that have all the transactions? That is, in that case does the "acceptRecovery" operation (section 2.8) force the last log segment in each of the JNs to be consistently finalized? Either way, please clarify this in section 2.9. (Clearly all unsync'ed JNs have to sync with one of the other JNs.) I think you are using the design and code of HDFS-3092 but at times I am not sure when I read this section.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12546920/qjournal-design.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3245//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Attached new rev of design doc per above comments.

          Todd Lipcon added a comment -

          err... my last paragraph got truncated:

          I think some of the above comments may explain this - in particular the reason why you need the idea of accepting recovery prior to committing it. Otherwise, I'll turn the question on its head: why do you think you can get away with so few steps? Perhaps it's possible in a system that requires every write to go to all nodes, but I don't think it's possible in a quorum write system, since at recovery time, you may end with several different length logs that need to be reconciled, and reconciled consistently regardless of how many writers attempt to do recovery.

          Todd Lipcon added a comment -

          "Henceforth we will refer to these nodes as replicas." Please use a different term as replicas is heavily used in the context of block replica in HDFS. Perhaps Journal Replicas may be a better name.

          Fixed

          "Before taking action in response to any RPC, the JournalNode checks the requester's epoch number against its lastPromisedEpoch variable. If the requester's epoch is lower, then it will reject the request". This is only true for all the RPCs other than newEpoch. Further it should say if the requester's epoch is not equal to lastPromisedEpoch the request is rejected.

          Fixed. Actually, if any request comes with a higher epoch than lastPromisedEpoch, then the JN accepts the request and also updates lastPromisedEpoch. This allows a JournalNode to join back into the quorum properly even if it was down when the writer became active.
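
          Roughly, that check on the JN side looks like the sketch below (hypothetical names, not the real JournalNode code; persisting the promise is stubbed out, and in practice it must be durable before the request is acted on):

            import java.io.IOException;

            // Sketch of the per-request epoch check described above.
            class EpochCheckSketch {
              private long lastPromisedEpoch;  // reloaded from local disk at startup

              synchronized void checkRequest(long requestEpoch) throws IOException {
                if (requestEpoch < lastPromisedEpoch) {
                  // Stale writer: reject so it cannot touch the logs any more.
                  throw new IOException("Epoch " + requestEpoch
                      + " is less than the last promised epoch " + lastPromisedEpoch);
                }
                if (requestEpoch > lastPromisedEpoch) {
                  // A JN that missed newEpoch() (e.g. it was down when the writer
                  // became active) catches up here and rejoins the quorum.
                  lastPromisedEpoch = requestEpoch;
                  persistPromise(requestEpoch);  // write the promise to disk before serving
                }
                // requestEpoch == lastPromisedEpoch: proceed with the RPC.
              }

              private void persistPromise(long epoch) throws IOException {
                // Durably record the promise on local disk; elided in this sketch.
              }
            }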

          In step 3, you mean newEpoch is sent to "JNs" and not QJMs. Rest of the description should also read "JNs" instead of "QJMs".

          Thanks, fixed.

          In step 4. "Otherwise, it aborts the attempt to become the active writer." What is the state of QJM after this at the namenode? More details needed.

          Clarified:

           Otherwise, it aborts the attempt to become the active writer by throwing
           an IOException. This will be handled by the NameNode in the same fashion as a failure to write
           to an NFS mount -- if the QJM is being used as a shared edits volume, it will cause the NN to
           abort.
          

          Section 2.6, bullet 3 - is synchronization on quorum nodes done for only the last segments or all the segments (required for a given fsimage?). Based on the answer, section 2.8 might require updates.

          It only synchronizes the last log segment. Any earlier segments are already guaranteed to be finalized on a quorum of nodes (either by a postcondition of the recovery process, or by the fact that a new segment is not started by a writer until the previous one is finalized on a quorum).

          In the future, we might have a background thread synchronizing earlier log segments to "minority JNs" who were down when they were written, but we already have a guarantee that a quorum has every segment.

          Say a new JN is added or an older JN came back up during restart of the cluster. I think you may achieve quorum without the overlap of a node that was part of the previous quorum write. This could result in loading a stale journal. How do we handle this? Is there a set of JNs that the system was configured/working with?

          The JNs don't auto-format themselves, so if you bring up a new one with no data, or otherwise end up contacting one that wasn't part of the previous quorum, then it won't be able to respond to the newEpoch() call. It will throw a "JournalNotFormattedException".

          As for adding new journals, the process today would be:
          a) shut down HDFS cleanly
          b) rsync one of the JN directories to the new nodes
          c) add new nodes to the qjournal URI
          d) restart HDFS

          As I understand it, this is how new nodes are added to ZooKeeper quorums as well. In the future we might add a feature to help admins with this, but it's a really rare circumstance, so I think it's better to eschew the complexity in the initial release. (ZK itself is only now adding online quorum reconfiguration.)

          The JNs also keep track of the namespace ID and will reject requests from a writer if its nsid doesn't match, which prevents accidental "mixing" of nodes between clusters.

          What is the effect of newEpoch from another writer on a JournalNode that is performing recovery, especially when it is performing AcceptRecovery? It would be good to cover what happens in other states as well.

          Since all of the RPC calls are synchronized, there are no race conditions during the RPC. If a new writer performs newEpoch before acceptRecovery, then the acceptRecovery call will fail. If the new writer performs newEpoch after acceptRecovery, then the new one will get word of the previous writer's recovery proposal when it calls prepareRecovery().

          This part follows Paxos pretty closely, and I didn't want to digress too much into explaining Paxos in the design doc. I'm happy to add an Appendix with a couple of these examples, though, if you think that would be useful.

          In "Prepare Recovery RPC", how does writer use previously accepted recovery proposal?

          Per above, this follows Paxos. If there are previously accepted proposals, then the new writer chooses them preferentially even if there are other segments which might be longer – see section 2.9, point 3.a.

          Does accept recovery wait till journal segments are downloaded? How does the timeout work for this?

          Yep, it downloads the new segment, then atomically renames the segment from its temporary location and records the accepted recovery. The timeout here is the same as "dfs.image.transfer.timeout" (default 60sec). If it times out, then it will throw an exception and not accept the recovery. If the writer performing recovery doesn't succeed on a majority of nodes, then it will fail at this step.
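
          The download-then-swap pattern being described looks roughly like the sketch below (hypothetical paths and class name; the real transfer additionally enforces dfs.image.transfer.timeout and the epoch checks discussed above):

            import java.io.IOException;
            import java.io.InputStream;
            import java.net.URL;
            import java.nio.channels.FileChannel;
            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.nio.file.StandardCopyOption;
            import java.nio.file.StandardOpenOption;

            // Sketch of "download to a temp file, force to disk, atomically rename".
            class SegmentDownloadSketch {
              static void downloadSegment(URL source, Path finalPath) throws IOException {
                Path tmp = finalPath.resolveSibling(finalPath.getFileName() + ".tmp");
                try (InputStream in = source.openStream()) {
                  Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                }
                // Make sure every byte is durable before the file becomes visible.
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                  ch.force(true);
                }
                // Atomic rename: a reader sees either the old segment or the complete new one.
                Files.move(tmp, finalPath, StandardCopyOption.ATOMIC_MOVE);
              }
            }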

          Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of that logger's lastWriterEpoch and the epoch number corresponding to any previously accepted recovery proposal." Can you explain in section 2.10.6 why previously accepted recovery proposal needs to be considered?

          This is necessary in case a writer fails in the middle of recovery. Here's an example, which I'll also add to the design doc:

          Assume a writer has failed, leaving the three JNs at different lengths, as in Example 2.10.2:

          JN    segment                 last txid   acceptedInEpoch   lastWriterEpoch
          JN1   edits_inprogress_101    150         -                 1
          JN2   edits_inprogress_101    153         -                 1
          JN3   edits_inprogress_101    125         -                 1

          Now assume that the first recovery attempt only contacts JN1 and JN3. It decides that length 150 is the correct recovery length, and calls acceptRecovery(150) on JN1 and JN3, followed by finalizeLogSegment(101-150). But it crashes before the finalizeLogSegment call reaches JN1. The state now is:

          JN    segment                 last txid   acceptedInEpoch   lastWriterEpoch
          JN1   edits_inprogress_101    150         2                 1
          JN2   edits_inprogress_101    153         -                 1
          JN3   edits_101-150           150         -                 1

          When a new NN now begins recovery, assume it talks only to JN1 and JN2. If it did not consider acceptedInEpoch, it would incorrectly decide to finalize to txid 153, which would break the invariant that finalized log segments beginning at the same transaction ID must have the same length. Because of Rule 3.b, it will instead choose JN1 again as the recovery source, and properly finalize JN1 and JN2 to txid 150 instead of 153, which matches the now-crashed JN3.

          Section 3 - since a reader can read from any JN, if the JN it is reading from gets disconnected from active, does the reader know about it? How does this work especially in the context of standby namenode?

          Though the SBN can read each segment from any one of the JNs, it actually sends the "getEditLogManifest" call to all of the JNs. Then it takes the results and merges them using RedundantEditLogInputStream. So, if two JNs are up which have a certain segment, then both are available for reading. If the JN it is reading from then crashes in the middle of the read, the SBN can "fail over" to reading from the other JN that has the same segment.
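
          A rough sketch of that manifest merge (hypothetical shapes: each JN's manifest is reduced to the start txids of the segments it holds, and the per-segment source list is what the redundant stream would fail over across):

            import java.util.ArrayList;
            import java.util.List;
            import java.util.Map;
            import java.util.TreeMap;

            // Sketch: for each segment, remember every JN that can serve it, so a
            // read can fail over to another copy if one stream dies mid-segment.
            class ManifestMergeSketch {
              static Map<Long, List<String>> merge(Map<String, List<Long>> manifestsByJn) {
                Map<Long, List<String>> sourcesBySegment = new TreeMap<>();
                for (Map.Entry<String, List<Long>> e : manifestsByJn.entrySet()) {
                  for (Long segmentStartTxId : e.getValue()) {
                    List<String> jns = sourcesBySegment.get(segmentStartTxId);
                    if (jns == null) {
                      jns = new ArrayList<>();
                      sourcesBySegment.put(segmentStartTxId, jns);
                    }
                    jns.add(e.getKey());
                  }
                }
                return sourcesBySegment;
              }
            }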

          Following additional things would be good to cover in the design:

          Cover boot strapping of JournalNode and how it is formatted

          Added a section to the design doc on bootstrap and format

          Section 2.8 "replacing any current copy of the log segment". Need more details here. Is it possible that we delete a segment and due to correlated failures, we lose the journal data in the process. So replacing must perhaps keep the old log segment until the segment recovery completes.

          Can you give a specific example of what you mean here? We don't delete the existing segment except when we are moving a new one on top of it – and the new one has already been determined to be a "valid recovery". The download process via HTTP also uses FileChannel.force() after downloading to be sure that the new file is fully on disk before it is moved into place.

          How is addition, deletion and JN becoming live again from the previous state of dead/very slow handled?

          On each segment roll, the client will again retry writing to all of the JNs, even those that had been marked "out of sync" during the previous log segment. If it's just lagging a bit, then the queueing in the IPCLoggerChannel handles that (it'll start the new segment a bit behind the other nodes, but that's fine). Is there a specific example I can explain that would make this clearer?

          I am still concerned (see my previous comments about epochs using JNs) that a NN that does not hold the ZK lock can still cause service interruption. This could be considered later as an enhancement. This probably is a bigger discussion.

          Yea, I agree this is worth a separate discussion. There's no real way to tie a ZK lock to anything except ZK data - you can always think you have the lock, but by the time you take action, you may no longer have it.

          I saw a couple of whitespace/empty-line changes

          Will take care of these, sorry.

          Also moving some of the documentation around can be done in trunk, or that particular change can be merged to trunk to keep this patch smaller.

          It seems wrong to merge the docs change to trunk when the code it's documenting isn't there yet. Aaron posted some helpful diffs with the docs on HDFS-3926 if you want to review the diff without all the extra diff caused by the moving.

          An additional comment - in the 3092 design, during recovery we had just fence (newEpoch() here) and roll. I am not sure why recovery needs to have so many steps - prepare, accept and roll. Can you please describe what I am missing?

          I think some of the above comments may explain this - in particular the reason why you need the idea of accepting recovery prior to committing it. Otherwise, I'll turn the question on its head: why do you think you can get away with so few steps? Perhaps it's possible in a system that requires every write to go

          Show
          Todd Lipcon added a comment - "Henceforth we will refer to these nodes as replicas." Please use a different term as replicas is heavily used in the context of block replica in HDFS. Perhaps Journal Replicas may be a better name. Fixed "Before taking action in response to any RPC, the JournalNode checks the requester's epoch number against its lastPromisedEpoch variable. If the requester's epoch is lower, then it will reject the request". This is only true for all the RPCs other than newEpoch. Further it should say if the requester's epoch is not equal to lastPromisedEpoch the request is rejected. Fixed. Actually, if any request comes with a higher epoch than lastPromisedEpoch, then the JN accepts the request and also updates lastPromisedEpoch. This allows a JournalNode to join back into the quorum properly even if it was down when the writer became active. In step 3, you mean newEpoch is sent to "JNs" and not QJMs. Rest of the description should also read "JNs" instead of "QJMs". Thanks, fixed. In step 4. "Otherwise, it aborts the attempt to become the active writer." What is the state of QJM after this at the namenode? More details needed. Clarified: Otherwise, it aborts the attempt to become the active writer by throwing an IOException. This will be handled by the NameNode in the same fashion as a failure to write to an NFS mount -- if the QJM is being used as a shared edits volume, it will cause the NN to abort. Section 2.6, bullet 3 - is synchronization on quorum nodes done for only the last segments or all the segments (required for a given fsimage?). Based on the answer, section 2.8 might require updates. It only synchronizes the last log segment. Any earlier segments are already guaranteed to be finalized on a quorum of nodes (either by a postcondition of the recovery process, or by the fact that a new segment is not started by a writer until the previous one is finalized on a quorum). In the future, we might have a background thread synchronizing earlier log segments to "minority JNs" who were down when they were written, but we already have a guarantee that a quorum has every segment. Say a new JN is added or an older JN came backup during restart of the cluster. I think you may achieve quorum without the overlap of a node that was part of previous quorum write. This could result in loading stale journal. How do we handle this? Is set of JNs that the system was configured/working with? The JNs don't auto-format themselves, so if you bring up a new one with no data, or otherwise end up contacting one that wasn't part of the previous quorum, then it won't be able to respond to the newEpoch() call. It will throw a "JournalNotFormattedException". As for adding new journals, the process today would be: a) shut down HDFS cleanly b) rsync one of the JN directories to the new nodes c) add new nodes to the qjournal URI d) restart HDFS As I understand it, this is how new nodes are added to ZooKeeper quorums as well. In the future we might add a feature to help admins with this, but it's a really rare circumstance, so I think it's better to eschew the complexity in the initial release. (ZK itself is only adding online quorum reconfiguration now) The JNs also keep track of the namespace ID and will reject requests from a writer if his nsid doesn't match, which prevents accidental "mixing" of nodes between clusters. What is the effect of newEpoch from another writer on a JournalNode that is performing recovery, especially when it is performing AcceptRecovery? 
It would be good to cover what happens in other states as well. Since all of the RPC calls are synchronized, there are no race conditions during the RPC. If a new writer performs newEpoch before acceptRecovery, then the acceptRecovery call will fail. If the new writer performs newEpoch after acceptRecovery, then the new one will get word of the previous writer's recovery proposal when it calls prepareRecovery(). This part follows Paxos pretty closely, and I didn't want to digress too much into explaining Paxos in the design doc. I'm happy to add an Appendix with a couple of these examples, though, if you think that would be useful. In "Prepare Recovery RPC", how does writer use previously accepted recovery proposal? Per above, this follows Paxos. If there are previously accepted proposals, then the new writer chooses them preferentially even if there are other segments which might be longer – see section 3.9 point 3.a. Does accept recovery wait till journal segments are downloaded? How does the timeout work for this? Yep, it downloads the new segment, then atomically renames the segment from its temporary location and records the accepted recovery. The timeout here is the same as "dfs.image.transfer.timeout" (default 60sec). If it times out, then it will throw an exception and not accept the recovery. If the writer performing recovery doesn't succeed on a majority of nodes, then it will fail at this step. Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of that logger's lastWriterEpoch and the epoch number corresponding to any previously accepted recovery proposal." Can you explain in section 2.10.6 why previously accepted recovery proposal needs to be considered? This is necessary in case a writer fails in the middle of recovery. Here's an example, which I'll also add to the design doc: Assume we have failed with the three JNs at different lengths, as in Example 2.10.2: JN segment last txid acceptedInEpoch lastWriterEpoch JN1 edits_inprogress_101 150 1 JN2 edits_inprogress_101 153 1 JN3 edits_inprogress_101 125 1 Now assume that the first recovery attempt only contacts JN1 and JN3. It decides that length 150 is the correct recovery length, and calls acceptRecovery(150) on JN1 and JN3, followed by {{ finalizeLogSegment(101-150) }}. But, it crashes before the finalizeLogSegment call reaches JN1. The state now is: JN segment last txid acceptedInEpoch lastWriterEpoch JN1 edits_inprogress_101 150 2 1 JN2 edits_inprogress_101 153 1 JN3 edits_101-150 150 1 When a new NN now begins recovery, assume it talks only to JN1 and JN2. If it did not consider acceptedInEpoch , it would incorrectly decide to finalize to txid 153, which would break the invariant that finalized log segments beginning at the same transaction ID must have the same length. Because of Rule 3.b, it will instead choose JN1 again as the recovery master, and properly finalize JN1 and JN2 to txid 150 instead of 153, which match the now-crashed JN3. Section 3 - since a reader can read from any JN, if the JN it is reading from gets disconnected from active, does the reader know about it? How does this work especially in the context of standby namenode? Though the SBN can read each segment from any one of the JNs, it actually sends the "getEditLogManifest" to all of the JNs. Then, it takes the results, and merges them using RedundantEditLogInputStream. So, if two JNs are up which have a certain segment, then both are available for reading. 
If, then, in the middle of the read, it crashes, the SBN can "fail over" to reading from the other JN that had this same segment. Following additional things would be good to cover in the design: Cover boot strapping of JournalNode and how it is formatted Added a section to the design doc on bootstrap and format Section 2.8 "replacing any current copy of the log segment". Need more details here. Is it possible that we delete a segment and due to correlated failures, we lose the journal data in the process. So replacing must perhaps keep the old log segment until the segment recovery completes. Can you give a specific example of what you mean here? We don't delete the existing segment except when we are moving a new one on top of it – and the new one has already been determined to be a "valid recovery". The download process via HTTP also uses FileChannel.force() after downloading to be sure that the new file is fully on disk before it is moved into place. How is addition, deletion and JN becoming live again from the previous state of dead/very slow handled? On each segment roll, the client will again retry writing to all of the JNs, even those that had been marked "out of sync" during the previous log segment. If it's just lagging a bit, then the queueing in the IPCLoggerChannel handles that (it'll start the new segment a bit behind the other nodes, but that's fine). Is there a psecific example I can explain that would make this clearer? I am still concerned (see my previous comments about epochs using JNs) that a NN that does not hold the ZK lock can still cause service interruption. This is could be considered later as an enhancement. This probably is a bigger discussion. Yea, I agree this is worth a separate discussion. There's no real way to tie a ZK lock to anything except for ZK data - you can always think you have the lock, but by the time you take action, not have it anymore. I saw couple of white space/empty line changes Will take care of these, sorry. Also moving some of the documentation around can be done in trunk, or that particular change can be merged to trunk to keep this patch smaller. It seems wrong to merge the docs change to trunk when the code it's documenting isn't there, yet. Aaron posted some helpful diffs with the docs on HDFS-3926 if you want to review the diff without all the extra diff caused by the moving. An additional comment - in 3092 design during recovery we had just fence (newEpoch() here) and roll. I am not sure why recovery needs to have so many steps - prepare, accept and roll. Can you please describe what I am missing? I think some of the above comments may explain this - in particular the reason why you need the idea of accepting recovery prior to committing it. Otherwise, I'll turn the question on its head: why do you think you can get away with so few steps? Perhaps it's possible in a system that requires every write to go
          Suresh Srinivas added a comment -

          An additional comment - in 3092 design during recovery we had just fence (newEpoch() here) and roll. I am not sure why recovery needs to have so many steps - prepare, accept and roll. Can you please describe what I am missing?

          Suresh Srinivas added a comment -

          Finally read through the design.

          Design document comments:

          1. "Henceforth we will refer to these nodes as replicas." Please use a different term as replicas is heavily used in the context of block replica in HDFS. Perhaps Journal Replicas may be a better name.
          2. "Before taking action in response to any RPC, the JournalNode checks the requester's epoch number
            against its lastPromisedEpoch variable. If the requester's epoch is lower, then it will reject the request". This is only true for all the RPCs other than newEpoch. Further, it should say that if the requester's epoch is not equal to lastPromisedEpoch, the request is rejected.
            Ensure
          3. In Generating epoch numbers section
            • In step 3, you mean newEpoch is sent to "JNs" and not QJMs. Rest of the description should also read "JNs" instead of "QJMs".
            • In step 4. "Otherwise, it aborts the attempt to become the active writer." What is the state of QJM after this at the namenode? More details needed.
          4. Section 2.6, bullet 3 - is synchronization on quorum nodes done for only the last segments or all the segments (required for a given fsimage?). Based on the answer, section 2.8 might require updates.
          5. Say a new JN is added or an older JN came back up during a restart of the cluster. I think you may achieve a quorum without the overlap of a node that was part of the previous quorum write. This could result in loading a stale journal. How do we handle this? Is there a defined set of JNs that the system was configured/working with?
          6. What is the effect of newEpoch from another writer on a JournalNode that is performing recovery, especially when it is performing AcceptRecovery? It would be good to cover what happens in other states as well.
          7. In "Prepare Recovery RPC", how does writer use previously accepted recovery proposal?
          8. Does accept recovery wait till journal segments are downloaded? How does the timeout work for this?
          9. Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of that logger's lastWriterEpoch and the epoch number corresponding to any previously accepted recovery proposal." Can you explain in section 2.10.6 why previously accepted recovery proposal needs to be considered?
          10. Section 3 - since a reader can read from any JN, if the JN it is reading from gets disconnected from active, does the reader know about it? How does this work especially in the context of standby namenode?
          11. Following additional things would be good to cover in the design:
            • Cover bootstrapping of the JournalNode and how it is formatted
            • Section 2.8 "replacing any current copy of the log segment". Need more details here. Is it possible that we delete a segment and, due to correlated failures, lose the journal data in the process? So the replacement should perhaps keep the old log segment until segment recovery completes.
            • How are addition, deletion, and a JN becoming live again from a previous dead/very-slow state handled?
          12. I am still concerned (see my previous comments about epochs using JNs) that a NN that does not hold the ZK lock can still cause service interruption. This could be considered later as an enhancement. This probably is a bigger discussion.

          As regards to code changes:

          1. I saw a couple of whitespace/empty-line changes
          2. Also, moving some of the documentation around could be done in trunk, or that particular change could be merged to trunk to keep this patch smaller.

          I will continue with the review of code.

          Todd Lipcon added a comment -

          -1 javac. The applied patch generated 2056 javac compiler warnings (more than the trunk's current 2055 warnings).

          This is due to the addition of a new JSP page, which for some reason always results in a new javac warning.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          The new findbugs warnings are in trunk due to HADOOP-8805.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site:

          All of these failures have been seen elsewhere and do not reproduce for me. None of them should have any interaction with the modified code paths.

          I'll send a status update on this branch to the hdfs-dev list in a few minutes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12545668/hdfs-3077-test-merge.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified test files.

          -1 javac. The applied patch generated 2056 javac compiler warnings (more than the trunk's current 2055 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site:

          org.apache.hadoop.ha.TestZKFailoverController
          org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
          org.apache.hadoop.hdfs.server.datanode.TestBPOfferService
          org.apache.hadoop.hdfs.TestPersistBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3209//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3209//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3209//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3209//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Attaching a merge patch vs trunk to run through the automated QA bot. I'm not yet calling a merge vote - I just want to verify that there aren't any unexpected issues on Jenkins with the current state of the branch.

          Todd Lipcon added a comment -

          Attaching new rev of the design document which is up-to-date with the implementation.

          Todd Lipcon added a comment -

          In your scenario, NN1 has never written to the logs? i.e., it is in the process of becoming active but hasn't actually become active (because it hasn't opened its edit log for write)?

          If this happened, then when NN1 unpaused and established an epoch, that would cause NN2 to get fenced when it next tried to write, yes. But then NN2 would abort, the FC would notice that it disappeared, rescind the ZK lock, and auto-failback to NN1.

          Suresh Srinivas added a comment -

          This code has been committed on the branch for about 2 months, and the relevant patch was first on this JIRA on April 2nd. I think it's a bit late to consider this fundamental of a re-structure now.

          I have not had time to look at the code. But at least, based on my earlier comment, keeping the JournalNode and its protocol independent of this code still seems possible.

          The example I was thinking of was slightly different:
          1. NN1 is active. NN2 is standby
          <long pause>
          2. NN2 detects loss of active and becomes active. It then establishes its epoch.
          3. NN1 unpauses and continues to become primary by establishing a new epoch before realizing it is no longer active.

          Is this possible?

          Todd Lipcon added a comment -

          >> given we already have the journal daemons, it's trivial to generate unique increasing sequence IDs
          > But may still be unnecessary. May be during the code review I might find indeed it is trivial.

          This code has been committed on the branch for about 2 months, and the relevant patch was first on this JIRA on April 2nd. I think it's a bit late to consider this fundamental of a re-structure now.

          In this case, you have the leader/active (to put it loosely) elected at ZK, and then the active has to establish an epoch at the znodes to become primary. Both of these need to complete before an active becomes functional. Given the "two things" that need to happen, is a situation possible where one NN is active at ZK while not the primary at the journal nodes, and the other NN is not active at ZK while it is the primary at the journal nodes?

          No, this is not possible, since NNs don't try to "re-acquire writer status" (i.e. start a new epoch) once they've lost it. So, even if a node thinks it is active, if another node is actually active, the first node will fail the next time it tries to write. This will cause it to abort, regardless of whether ZK has told it to be active or not.

          Since I think it's clearer to explain with a couple examples:

          Example 1: manual failover (simplest case, doesn't depend on ZK at all)

          1. NN1 is active. NN2 is standby.
          2. Admin issues a "failover" command, but for some reason the admin is partitioned from NN1. So, NN1 remains in Active mode, while NN2 also enters active mode.
          3. NN2, upon entering active mode, starts a new epoch on the JournalNodes.
          4. NN1, upon the next time it tries to perform a write, gets back an exception from a quorum of nodes that its epoch is too old. Since it could not logSync() and the shared edits dir is marked "required", it aborts.

          Example 2: automatic failover with ZK and network partitions

          1. NN1 is active. NN2 is standby.
          2. NN1 becomes partitioned from ZooKeeper. Thus, it receives a ZooKeeper "Disconnected" event. Because "Disconnected" is not the same as "Expired", NN1 does not immediately transition to standby. Instead, it stays in its current state (active). Because it can still reach the JNs, it can continue writing.
          3. NN2 is still connected to ZK, and thus sees that NN1's ephemeral node has disappeared (after the ZK session timeout elapses). It then transitions itself to active.
          4. NN2, upon becoming active, starts a new epoch at the JournalNodes. As soon as this happens, NN1 may no longer write, and aborts.

          Note that in both cases, even though NN1 can still reach a quorum of JNs, it doesn't try to start a new epoch after it has been fenced.
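          To make the fencing in step 4 of both examples concrete, here is a minimal sketch of the per-request epoch check on the JournalNode side. The class and method names (EpochGuard, persistPromisedEpoch) are hypothetical stand-ins, not the actual Journal code:

            import java.io.IOException;

            public class EpochGuard {
              private long lastPromisedEpoch; // persisted durably in the real design

              public synchronized void checkWriteRequest(long requesterEpoch) throws IOException {
                if (requesterEpoch < lastPromisedEpoch) {
                  // This is the rejection the fenced NN sees on its next write attempt.
                  throw new IOException("Requester epoch " + requesterEpoch
                      + " is older than the promised epoch " + lastPromisedEpoch);
                }
                if (requesterEpoch > lastPromisedEpoch) {
                  // A newer writer has established an epoch; remember it so any
                  // older writer is rejected from now on.
                  lastPromisedEpoch = requesterEpoch;
                  persistPromisedEpoch(requesterEpoch); // hypothetical durable write
                }
              }

              private void persistPromisedEpoch(long epoch) {
                // e.g. write the value to local disk before acknowledging.
              }
            }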

          Does that address the concern?

          Suresh Srinivas added a comment -

          Thanks for the detailed comments.

          given we already have the journal daemons, it's trivial to generate unique increasing sequence IDs

          But it may still be unnecessary. Maybe during the code review I will find that it is indeed trivial.

          The third thing I'll mention is what I informally call the "two things" problem

          There is always "two things" problem

          In this case, you have leader/active(to loosely to put it) elected at zk and then active has to establish epoch at znodes to become primary. Both of this needs to be complete before an active becomes functional. Given the "two things" that needs to happen, is a situation possible when one NN is active at zk while not the primary at the journal nodes and the other NN is not active at zk while is a primary at journal nodes. How will this be handled? Would this require shutting down/fencing the other NN to prevent it from taking over as primary at the journal nodes?

          Todd Lipcon added a comment -

          Hey Todd, I have not looked at the work in this branch in a while. One thing I wanted to ask you about is, why are we using journal daemons to decide on an epoch? Could zookeeper be used for doing the same? What are the advantages of using journal daemons instead of zk? Adding this information to the document might also be useful.

          Certainly you could use ZK to generate an increasing sequence ID to decide on an epoch. But, given we already have the journal daemons, it's trivial to generate unique increasing sequence IDs without using an external dependency. The protocol is very simple:

          • ask each of the JNs for their highest epoch seen
          • set local epoch to one higher than the highest seen from any JN
          • ask JNs to promise you that epoch

          If you succeed on a quorum, then no one else can successfully achieve a quorum on the same epoch number. If you don't succeed, that means you raced with some other writer. At that point you could either retry or just fail.
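          As a rough illustration of those three steps, here is a minimal sketch only, not the actual QuorumJournalManager/AsyncLoggerSet code: JournalClient and its two methods are hypothetical stand-ins, and the real implementation issues the RPCs in parallel rather than in a loop.

            import java.io.IOException;
            import java.util.List;

            public class EpochNegotiator {
              // Hypothetical stand-in for the per-JN RPC proxy.
              interface JournalClient {
                long getLastPromisedEpoch() throws IOException;
                void newEpoch(long epoch) throws IOException;
              }

              public long createNewUniqueEpoch(List<JournalClient> jns) throws IOException {
                int quorum = jns.size() / 2 + 1;

                // Step 1: ask each JN for the highest epoch it has promised so far.
                long maxSeen = 0;
                int replies = 0;
                for (JournalClient jn : jns) {
                  try {
                    maxSeen = Math.max(maxSeen, jn.getLastPromisedEpoch());
                    replies++;
                  } catch (IOException e) {
                    // Tolerate a minority of failures.
                  }
                }
                if (replies < quorum) {
                  throw new IOException("Could not reach a quorum of JNs");
                }

                // Step 2: propose an epoch strictly greater than anything seen.
                long proposed = maxSeen + 1;

                // Step 3: ask the JNs to promise that epoch; a JN refuses if it has
                // already promised an equal or higher epoch to another writer.
                int promises = 0;
                for (JournalClient jn : jns) {
                  try {
                    jn.newEpoch(proposed);
                    promises++;
                  } catch (IOException e) {
                    // A refusal or an unreachable JN both count as no promise.
                  }
                }
                if (promises < quorum) {
                  throw new IOException("Lost the race for epoch " + proposed);
                }
                return proposed;
              }
            }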

          There is a stress test that verifies that this protocol works correctly - please see TestEpochsAreUnique.

          As for the advantages of not depending on ZooKeeper, my experience working with ZK in the context of the HBase Master has convinced me that it's not a panacea for situations like this. One of the biggest issues we've had in the HBase Master design is loss of synchronization between what is the truth in ZooKeeper vs what the individual participants think is the truth. ZooKeeper's consistency semantics are that different clients, when connected to different nodes in the quorum, may be arbitrarily "behind" in their view of the data. This means that, even if we update an epoch number in ZooKeeper, for example, one of the JNs may not receive the update for some number of seconds, and can continue to accept writes from previous writers. So, we still have to deal with fencing and all of these quorum protocols on our own, and I don't think ZK provides much for us.

          The other advantage of building this as a self-contained system is that it's easier for us to test and debug. For example, the randomized test cases have been set up so that the entire system runs single-threaded and, given a random seed, can reproduce a given set of dropped messages. This would be very hard to implement on top of ZooKeeper where all of the messaging is opaque to our purposes.

          The third thing I'll mention is what I informally call the "two things" problem: when you have some data in ZK, and some data on the JNs, it's possible that the two could get out of sync. For example, if an administrator accidentally reformats ZooKeeper, our fencing guarantees will become screwed up. So, we have to guard against this, add code to re-format ZK safely, etc. Another example situation is to consider what happens when a NN is partitioned from the majority of the ZooKeeper nodes but not partitioned from a majority of the JournalNodes. Should it stop writing? If the other NN can reach a quorum of ZK but not a quorum of JN, should it begin writing? Or should the whole system stop in its tracks? If the whole system stops, then we have introduced an availability dependency on ZooKeeper such that no edits may be made while ZK is down. This is worse off than we are today: we can continue operating while ZK is down (though we can't process a new failover).

          So, to summarize, while I think ZK can reduce complexity for a lot of applications, in this case I prefer the control from "doing it ourselves". We already have to build all of the quorum counting infrastructure, etc., and I don't see what there is to gain from the extra dependency. Hope all of the above makes sense!

          Suresh Srinivas added a comment -

          Hey Todd, I have not looked at the work in this branch in a while. One thing I wanted to ask you about is, why are we using journal daemons to decide on an epoch? Could zookeeper be used for doing the same? What are the advantages of using journal daemons instead of zk? Adding this information to the document might also be useful.

          Chao Shi added a comment -

          I feel that we can throw a special kind of fatal exception rather than an ordinary IOException if any inconsistent states are found (e.g. a JN's epoch > QJM's epoch). A fatal exception means that the QJM must immediately stop its work. This may be caused by mis-configuration or software bugs. Because the journal is so critical to HDFS clusters, we should try our best to detect any possible mistakes/bugs.

          I thought it over again today and found that my example "JN's epoch > QJM's epoch" may be wrong, because it is the normal case that an old writer is fenced. When an InvariantViolatedException is thrown, we expect that someone on call should be paged and go check the cluster immediately. So false alarms would be annoying anyway.

          Chao Shi added a comment -

          Does that make sense?

          Yes, that's exactly what I mean.

          I'll file follow-up JIRAs for the above. Any interest in working on them?

          Of course. I can create some tests trying to break these invariants.

          Todd Lipcon added a comment -

          I feel that we can throw a special kind of fatal exception rather than an ordinary IOException if any inconsistent states are found (e.g. a JN's epoch > QJM's epoch). A fatal exception means that the QJM must immediately stop its work. This may be caused by mis-configuration or software bugs. Because the journal is so critical to HDFS clusters, we should try our best to detect any possible mistakes/bugs.

          Great idea. Perhaps something like InvariantViolatedException? I've been using AssertionError for this purpose up to this point, but a more clear exception, with explicit abort on the client side makes sense. I am absolutely in agreement that we should prioritize correctness over availability, and if we get into an unexpected state that violates assumptions made by the code, it's better to shut down HDFS than lose data.
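          As a sketch of the shape such a check might take (InvariantViolatedException is hypothetical at this point -- it does not exist in the patch yet):

            // Unchecked exception for states that should be impossible; the QJM
            // client would abort rather than retry when it sees one of these.
            public class InvariantViolatedException extends RuntimeException {
              public InvariantViolatedException(String msg) {
                super(msg);
              }

              // Preconditions-style helper, but with a fatal exception type instead
              // of a plain IOException or AssertionError.
              public static void checkInvariant(boolean condition, String msg) {
                if (!condition) {
                  throw new InvariantViolatedException(msg);
                }
              }
            }

          A call site could then read checkInvariant(lastCommittedTxId <= firstTxIdOfBatch, "committed txid moved backwards") (the variable names here are made up), with the client treating the exception as a signal to abort rather than retry.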

          Besides that, I also suggest storing the last seen txid along with the epoch for each JN (maybe periodically), so that the txid never decreases and we have a double check for that. Because the algorithm to sync an unclosed log segment is complex, it would be nice to have such a simple approach to verify it.

          I think here you mean we should periodically store the "last committed txid" rather than the "last seen txid", right? It's possible that one JN will see some edits which are later discarded by the recovery process if the edits didn't reach a quorum of nodes. However, given that we only have one "batch" of edits outstanding at once in the current design, each new journal() RPC acts as an implicit commit for all previous transactions. So we could periodically write down the committed txid as a sanity check.
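          As a rough sketch of that sanity check -- field and method names here are made up, not the actual Journal class:

            import java.io.IOException;

            public class CommittedTxIdTracker {
              private static final long PERSIST_INTERVAL = 10000; // txns between durable writes
              private long committedTxnId = 0;
              private long lastPersisted = 0;

              // Called on each journal() RPC: everything before the first txid of the
              // new batch is implicitly committed, and the value must never decrease.
              public synchronized void onJournal(long firstTxnIdOfBatch) throws IOException {
                long newlyCommitted = firstTxnIdOfBatch - 1;
                if (newlyCommitted < committedTxnId) {
                  throw new IOException("Committed txid moved backwards: "
                      + newlyCommitted + " < " + committedTxnId);
                }
                committedTxnId = newlyCommitted;
                if (committedTxnId - lastPersisted >= PERSIST_INTERVAL) {
                  persist(committedTxnId); // hypothetical durable write
                  lastPersisted = committedTxnId;
                }
              }

              private void persist(long txid) {
                // e.g. write the value to a small file in the JN's storage directory.
              }
            }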

          Does that make sense?

          I'll file follow-up JIRAs for the above. Any interest in working on them?

          Chao Shi added a comment -

          Hi Todd,

          It's really cool that you're implementing a quorum-based journaling mechanism without extra dependencies. I just read the design doc and some pieces of your code. I feel that we can throw a special kind of fatal exception rather than an ordinary IOException if any inconsistent states are found (e.g. a JN's epoch > QJM's epoch). A fatal exception means that the QJM must immediately stop its work. This may be caused by mis-configuration or software bugs. Because the journal is so critical to HDFS clusters, we should try our best to detect any possible mistakes/bugs.

          Besides that, I also suggest storing the last seen txid along with the epoch for each JN (maybe periodically), so that the txid never decreases and we have a double check for that. Because the algorithm to sync an unclosed log segment is complex, it would be nice to have such a simple approach to verify it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12537271/hdfs-3077.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified test files.

          -1 javac. The applied patch generated 2007 javac compiler warnings (more than the trunk's current 2006 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestBackupNode

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2873//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2873//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2873//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12537257/hdfs-3077.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified test files.

          -1 javac. The applied patch generated 2007 javac compiler warnings (more than the trunk's current 2006 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.ha.TestZKFailoverController
          org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2870//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2870//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2870//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          I committed this main patch to an HDFS-3077 branch in SVN.

          I will leave this JIRA open as the "umbrella" for followup subtasks and to track the eventual merge-to-trunk.

          Todd Lipcon added a comment -

          Attached patch addresses ATM's comments above.

          I'm going to create a branch for this and commit it, and work on improvements in followups. I haven't forgotten about trying to support distinct client policies, but I want to work on that in parallel with getting the first client policy (quorums) into QA phase.

          Aaron T. Myers added a comment -

          The latest patch looks good to me. Just a few small nits. +1 to commit to a branch once these are addressed. Please also file follow-up JIRAs for the TODOs that are deliberately being left in this patch.

          1. Style nit - there are a few places where you unnecessarily put method arguments on separate lines, even if the line isn't approaching 80 chars, e.g.:
            +      public GetEditLogManifestResponseProto call() throws IOException {
            +        return getProxy().getEditLogManifest(
            +            journalId,
            +            fromTxnId);
            +      }
            
          2. Instead of System#nanoTime you should use Time#monotonicNow (see the sketch after this list).
          3. s/Journal service/Journal Node/g: "This class is used by the lagging Journal service to retrieve edit file from another Journal service for sync up."
          4. Some odd formatting:
            +      response
            +          .sendError(HttpServletResponse.SC_FORBIDDEN,
            +              "Only Namenode and another Journal service may access this servlet");
            
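          For the Time#monotonicNow point in item 2, a minimal sketch of the intended usage -- measureCall is a made-up helper, while the import assumes the utility class being referred to is org.apache.hadoop.util.Time:

            import org.apache.hadoop.util.Time;

            public class TimingExample {
              // Elapsed time measured against a monotonic clock, in milliseconds,
              // instead of System.nanoTime() arithmetic.
              static long measureCall(Runnable call) {
                long startMs = Time.monotonicNow();
                call.run();
                return Time.monotonicNow() - startMs;
              }
            }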
          Todd Lipcon added a comment -

          TestNNWithQJM failed because the edit log manifest had an out-of-order log segment. The fix is simply to add a sort() call in the code which prepares the edit log manifest, since it's machine-specific what order the directory listing will come back in.
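          A sketch of that kind of fix, assuming a hypothetical RemoteEditLog descriptor with a getStartTxId() accessor (the class names in the actual patch may differ):

            import java.util.Collections;
            import java.util.Comparator;
            import java.util.List;

            public class ManifestSorter {
              // Hypothetical stand-in for the real log-segment descriptor.
              public interface RemoteEditLog {
                long getStartTxId();
              }

              // Sort the manifest by starting transaction ID so the result does not
              // depend on the order the directory listing happens to return.
              public static void sortManifest(List<RemoteEditLog> logs) {
                Collections.sort(logs, new Comparator<RemoteEditLog>() {
                  @Override
                  public int compare(RemoteEditLog a, RemoteEditLog b) {
                    return Long.compare(a.getStartTxId(), b.getStartTxId());
                  }
                });
              }
            }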

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12537236/hdfs-3077.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified test files.

          -1 javac. The applied patch generated 2067 javac compiler warnings (more than the trunk's current 2066 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestDatanodeBlockScanner
          org.apache.hadoop.hdfs.qjournal.TestNNWithQJM

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2869//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2869//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2869//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          I've been testing this on a couple of different clusters locally and it's mostly working well, modulo the existing cases where there are TODOs. There's more work to do, but I'm keeping this patch mostly as-is so it can be reviewed and checked into a branch.

          This new rev is a fairly small delta from the previous one:

          • Fix bin/hdfs script - elif vs if typo in previous rev
          • Remove an empty format() stub in FSEditLog (added in a previous rev, but unused)
          • revert a spurious FileSystem change which should not be in 3077
          • Add TODO about close() hanging when remote side is down (noticed while manual testing)
          • improve TODO AssertionError for empty logs to include the path of the empty log
          • Add a TODO about null segments during recovery, a case which occurred during manual testing
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12536930/hdfs-3077.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified test files.

          -1 javac. The applied patch generated 2067 javac compiler warnings (more than the trunk's current 2066 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2849//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2849//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2849//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          If the following is supposed to be a list of host:port pairs, I suggest we call it something other than "*.edits.dir". Also, if the default is just a path, is it really supposed to be a list of host:port pairs? Or is this comment supposed to be referring to DFS_JOURNALNODE_RPC_ADDRESS_KEY?

          Yea, the comment was misplaced. Fixed up the comments here.

          Could use a class comment and method comments in AsyncLogger.

          Missing an @param comment for AsyncLoggerSet#createNewUniqueEpoch.

          Fixed.

          I think this won't substitute in the correct hostname in a multi-node setup with host-based principal names:...

          Didn't fix this yet, but added a TODO. I'd like to test and fix any security-related bugs in a second pass / follow-up JIRA, since I imagine this won't be the only one.

          In IPCLoggerChannel, I wonder if you also shouldn't ensure that httpPort is not yet set here:

          I ended up renaming getEpochInfo to getJournalStatus and moving the httpPort assignment to happen there, which I think makes more sense. I don't think it has to assert that it's never been set before, since we will probably eventually use this to "reconnect" to a journal after the node has restarted, in which case the http port may have changed if it's ephemeral.

          Is there no need for IPCLoggerChannel to have a way of closing its associated proxy?

          Added a close() function. Findbugs also caught the fact that I wasn't actually assigning the proxy member! Oops. Fixed that, too, and wired it into the QuorumJournalManager.close() function.

          Seems a little odd that JNStorage relies on a few static functions of NNStorage. Is there some better place those functions could live?

          I would have liked to share the code in a better way, but I wasn't able to find a good way. The issue is that JNStorage can't inherit from NNStorage, since the NNStorage class itself has a lot of code related to image storage, not just edits. Making a deeper inheritance hierarchy seemed too messy to handle. So without a much bigger refactor to split NNStorage in two, I figured the duplication of these simple functions was the better route. Does that seem reasonable?

          I don't understand why JNStorage#analyzeStorage locks the storage directory after formatting it. What, if anything, relies on that behavior? Where is it unlocked? Might want to add a comment explaining it.

          Clarified comment.

          Patch needs to be rebased on trunk, e.g. PersistentLong was renamed to PersistentLongFile.

          Done

          This line kind of creeps me out in the constructor of the Journal class. Maybe make a no-args version of Storage#getStorageDir that asserts there's only one dir?

          Done - added Storage.getSingularStorageDir
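
          For illustration only - this is not the patch's code, and the names below are assumptions - a minimal, self-contained sketch of what such a single-directory accessor might look like:

            import java.util.List;

            // Sketch only: a stand-in for the Storage class, which keeps its
            // directories in a list; names here are assumptions, not the patch's code.
            public class SingularStorageDirSketch<D> {
              private final List<D> storageDirs;

              public SingularStorageDirSketch(List<D> storageDirs) {
                this.storageDirs = storageDirs;
              }

              /** Return the one and only storage directory, failing loudly otherwise. */
              public D getSingularStorageDir() {
                if (storageDirs.size() != 1) {
                  throw new IllegalStateException(
                      "Expected exactly one storage directory, found " + storageDirs.size());
                }
                return storageDirs.get(0);
              }
            }

          The point is just to fail fast if the single-directory assumption is ever violated, rather than silently reading index 0.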

          In general this patch seems to be mixing in protobufs in a few places where non-proto classes seem more appropriate, notably in the Journal and JournalNodeRpcServer classes. Perhaps we should create non-proto analogs for these protos and add translator methods?

          Yea, I would like to address that in a follow-up. As the RPCs have been changing, avoiding translators in most places has made it a lot easier to evolve stuff without having to change everything in 6 places.

          This seems really goofy. Just make another non-proto class and use a translator?

          I managed to avoid the goofiness by moving the setHttpPort stuff into the getJournalStatus call at startup.

          I notice that there's a few TODOs left in this patch. It would be useful to know which of these you think need to be fixed before we commit this for real, versus those you'd like to leave in and do as follow-ups.

          If we commit to a branch, I think we can commit with most of these TODOs in place. As-is, it works for the non-HA case, which is a reasonable proof of concept.

          Instead of putting all of these classes in the o.a.h.hdfs.qjournal packages, I recommend you try to separate these out into o.a.h.hdfs.qjournal.client, which implements the NN side of things, and o.a.h.hdfs.qjournal.server, which implements the JN side of things. I think doing so would make it easier to navigate the code.

          Done. I had to make a few things public with the InterfaceAudience.Private annotation, but I agree it is an improvement.

          Could definitely use some method comments in the Journal class.

          Done

          Recommend renaming Journal#journal to something like Journal#logEdits or Journal#writeEdits.

          I kept it the same for consistency with JournalProtocol.

          In JournalNode#getOrCreateJournal, this log message could be more helpful: LOG.info("logDir: " + logDir);

          Seems like all of the timeouts in QuorumJournalManager should be configurable.

          Fixed.
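
          Purely as a hedged example of what "configurable" means here (the key name below is made up for illustration and is not necessarily one the patch introduces), the timeouts would be read from the Configuration rather than hard-coded:

            import org.apache.hadoop.conf.Configuration;

            public class QuorumTimeoutExample {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Hypothetical key name, for illustration only.
                conf.setInt("dfs.qjournal.example.write-txns.timeout.ms", 20000);
                int writeTimeoutMs =
                    conf.getInt("dfs.qjournal.example.write-txns.timeout.ms", 20000);
                System.out.println("quorum write timeout = " + writeTimeoutMs + " ms");
              }
            }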

          I think you already have the config key to address this TODO in QJournalProtocolPB: // TODO: need to add a new principal for loggers

          Fixed. Also see above about addressing security testing and bug fixes as a follow-on JIRA.

          s/BackupNode/JournalNode/g:

          Fixed.

          Use an HTML comment in journalstatus.jsp, instead of Java comments within a code block.

          Could use some more content for the journalstatus.jsp page.

          Would like to address these as follow-on (this code is from the currently committed HDFS-3092 branch)

          A few spots in the tests you catch expected IOEs, but don't verify that you received the IOE you actually expect.

          Got most of these. There's one place where we expect any of a multitude of different errors, but everywhere else I now check the string.

          Really solid tests overall, but how about one that actually works with HA? You currently have a test for two entirely separate NNs, but not one that uses an HA mini cluster.

          I'd like to address actually enabling HA with this JournalManager in a separate JIRA. There's also a bunch of other tests I'd like to add.


          I also made a few other changes I wanted to mention since the last patch ATM reviewed:

          • Added the beginnings of plumbing an md5hash through the synchronization protocol, to make sure we don't accidentally end up copying around corrupt data if an HTTP transfer fails. The md5s aren't yet calculated, but I added the field to the RPC.
          • Removed an attempt at fancier synchronization when syncing logs. I was previously trying to move the actual download of the log from the other host outside of the lock, but I'd rather add that back if it turns out to be necessary.
          • Small addition to ExitUtil in the trunk code, so that test cases can reset the tracked exception. Needed this in order to have the tests pass properly.
          • Addressed the findbugs and test failures from the previous QA bot run. The test failures were mostly just a couple of places where I'd forgotten to make trivial updates after my other changes.

          My next task is to work on improving the tests. I'm hoping to write a randomized test that will trigger all of the existing "TODO" assertions. If a randomized test can hit these corner cases, then it's likely to find other corner cases we didn't think through in the design as well. (Whereas a targeted test would only cover the corner cases we already identified).
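
          As a hedged sketch of what such a randomized test could look like (this is not the actual test; the mini-cluster and assertion details are elided), the key idea is to drive the faults from a logged seed so any failure is reproducible:

            import java.util.Random;

            public class RandomizedRecoveryFaultSketch {
              public static void main(String[] args) {
                long seed = args.length > 0 ? Long.parseLong(args[0])
                                            : System.currentTimeMillis();
                System.out.println("Random seed: " + seed); // log it so failures can be replayed
                Random rand = new Random(seed);
                for (int i = 0; i < 1000; i++) {
                  // Pick which journal node drops which RPC in this iteration.
                  int faultyNode = rand.nextInt(3);
                  int droppedRpc = rand.nextInt(5);
                  runScenario(faultyNode, droppedRpc);
                }
              }

              private static void runScenario(int faultyNode, int droppedRpc) {
                // Placeholder: the real test would start a mini journal cluster,
                // inject the chosen failure, run writer recovery, and assert that
                // the surviving quorum converges on a consistent log.
              }
            }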

          I'm also going to continue to look into merging the journal protocols as mentioned to Suresh above.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12536781/hdfs-3077.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified test files.

          -1 javac. The applied patch generated 2067 javac compiler warnings (more than the trunk's current 2066 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 18 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.ha.TestZKFailoverController
          org.apache.hadoop.io.file.tfile.TestTFileByteArrays
          org.apache.hadoop.io.file.tfile.TestTFileJClassComparatorByteArrays
          org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManagerUnit
          org.apache.hadoop.hdfs.qjournal.client.TestEpochsAreUnique
          org.apache.hadoop.hdfs.qjournal.TestMiniJournalCluster
          org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2841//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2841//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2841//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2841//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          New rev that addresses much of the commentary above. Still more work to do, and a few of ATM's comments haven't yet been addressed, but I want to get a Hudson and findbugs run from the QA environment overnight.

          Todd Lipcon added a comment -

          Thanks, Suresh and Aaron, for your comments. I'm working on updating the patch and doing a bit more cleanup as well. I'll also see what I can do to make the server side a little more generic, if possible. I think it's impossible to share an IPC protocol with the BackupNode, but maybe it's possible to support both client-side policies for the standalone journal use case, as Suresh suggests above. I should have something in a couple of days - I've been moving apartments the last couple of weeks, so I've been a little less productive than usual.

          Suresh Srinivas added a comment - - edited

          What do you mean by "paxos-style"? How does it relate to ZAB?

          Saw the updated design doc covering the paxos-style recovery protocol.

          Suresh Srinivas added a comment -

          Todd, I have not had time to look into the comments or the patch. I will try to get this done in the next few days.

          As I said earlier, keeping JournalProtocol without adding quorum semantics allows writers that have different policies. Perhaps the protocols should be different, and maybe JournalProtocol from 3092 can remain as-is. Again, this is an early thought - I will spend time on this in the next few days.

          Quick comment:

          I disagree with this statement. The commit protocol is strongly intertwined with the way in which the server has to behave. For example, the "new epoch" command needs to provide back certain information about the current state of the journals and previous paxos-style 'accepted' decisions. Trying to shoehorn it into a generic protocol doesn't make much sense to me.

          What do you mean by "paxos-style"? How does it relate to ZAB?

          Aaron T. Myers added a comment -

          I just finished a review of the latest patch. Overall it looks really good. Great test coverage, too.

          Some comments:

          1. If the following is supposed to be a list of host:port pairs, I suggest we call it something other than "*.edits.dir". Also, if the default is just a path, is it really supposed to be a list of host:port pairs? Or is this comment supposed to be referring to DFS_JOURNALNODE_RPC_ADDRESS_KEY?
            +  // This is a comma separated host:port list of addresses hosting the journal service
            +  public static final String  DFS_JOURNALNODE_EDITS_DIR_KEY = "dfs.journalnode.edits.dir";
            +  public static final String  DFS_JOURNALNODE_EDITS_DIR_DEFAULT = "/tmp/hadoop/dfs/journalnode/";
            
          2. Could use a class comment and method comments in AsyncLogger.
          3. Missing an @param comment for AsyncLoggerSet#createNewUniqueEpoch.
          4. I think this won't substitute in the correct hostname in a multi-node setup with host-based principal names:
            +        SecurityUtil.getServerPrincipal(conf
            +            .get(DFSConfigKeys.DFS_JOURNALNODE_USER_NAME_KEY),
            +            NameNode.getAddress(conf).getHostName()) };
            
          5. In IPCLoggerChannel, I wonder if you also shouldn't ensure that httpPort is not yet set here:
            // Fill in HTTP port. TODO: is there a more elegant place to put this?
             httpPort = ret.getHttpPort();
            
          6. Is there no need for IPCLoggerChannel to have a way of closing its associated proxy?
          7. Could use some comments in JNStorage.
          8. Seems a little odd that JNStorage relies on a few static functions of NNStorage. Is there some better place those functions could live?
          9. I don't understand why JNStorage#analyzeStorage locks the storage directory after formatting it. What, if anything, relies on that behavior? Where is it unlocked? Might want to add a comment explaining it.
          10. Patch needs to be rebased on trunk, e.g. PersistentLong was renamed to PersistentLongFile.
          11. This line kind of creeps me out in the constructor of the Journal class. Maybe make a no-args version of Storage#getStorageDir that asserts there's only one dir?
            File currentDir = storage.getStorageDir(0).getCurrentDir();
            
          12. In general this patch seems to be mixing in protobufs in a few places where non-proto classes seem more appropriate, notably in the Journal and JournalNodeRpcServer classes. Perhaps we should create non-proto analogs for these protos and add translator methods?
          13. This seems really goofy. Just make another non-proto class and use a translator?
            // Return the partial builder instead of the proto, since
            
          14. I notice that there's a few TODOs left in this patch. It would be useful to know which of these you think need to be fixed before we commit this for real, versus those you'd like to leave in and do as follow-ups.
          15. Instead of putting all of these classes in the o.a.h.hdfs.qjournal packages, I recommend you try to separate these out into o.a.h.hdfs.qjournal.client, which implements the NN side of things, and o.a.h.hdfs.qjournal.server, which implements the JN side of things. I think doing so would make it easier to navigate the code.
          16. Could definitely use some method comments in the Journal class.
          17. Recommend renaming Journal#journal to something like Journal#logEdits or Journal#writeEdits.
          18. In JournalNode#getOrCreateJournal, this log message could be more helpful: LOG.info("logDir: " + logDir);
          19. Seems like all of the timeouts in QuorumJournalManager should be configurable.
          20. I think you already have the config key to address this TODO in QJournalProtocolPB: // TODO: need to add a new principal for loggers
          21. s/BackupNode/JournalNode/g:
            + * Protocol used to journal edits to a remote node. Currently,
            + * this is used to publish edits from the NameNode to a BackupNode.
            
          22. Use an HTML comment in journalstatus.jsp, instead of Java comments within a code block.
          23. Could use some more content for the journalstatus.jsp page.
          24. A few spots in the tests you catch expected IOEs, but don't verify that you received the IOE you actually expect.
          25. Really solid tests overall, but how about one that actually works with HA? You currently have a test for two entirely separate NNs, but not one that uses an HA mini cluster.
          Todd Lipcon added a comment -

          Updated patch with some improvements:

          • Move some logic back from AsyncLoggerSet into QJM
          • Improve comments, and inline some methods which only had one call-site
          • Separate out timeout constants for the different quorums
          • Add a TODO for another test case
          • Rejigger URL generation for synchronization into the AsyncLogger interface
          • Set NSInfo when creating the logger instead of taking it as a parameter to newEpoch()
          • Add a test case to make sure the paxos behavior works during recovery (aborted recovery, then a new NN does recovery)
          • Clean up logs to use shorter debug strings for protobufs
          Todd Lipcon added a comment -

          The following are the diffs introduced by the HDFS-3092 branch as of revision 1346682:

          todd@todd-w510:~/git/hadoop-common/hadoop-hdfs-project$ git diff --stat origin/trunk..HDFS-3092
           .../hadoop-hdfs/CHANGES.HDFS-3092.txt              |   42 +++
           hadoop-hdfs-project/hadoop-hdfs/pom.xml            |   23 ++
           .../java/org/apache/hadoop/hdfs/DFSConfigKeys.java |   12 +-
           .../src/main/webapps/journal/index.html            |   29 ++
           .../src/main/webapps/journal/journalstatus.jsp     |   42 +++
           .../src/main/webapps/proto-journal-web.xml         |   17 +
           .../main/java/org/apache/hadoop/hdfs/DFSUtil.java  |   58 ++++
           .../java/org/apache/hadoop/hdfs/TestDFSUtil.java   |   41 +++
           .../hdfs/protocol/UnregisteredNodeException.java   |    4 +
           .../hdfs/protocolPB/JournalSyncProtocolPB.java     |   41 +++
           .../JournalSyncProtocolServerSideTranslatorPB.java |   60 ++++
           .../JournalSyncProtocolTranslatorPB.java           |   79 +++++
           .../server/protocol/JournalServiceProtocols.java   |   27 ++
           .../hdfs/server/protocol/JournalSyncProtocol.java  |   58 ++++
           .../src/main/proto/JournalSyncProtocol.proto       |   57 +++
           .../journalservice/GetJournalEditServlet.java      |  177 ++++++++++
           .../hadoop/hdfs/server/journalservice/Journal.java |  130 +++++++
           .../server/journalservice/JournalDiskWriter.java   |   61 ++++
           .../server/journalservice/JournalHttpServer.java   |  172 ++++++++++
           .../server/journalservice/JournalListener.java     |    4 +-
           .../hdfs/server/journalservice/JournalService.java |  359 +++++++++++++++++---
           .../hadoop/hdfs/server/namenode/FSEditLog.java     |   65 ++++-
           .../hadoop/hdfs/server/namenode/FSImage.java       |    4 +-
           .../hdfs/server/namenode/GetImageServlet.java      |   10 +-
           .../hadoop/hdfs/server/namenode/NNStorage.java     |    4 +-
           .../hdfs/server/namenode/TransferFsImage.java      |    4 +-
           .../hadoop/hdfs/server/namenode/JournalSet.java    |   34 ++
           .../org/apache/hadoop/hdfs/MiniDFSCluster.java     |    9 +
           .../hdfs/server/journalservice/TestJournal.java    |   71 ++++
           .../journalservice/TestJournalHttpServer.java      |  311 +++++++++++++++++
           .../server/journalservice/TestJournalService.java  |  150 +++++++--
           31 files changed, 2071 insertions(+), 84 deletions(-)
          

          For each of these, I'll explain how the code was incorporated into the 3077 work or, in the cases where the code was not carried over, explain why it doesn't make sense there.

           hadoop-hdfs-project/hadoop-hdfs/pom.xml            |   23 ++
           .../java/org/apache/hadoop/hdfs/DFSConfigKeys.java |   12 +-
           .../src/main/webapps/journal/index.html            |   29 ++
           .../src/main/webapps/journal/journalstatus.jsp     |   42 +++
           .../src/main/webapps/proto-journal-web.xml         |   17 +
          

          These are carried over, with the exception of the HTTPS-related keys. Trunk has moved away from Kerberized HTTPS and towards SPNEGO for authenticated HTTP.

           .../main/java/org/apache/hadoop/hdfs/DFSUtil.java  |   58 ++++
           .../java/org/apache/hadoop/hdfs/TestDFSUtil.java   |   41 +++
          

          These diffs provided a way to parse the list of journal nodes out of the Configuration object. But it makes more sense to provide this list in the actual URI of the edits logs, for consistency with the existing BKJM implementation. So, the equivalent method now exists as QuorumJournalManager.getLoggerAddresses(URI). Moving it from DFSUtil to QJM also made sense so that all of the new code is self-contained in its own package (per the spirit of JournalManagers being pluggable components, I don't think we should refer to them from the main DFS code).
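
          As a rough illustration of what getLoggerAddresses(URI) does - a sketch under the assumption that the logger list rides in the URI authority as semicolon-separated host:port pairs; the exact format and signature in the patch may differ:

            import java.net.InetSocketAddress;
            import java.net.URI;
            import java.util.ArrayList;
            import java.util.List;

            public class LoggerAddressSketch {
              // e.g. qjournal://jn1:8485;jn2:8485;jn3:8485/myjournal (illustrative format)
              static List<InetSocketAddress> getLoggerAddresses(URI uri) {
                List<InetSocketAddress> addrs = new ArrayList<InetSocketAddress>();
                for (String hostPort : uri.getAuthority().split(";")) {
                  String[] parts = hostPort.split(":");
                  addrs.add(new InetSocketAddress(parts[0], Integer.parseInt(parts[1])));
                }
                return addrs;
              }

              public static void main(String[] args) throws Exception {
                URI uri = new URI("qjournal://jn1:8485;jn2:8485;jn3:8485/myjournal");
                System.out.println(getLoggerAddresses(uri));
              }
            }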

           .../hdfs/protocol/UnregisteredNodeException.java   |    4 +
          

          Given that the NN is configured with a static list of Journal Managers to write to, and that membership is static, there's no need for a registration concept with the JNs – the NN sends RPCs to the JNs, rather than the other way around.

           .../hdfs/protocolPB/JournalSyncProtocolPB.java     |   41 +++
           .../JournalSyncProtocolServerSideTranslatorPB.java |   60 ++++
           .../JournalSyncProtocolTranslatorPB.java           |   79 +++++
           .../server/protocol/JournalServiceProtocols.java   |   27 ++
           .../hdfs/server/protocol/JournalSyncProtocol.java  |   58 ++++
           .../src/main/proto/JournalSyncProtocol.proto       |   57 +++
          

          I merged this protocol in with the other RPC protocol, since it reused the same types anyway. If there's a strong motivation to have this as a separate protocol, I could be convinced, but I think we need to make use of JN-specific items like epoch ID in here, which wouldn't make sense in the context of a NN or other edits storage.

           .../journalservice/GetJournalEditServlet.java      |  177 ++++++++++
          

          This has been moved into the qjournal package. It is otherwise mostly the same. The one salient difference is that it now only accepts the startTxId of the segment to be downloaded. This was necessary because, during the recovery step, the source node for synchronization may finalize its log segment while other nodes are in the process of synchronizing from it. So, we look for either in-progress or finalized logs. I can explain this in further detail if necessary.
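
          A hedged sketch of that lookup rule (the helper below is invented for illustration and just leans on the standard edit-log file naming; the real servlet goes through the storage layer):

            import java.io.File;

            public class SegmentLookupSketch {
              /** Find the finalized or in-progress file for a segment, given only its start txid. */
              static File findSegmentFile(File currentDir, long startTxId) {
                String finalizedPrefix = String.format("edits_%019d-", startTxId);
                String inProgressName = String.format("edits_inprogress_%019d", startTxId);
                File inProgress = null;
                File[] files = currentDir.listFiles();
                if (files == null) {
                  return null;
                }
                for (File f : files) {
                  if (f.getName().startsWith(finalizedPrefix)) {
                    return f;              // prefer the finalized segment if it exists
                  }
                  if (f.getName().equals(inProgressName)) {
                    inProgress = f;
                  }
                }
                return inProgress;          // may be null: no such segment locally
              }
            }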

           .../hadoop/hdfs/server/journalservice/Journal.java |  130 +++++++
           .../server/journalservice/JournalDiskWriter.java   |   61 ++++
          

          These have been merged and moved to the qjournal package, and simplified to work directly with a single FileJournalManager instead of an FSEditLog and NNStorage. The reasoning is that a journal's storage is not the same as a NameNode's storage, nor is it a fully general-purpose wrapper around edit logs. For example, we are not trying to support running a JournalNode which itself logs to a pluggable backend log. Additionally, at this point, we can only correctly support a single directory under a quorum participant, or else there are a lot more edge cases to consider where a node may renege on its promises if its set of directories changes during a restart.

           .../server/journalservice/JournalHttpServer.java   |  172 ++++++++++
          

          This has been moved to the qjournal package and is otherwise mostly the same. The one difference is that I switched to SPNEGO instead of Kerberized SSL, to match trunk.

           .../server/journalservice/JournalListener.java     |    4 +-
          

          The change here in the 3092 branch was just adding a new exception to a method, which didn't turn out to be necessary anymore.

           .../hdfs/server/journalservice/JournalService.java |  359 +++++++++++++++++---
          

          This class is called JournalNode now. We no longer have a state machine here - the recovery process/syncing process is coordinated by the NameNode/client side, and the commit protocol ensures that every segment is always finalized on a majority of nodes. So the state machine isn't necessary to ensure that all the log segments are replicated successfully.
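
          To make "coordinated by the NameNode/client side" concrete, here is a toy sketch of the majority-ack pattern (invented names, plain java.util.concurrent rather than the ListenableFuture-based infrastructure the patch actually uses): issue the same call to every JN and treat the operation as committed once a majority responds successfully.

            import java.util.List;
            import java.util.concurrent.*;

            public class MajorityCallSketch {
              /** Returns true once a majority of the given calls succeed within the timeout. */
              static boolean callWithQuorum(List<Callable<Void>> loggers, long timeoutMs)
                  throws InterruptedException {
                ExecutorService pool = Executors.newFixedThreadPool(loggers.size());
                CompletionService<Void> done = new ExecutorCompletionService<Void>(pool);
                for (Callable<Void> call : loggers) {
                  done.submit(call);
                }
                int needed = loggers.size() / 2 + 1;   // simple majority
                int successes = 0, failures = 0;
                long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
                try {
                  while (successes < needed && failures <= loggers.size() - needed) {
                    Future<Void> f = done.poll(deadline - System.nanoTime(), TimeUnit.NANOSECONDS);
                    if (f == null) {
                      return false;                    // quorum not reached before the timeout
                    }
                    try {
                      f.get();
                      successes++;
                    } catch (ExecutionException e) {
                      failures++;                      // that logger is skipped for this call
                    }
                  }
                  return successes >= needed;
                } finally {
                  pool.shutdownNow();
                }
              }
            }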

           .../hadoop/hdfs/server/namenode/FSEditLog.java     |   65 ++++-
           .../hadoop/hdfs/server/namenode/FSImage.java       |    4 +-
          

          The new functions added here were necessary for the synchronization process above, but not necessary for the recovery process implemented by 3077.

           .../hdfs/server/namenode/GetImageServlet.java      |   10 +-
           .../hadoop/hdfs/server/namenode/NNStorage.java     |    4 +-
           .../hdfs/server/namenode/TransferFsImage.java      |    4 +-
          

          Just changes to make things public – same changes in 3077 patch.

           .../hadoop/hdfs/server/namenode/JournalSet.java    |   34 ++
          

          Since the Journal talks directly to FJM in 3077, the getFinalizedSegments function added here is instead fulfilled by making the existing function FileJournalManager.getLogFiles() public.

           .../org/apache/hadoop/hdfs/MiniDFSCluster.java     |    9 +
          

          The 3077 branch has a MiniJournalCluster which can start/stop/restart JournalNodes. The change here in the 3092 branch appears to be incomplete – it adds functions which are never called.

           .../hdfs/server/journalservice/TestJournal.java    |   71 ++++
           .../journalservice/TestJournalHttpServer.java      |  311 +++++++++++++++++
           .../server/journalservice/TestJournalService.java  |  150 +++++++--
          

          Several of these tests are carried over into similarly named tests in 3077. Others didn't make sense due to changes mentioned above. The test coverage in the patch attached here is comparable to the coverage in 3092, and I'm working on adding a lot more tests currently.

          So, to summarize the key similarities and differences:

          • Most of the HTTP server, RPC server, node wrapper code, JSP pages, and journal HTTP servlet carried over
            • SPNEGO support instead of KSSL
            • Lifecycle code a little different in order to work with MiniJournalCluster
          • The edits "synchronization" code is mostly removed, since the synchronization is now NN-led. If we feel strongly that the JNs should synchronize "old" log segments, instead of just ensuring that every segment is stored on a quorum, we can bring some of this back. But we're already guaranteed that every segment has two replicas by this design.
          • New JNStorage class, since the JN doesn't actually share most of its storage code with the NameNode (e.g. we have no images, no checkpoints, no distributed upgrade coordination, etc.)
          • Journal updated to use the above JNStorage class instead of NNStorage and FSEditLog
          • Not attempting to share the RPC protocol with the BackupNode or NameNode, since we need quorum-specific information like epoch numbers in every RPC, and those don't make sense in the other contexts.
          Todd Lipcon added a comment -

          I'm working on a document which explains the correspondence of the code in HDFS-3092 to the code in this patch. I think this will make it clearer that this indeed does make use of a lot of that work where it makes sense, and explain better where some pieces got dropped along the way due to different client-side semantic requirements. Unless you have specific questions on this code, let's hold off on discussion until I can post this document – hoping it gives us a more constructive way to frame the discussion.

          Todd Lipcon added a comment -

          Quorum is a semantics on the client/writer side and not the server side policy. Hence the protocol for journaling should be generic enough. So lets not call it QJournalProtocol and make it generic, allowing other types of clients/writers.

          I disagree with this statement. The commit protocol is strongly intertwined with the way in which the server has to behave. For example, the "new epoch" command needs to provide back certain information about the current state of the journals and previous paxos-style 'accepted' decisions. Trying to shoehorn it into a generic protocol doesn't make much sense to me.
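
          For readers following along, a toy sketch of the fencing rule being described (invented class and method names, not the patch's code): each JournalNode remembers the highest epoch it has promised, hands back its current state on newEpoch(), and rejects any later call stamped with a stale epoch.

            public class EpochFencingSketch {
              private long promisedEpoch = 0;     // persisted durably in the real design
              private long lastWrittenTxId = 0;

              /** A would-be writer proposes a new, higher epoch. */
              public synchronized long newEpoch(long proposedEpoch) {
                if (proposedEpoch <= promisedEpoch) {
                  throw new IllegalStateException("Epoch " + proposedEpoch
                      + " is not greater than promised epoch " + promisedEpoch);
                }
                promisedEpoch = proposedEpoch;
                // The real response also carries previously accepted recovery decisions;
                // here we just return the last txid this node has written.
                return lastWrittenTxId;
              }

              /** Every subsequent call is stamped with the writer's epoch and checked. */
              public synchronized void journal(long writerEpoch, long txId, byte[] records) {
                if (writerEpoch < promisedEpoch) {
                  throw new IllegalStateException("Fenced: epoch " + writerEpoch
                      + " < promised epoch " + promisedEpoch);
                }
                lastWrittenTxId = txId;
                // ... append records to the local log segment ...
              }
            }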

          If you see some functionality missing in 3092, let's discuss and add it there, instead of copying code and changing it separately

          3092's "log syncing" stuff doesn't fit with the recovery protocol needed for correct operation in a quorum commit setting. 3092's method of the JNs "registering" with the NN doesn't make sense either in this system, since group membership changes are not yet designed for and are quite complex. So it's not a matter of adding functionality to 3092, it's a matter of removing a lot of the functionality which just doesn't fit with this commit protocol.

          Also, 3092 has been developed in the open, in an incremental fashion. I think we should follow this, instead of attaching a big patch from github.

          I made a best effort to do it in the open and incrementally, but didn't get any responses from the community. See HDFS-3188 and HDFS-3189 for example, both of which I posted back in April. I remember in the same discussions you referenced above that you said you'd take a look at these in the spirit of incremental progress. I understand you got busy with other things, but I wasn't going to stop working on the project in the meantime. So, work progressed and now there's a more fully baked implementation here.

          Don't be fooled by the big size of the patch - the majority of the lines of code are essentially boiler-plate – protobuf translators, simple code to start/stop RPC and HTTP servers, etc. I don't think this is unreasonably large to review.

          Suresh Srinivas added a comment -

          Based on previous discussions, HDFS-3092 was decided to be the server side for JournalProtocol and would provide the JournalNode capability. HDFS-3077 would be the client-side implementation of the Quorum-based Journal Writer. In the first phase we decided to go with the quorum-based journal writer and to abandon the BK-style writer proposed in 3092 for the time being. But it left the window open for other types of writers based on the 3092 server side. See - https://issues.apache.org/jira/browse/HDFS-3092?focusedCommentId=13271098&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13271098

          I have not looked at the patch deeply enough. But here are my high level comments:

          1. 3092 added fencing capabilities and epoch semantics to JournalProtocol. I believe adding support for multiple namespaces should be done as an additional change and not all in one big patch. This could be added to 3092 or to trunk post merging 3092.
          2. Quorum is a semantic on the client/writer side, not a server-side policy. Hence the protocol for journaling should be generic enough. So let's not call it QJournalProtocol; make it generic, allowing other types of clients/writers.
          3. Considerable work has gone into HDFS-3092. If you see some functionality missing in 3092, let's discuss and add it there, instead of copying code and changing it separately.

          My preference is to see HDFS-3092 committed as generic journal node functionality, with 3077 enabling the quorum write mechanism. Also, 3092 has been developed in the open, in an incremental fashion. I think we should follow this, instead of attaching a big patch from github.

          Todd Lipcon added a comment -

          This suffers from a TOCTTOU (time of check to time of use) race, since another client can come along and delete the file in between the file.exists() and new FileReader(). So since you'll need to handle the FileNotFoundException, I think you can then drop the file.exists() check.

          The problem is that Java throws FileNotFoundException for permissions errors or IO errors, too. And we don't want to return the default value in those cases. So I'd rather have the TOCTOU race, given that the assumption here is that the only process changing these files is the one calling this code. BTW, that code is from HDFS-3190 - probably best to comment on the JIRA associated with each piece of code.
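
          To illustrate the tradeoff being argued here, a minimal sketch of the pattern might look like the following (names are illustrative, not the actual HDFS-3190 code); the default value is returned only behind an explicit exists() check, so that permission or IO errors still surface as exceptions instead of silently producing the default:

          import java.io.BufferedReader;
          import java.io.File;
          import java.io.FileReader;
          import java.io.IOException;

          public class ReadLongFileSketch {
            // Hedged sketch: return defaultVal only when the file is genuinely absent.
            // Relying on FileNotFoundException alone is not enough, because the JDK also
            // throws it for permission problems, which should not be masked.
            public static long readFile(File file, long defaultVal) throws IOException {
              if (!file.exists()) {
                // TOCTOU window accepted: in this design only one process writes these files.
                return defaultVal;
              }
              BufferedReader br = new BufferedReader(new FileReader(file));
              try {
                String line = br.readLine();
                return (line == null) ? defaultVal : Long.parseLong(line.trim());
              } finally {
                br.close();
              }
            }
          }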

          Andy Isaacson added a comment -
          +  public static long readFile(File file, long defaultVal) throws IOException {
          +    long val = defaultVal;
          +    if (file.exists()) {
          +      BufferedReader br = new BufferedReader(new FileReader(file));
          

          This suffers from a TOCTTOU (time of check to time of use) race, since another client can come along and delete the file in between the file.exists() and new FileReader(). So since you'll need to handle the FileNotFoundException, I think you can then drop the file.exists() check.

          Todd Lipcon added a comment -

          I want to understand how the changes can be reconciled with 3092. Currently BackupNode is being updated to use the JournalService.

          The journal interface as exposed by the quorum-capable Journal Node looks different enough from the BackupNode that I don't see any merit to combining the IPC protocols. It only muddies the interaction, IMO. For example, the QJournalProtocol has the concept of a "journal ID" so that each JournalNode can host journals for multiple namespaces at once, as well as the epoch concept which makes no sense in a BackupNode scenario. If we wanted to extend HDFS to act more like a true quorum-driven system (a la ZooKeeper) where each of the nodes maintains a full namespace as equal peers, we'd need to do more work on the commit protocol (eg adding an explicit "commit" RPC distinct from "journal"). That kind of change hasn't been proposed anywhere that I'm aware of, so I didn't want to complicate this design by considering it.

          At this point I would advocate removing the BackupNode entirely, as I don't know of a single person using it for the last ~2 years since it was introduced. But, that's a separate discussion.

          Once this is done, we were planning to merge 3092 into trunk. How should we proceed to merge 3077 and 3092 to trunk?

          I used a bunch of the HDFS-3092 branch code and design in development of this JIRA, so I would consider it to be "incorporated" into the 3077 branch already. So, I would advocate abandoning the current 3092 branch as a stepping stone (server-side-only) along the way to the full solution (server and client side implementation). Of course I'll make sure that Brandon and Hari are given their due credit as co-authors of this patch.

          Is code review going to be based off of this or code changes into a branch on Apache Hadoop code base?

          I posted the git branch just for reference, since some contributors find it easier to do a git pull rather than manually apply the patches locally for review. But the link above is to the exact same code I've attached to the JIRA. Feel free to review by looking at the patch or at the branch. Would it be helpful for me to make a branch in SVN and push the pre-review patch series there for review instead of the external github? Let me know.

          Suresh Srinivas added a comment -

          I would have liked to use the code exactly as it was, but the differences in design made it too difficult to try to reconcile, and I ended up copy-pasting and modifying rather than patching against that branch.

          Todd, 3092 focused mainly on the server side. Some of the client side we abandoned, given the work in 3077. I want to understand how the changes can be reconciled with 3092. Currently BackupNode is being updated to use the JournalService. Once this is done, we were planning to merge 3092 into trunk. How should we proceed to merge 3077 and 3092 to trunk?

          Andy asked me to post a link to the corresponding github branch: https://github.com/toddlipcon/hadoop-common/commits/qjm-patchseries

          Is code review going to be based off of this or code changes into a branch on Apache Hadoop code base?

          Todd Lipcon added a comment -

          Andy asked me to post a link to the corresponding github branch: https://github.com/toddlipcon/hadoop-common/commits/qjm-patchseries
          I'm also going to try to write up a brief "code tour" of how it might make sense to look through this (in addition to improving the javadoc/comments a bit further in the next rev)

          Todd Lipcon added a comment -

          Here is an initial patch with the implementation of this design. It is not complete, but I'm posting it here as it's already grown large, and I'd like to start the review process while I continue to add test coverage and iron out various TODOs which are littered around the code.

          As it is, the code can be run, and I can successfully start/restart NNs, fail JNs, etc, and it mostly "works as advertised". There are known deficiencies which I'm working on addressing, and these should mostly be marked by TODOs.

          This patch is on top of the following:

          ffcfc55 HDFS-3190. 1: Extract code to atomically write a file containing a long
          025759c HDFS-3571. Add URL support to EditLogFileInputStream
          707a309 HDFS-3572. Clean up init of SPNEGO
          d84516f HDFS-3573. Change instantiation of journal managers to have NSInfo
          f61dc7d HDFS-3574. Fix race in GetImageServlet where file is removed during header-setting
          (and those on top of trunk).

          I did not end up basing this on the HDFS-3092 branch as I originally planned, though there's a bunch of code borrowed from the early work done on that branch by Brandon and Hari. I would have liked to use the code exactly as it was, but the differences in design made it too difficult to try to reconcile, and I ended up copy-pasting and modifying rather than patching against that branch. (for example, all of the RPCs in this design go through an async queue in order to do quorum writes)

          Of course there will be follow-up work to create a test plan, add substantially more tests, add docs, etc. But my hope is that, after review, we can commit this (and the prereq patches) either to trunk or a branch and work from there to fix the remaining work items, test, etc.
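
          For readers unfamiliar with the quorum-write idea mentioned above ("all of the RPCs in this design go through an async queue in order to do quorum writes"), here is a rough, hypothetical sketch of waiting for a majority of in-flight journal RPCs; it is not the patch's actual quorum-call infrastructure:

          import java.io.IOException;
          import java.util.ArrayDeque;
          import java.util.ArrayList;
          import java.util.Deque;
          import java.util.List;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          class QuorumWaitSketch {
            // Wait until a simple majority of asynchronous journal RPCs has succeeded,
            // treating individual failures as "does not count toward the quorum".
            static <T> List<T> waitForMajority(List<Future<T>> calls, long timeoutMs)
                throws IOException, InterruptedException {
              int needed = calls.size() / 2 + 1;
              List<T> successes = new ArrayList<T>();
              Deque<Future<T>> pending = new ArrayDeque<Future<T>>(calls);
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (successes.size() < needed && !pending.isEmpty()
                  && System.currentTimeMillis() < deadline) {
                Future<T> f = pending.poll();
                try {
                  // Briefly poll each outstanding call; a real implementation would use
                  // completion callbacks (e.g. ListenableFuture) instead of polling.
                  successes.add(f.get(10, TimeUnit.MILLISECONDS));
                } catch (TimeoutException te) {
                  pending.addLast(f);   // still in flight, check again later
                } catch (ExecutionException ee) {
                  // this logger failed; it simply doesn't count toward the quorum
                }
              }
              if (successes.size() < needed) {
                throw new IOException("Timed out waiting for a quorum of " + needed
                    + " successful journal responses");
              }
              return successes;
            }
          }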

          Brandon Li added a comment -

          So, the recovery process should complete in far less than 1 second given the transfer time for such a segment would be <~50ms. Hence the chances of a crash during this timeframe are vanishingly small.

          A JN can be inaccessible to another JN for many reasons, such as a network partition. But I agree that a list of URLs in step 2 is good enough for now.

          Todd Lipcon added a comment -

          if the JournalNode that already received a prepare RPC with a higher newEpoch number (possible?) can inform the writer (proposer), the writer can exit earlier in step 2 "Choosing a recovery".

          Yes, it would reject such RPCs. Once it receives newEpoch(N), it won't accept any RPCs from any writer with a lower epoch number (the normal Paxos promise guarantee).

          In step 3 "Accept RPC", I assume the URL that the writer sends to all the JNs is the URL of one JN which responded in step 2. If that JN becomes inaccessible immediately, and thus other JNs can't sync themselves by downloading the finalized segment from that JN, the recovery process could be stuck?

          More or less, yes. It's a potential improvement to actually send a list of URLs here – and include any who already have the correct segment in sync.

          IMO, recovery needs to be tolerant of missing nodes, but I don't think we need it to be tolerant of nodes crashing in the middle of the recovery process – it's OK for the NN to bail out of startup in that case, so long as they don't leave the system in an unrecoverable state. The "retry" would be done by trying again to start the NN. I'll try to address this in the next rev of the document. It would be future work to add a retry loop around the recovery process so that it would be tolerant of this.

          My thinking for the above is that log segments tend to be short – in an HA setup we roll every 2 minutes. So even in a heavily loaded cluster, segments tend to be quite small. I just looked on a 100-node QA cluster here running HA and a constant workload of terasort, gridmix, etc, and the largest edit log is 3.1MB. So, the recovery process should complete in far less than 1 second given the transfer time for such a segment would be <~50ms. Hence the chances of a crash during this timeframe are vanishingly small.

          If it could be stuck in step 3, an alternate way to sync lagging JNs is to let them contact other quorum number JNs to download the finalized segments. Given the requirement "All loggers must finalize the segment to the same length and contents", all the finalized segment with the same name should be identical in all JNs. Therefore, the lagging JN can download it from any other JN as long as that JN has the file.

          Yep, per above I think we can extend to a list of all possible JNs which are known to be up-to-date with the sync txid. But I think it's a good future enhancement rather than a requirement for right now. Does my reasoning make sense?
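
          A hypothetical sketch of the "list of URLs" enhancement discussed here (not part of the attached patch): a lagging JN could try each peer known to hold the finalized segment, instead of depending on a single URL staying reachable during recovery:

          import java.io.IOException;
          import java.io.InputStream;
          import java.net.URL;
          import java.util.List;

          class SegmentFetchSketch {
            // Try each candidate peer in turn; the first reachable one wins.
            static InputStream openFromAnyPeer(List<URL> candidates) throws IOException {
              IOException lastFailure = null;
              for (URL url : candidates) {
                try {
                  return url.openStream();   // first reachable, up-to-date peer wins
                } catch (IOException e) {
                  lastFailure = e;           // fall through and try the next peer
                }
              }
              throw new IOException("No journal node with the finalized segment was reachable",
                  lastFailure);
            }
          }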

          Brandon Li added a comment -

          Hi Todd,
          I just read the new section but not the implementation yet. Please correct me if I am wrong. It looks like this could be a possible improvement: if the JournalNode that already received a prepare RPC with a higher newEpoch number (possible?) can inform the writer (proposer), the writer can exit earlier in step 2 "Choosing a recovery".

          In step 3 "Accept RPC", I assume the URL that the writer sends to all the JNs is the URL of one JN which responded in step 2. If that JN becomes inaccessible immediately, and thus other JNs can't sync themselves by downloading the finalized segment from that JN, the recovery process could be stuck?

          If it could be stuck in step 3, an alternate way to sync lagging JNs is to let them contact other quorum number JNs to download the finalized segments. Given the requirement "All loggers must finalize the segment to the same length and contents", all the finalized segment with the same name should be identical in all JNs. Therefore, the lagging JN can download it from any other JN as long as that JN has the file.

          Todd Lipcon added a comment -

          Attaching new rev of the design doc. This has more detail on the recovery process, in particular.

          I have a prototype of this implemented on which I can successfully run and restart an NN, tolerate loss of a JournalNode, etc. I'm continuing to clean up the code and address some shortcuts/TODOs I took along the way of the prototype, but I hope to start uploading coherent patches for review at some point next week.

          Interested parties can check out a git branch here:
          https://github.com/toddlipcon/hadoop-common/tree/auto-failover-and-qjournal

          Suresh Srinivas added a comment -

          For section 2.5.x, the document posted needs to consider different sets of quorums that become available during recovery. See the newly added appendix to the design in HDFS-3092.

          Suresh Srinivas added a comment -

          so all nodes are now taking part in the quorum. We could optionally at this point have JN3 copy over the edits_1-120 segment from one of the other nodes, but that copy can be asynchronous. It's a repair operation, but given we already have 2 valid replicas, we aren't in any imminent danger of data loss.

          The proposal in HDFS-3092 is to make JN3 part of the quorum only when it has caught up with the other JNs. Having this simplifies some boundary conditions.

          Suresh Srinivas added a comment -

          How can step 3 in section 2.4 be completed independent of quorum? Step 4 indicates that it requires a quorum of nodes to respond successfully to the newEpoch message. Here's an example:

          What I meant was that step 3 completes at each individual JN. Hence the example Hari was giving.

          Todd Lipcon added a comment -

          Hi Bikas. Thanks for bringing up this scenario. I do need to add a section to the doc about failure handling and re-adding failed journals.

          My thinking is that the granularity of "membership" is the log segment. This is similar to what we do on local disks today - when we roll the edit log, we attempt to re-add any disks that previously failed. Similarly, when we start a new log segment, we give all of the JNs a chance to pick back up following along with the quorum.

          To try to map to your example, we'd have the following:
          JN1: writing edits_inprogress_1 (@txn 100)
          JN2: writing edits_inprogress_1 (@txn 100)
          JN3: has been reformatted, comes back online

          At this point, the QJM can try to write txns to all three, but JN3 won't accept transactions because it doesn't have a currently open log segment. Currently it will just reject them. I can imagine a future optimization in which it would return a special exception, and the QJM could notify the NN that it would like to roll ASAP if possible.

          Let's say we write another 20 txns, and then roll logs. On the next startLogSegment call, we'd end up with the following:

          JN1: edits_1-120, edits_inprogress_121
          JN2: edits_1-120, edits_inprogress_121
          JN3: edits_inprogress_121

          so all nodes are now taking part in the quorum. We could optionally at this point have JN3 copy over the edits_1-120 segment from one of the other nodes, but that copy can be asynchronous. It's a repair operation, but given we already have 2 valid replicas, we aren't in any imminent danger of data loss.
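
          A minimal, hypothetical sketch of the JournalNode behavior described above (illustrative names, not the real Journal class): a node with no open segment rejects edits and rejoins the quorum at the next startLogSegment call:

          import java.io.IOException;

          class JournalSegmentSketch {
            private Long curSegmentTxId = null;   // null => no segment open on this JN

            synchronized void startLogSegment(long txid) {
              curSegmentTxId = txid;              // a lagging JN picks back up here
            }

            synchronized void journal(long firstTxnId, byte[] records) throws IOException {
              if (curSegmentTxId == null) {
                // This JN missed the current segment (it was down, or was reformatted);
                // it cannot accept transactions until the writer rolls to a new segment.
                throw new IOException("No log segment open; rejecting transactions");
              }
              // ... append records to the in-progress segment on local disk ...
            }

            synchronized void finalizeLogSegment(long firstTxId, long lastTxId) {
              curSegmentTxId = null;              // segment closed until the next roll
            }
          }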

          Bikas Saha added a comment -

          I have a question around syncing journal nodes and quorum based writes. There will always be a case that a lost journal node comes back up and is syncing its state - the extreme example of which is replacement of a broken journal node with a new node.
          While it is doing this, will it be part of the quorum when a quorum number of writes must succeed?
          Say we have 3 journals with the following txids
          JN1-100, JN2-100, JN3-0 (JN3 just joined)
          Now say some stuff got written to JN2 and JN3 (a quorum commit, with the records for JN1 still in flight in the queue because JN1 is slow)
          JN1-100, JN2-110, JN3-110+syncing_holes
          At this point something terrible happens, and when we recover we can only access JN1 and JN3
          JN1-100, JN3-110+syncing_holes
          At this point, how do we resolve the ground truth about the journal state and edit logs?

          Suresh Srinivas added a comment -

          Suresh seemed to think doing it on a branch would be counter-productive to code sharing

          There is a branch already created for 3092. We could use that.

          Bikas Saha added a comment -

          Nope, the thinking is that all of the new code will be encapsulated by QuorumJournalManager. So, from the NN's perspective, there is only a single edit log. It happens that that edit log is distributed and fault-tolerant underneath, but the NN would see it as a single "required" journal, and crash if it fails to sync.

          Got it. So local edits and remote edits would be replaced by a single qjournaledits.

          Todd Lipcon added a comment -

          I think it will help clarify the doc, if you add the explanation for Hari's example. Even though epoch 2 is persisted on JN1, its last log segment is still tied to epoch 1 and it needs to sync its last log segment with JN2/JN3. Are you proposing that JN1 drop its last edits in progress and pick up the corresponding finalized segment from JN1/JN2. Or is it TBD?

          Yes, I think it would see that its copy of the segment is "out of date" epoch-wise, delete it, and then copy the finalized segments from the other nodes later. I'll try to expand upon this portion of the doc in the coming days.

          I also have another idea which may be slightly simpler – Suresh got me thinking about it a bit. Basically the idea is that, instead of deleting empty edit logs, we could "fill them in" with a single NOOP transaction. Let me think on this for a little while and then update the design doc if it turns out to work.

          Btw, there is some new code here but there seems to be some code in existing NN that changes the sequential journal sync to parallel (based on reading your doc and not your patch).

          Nope, the thinking is that all of the new code will be encapsulated by QuorumJournalManager. So, from the NN's perspective, there is only a single edit log. It happens that that edit log is distributed and fault-tolerant underneath, but the NN would see it as a single "required" journal, and crash if it fails to sync.

          Are you planning on committing this to a branch or directly to trunk?

          I'm happy to do either. Suresh seemed to think doing it on a branch would be counter-productive to code sharing. In practice it's almost new code, so as long as we're clear to mark it "in-progress" or "experimental", I don't think it would be destabilizing to do in trunk. HDFS-3190 is the one place in which I've modified NN code, but only trivially.

          Bikas Saha added a comment -

          Nice doc! Greatly sped up understanding the design instead of having to grok it from the patch

          I think it will help clarify the doc, if you add the explanation for Hari's example. Even though epoch 2 is persisted on JN1, its last log segment is still tied to epoch 1 and it needs to sync its last log segment with JN2/JN3. Are you proposing that JN1 drop its last edits in progress and pick up the corresponding finalized segment from JN1/JN2. Or is it TBD?

          Btw, there is some new code here but there seems to be some code in existing NN that changes the sequential journal sync to parallel (based on reading your doc and not your patch). I am guessing there will be other significant changes going forward. Are you planning on committing this to a branch or directly to trunk?

          Suresh Srinivas added a comment -

          I prefer JournalNode because every other daemon we have is a *Node. If you're running it inside another process, I think we would just call it a "JournalService" – or an "embedded JournalNode". I think of a daemon as a standalone process.

          I think that is fine. I have an initial JournalService implemented as part of 3092. We will consolidate this part of the code from your patch in HDFS-3178.

          OK. This part I have done in the patch attached here and works pretty well, so far. If you want, I'm happy to separate out the quorum completion code to commit it ASAP so we can share code here.

          This sounds good.

          I think the "standalone" nature of the approach outweighs what benefit we might get by reusing ZK.

          We can look into this in more detail. However, we will add a method called fence() to the JournalProtocol, with an epoch number.

          I will get back to you on the last comment.

          Todd Lipcon added a comment -

          Terminology - JournalDaemon or JournalNode. I prefer JournalDaemon because my plan was to run them in the same process space as the namenode. A JournalDaemon could also be a stand-alone process.

          I prefer JournalNode because every other daemon we have is a *Node. If you're running it inside another process, I think we would just call it a "JournalService" – or an "embedded JournalNode". I think of a daemon as a standalone process.

          I like the idea of quorum writes and maintaining the queue. The 3092 design currently uses a timeout to declare a JD slow and fail it. We were planning to punt on it until we had a first implementation.

          OK. This part I have done in the patch attached here and works pretty well, so far. If you want, I'm happy to separate out the quorum completion code to commit it ASAP so we can share code here.

          newEpoch() is called fence() in HDFS-3092. My preference is to use the name fence(). I was using version # which is called epoch. I think the name epoch sounds better. The key difference is that version # is generated from znode in HDFS-3092.

          As I had commented earlier on this ticket, I originally was planning to do something similar to you, bootstrapping off of ZK to generate epoch numbers. But then, when I got into coding, I realized that this algorithm is actually not so hard to implement, and adding a dependency on ZK actually adds to the combinatorics of things to think about. I think the "standalone" nature of the approach outweighs what benefit we might get by reusing ZK.

          So two namenodes cannot use the same epoch number. I think there is a bug with the approach you have described, stemming from the fact that two namenodes can use the same epoch and step 3 in 2.4 can be completed independent of quorum. This is shown in Hari's example.

          How can step 3 in section 2.4 be completed independent of quorum? Step 4 indicates that it requires a quorum of nodes to respond successfully to the newEpoch message. Here's an example:

          Initial state:

          Node | lastPromisedEpoch
          JN1  | 1
          JN2  | 1
          JN3  | 1

          1. Two NNs (NN1 and NN2) enter step 1 concurrently. They both receive responses indicating lastPromisedEpoch==1 from all of the JNs.
          2. They both propose newEpoch(2). The behavior of the JN ensures that it will only respond success to either NN1 or NN2, but not both (since it will fail if the proposedEpoch <= lastPromisedEpoch)
          So, either NN1 or NN2 gets success from a majority. The other node will only get success from a minority, and thus will abort.

          Note that with message losses or failures, it's possible for neither of the nodes to get a quorum in the case of a race. That's OK, since we expect that an external leader election framework will eventually assist such that only one NN is trying to become active, and then that NN will win.

          Note that the epoch algorithm is cribbed from ZAB, see page 7 of Yahoo tech report YL-2010-0007. The mapping from ZAB terminology is:

          ZAB term     | QJournal term
          CEPOCH(e)    | Response to getLastPromisedEpoch()
          NEWEPOCH(e') | newEpoch(proposedEpoch)
          ACK-E(...)   | success response to newEpoch()
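
          A minimal sketch of the promise rule walked through above (illustrative names, not the actual QJournalProtocol code): a JN accepts newEpoch only when the proposed epoch is strictly greater than anything it has already promised, so at most one of two racing writers can collect a majority for the same epoch:

          import java.io.IOException;

          class EpochPromiseSketch {
            private long lastPromisedEpoch = 0;   // persisted durably in a real JN

            synchronized long getLastPromisedEpoch() {
              return lastPromisedEpoch;
            }

            synchronized void newEpoch(long proposedEpoch) throws IOException {
              if (proposedEpoch <= lastPromisedEpoch) {
                throw new IOException("Proposed epoch " + proposedEpoch
                    + " <= last promised epoch " + lastPromisedEpoch);
              }
              lastPromisedEpoch = proposedEpoch;
              // From here on, any RPC carrying a lower epoch is rejected
              // (the Paxos/ZAB-style promise guarantee).
            }
          }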
          Todd Lipcon added a comment -

          So the current state is: the epoch number is 2 on all the journals, and J1, J2, and J3 are all at txid 153. We have a problem since it is not possible to distinguish between the log entries in J1 vs. J2 and J3.

          Hey Hari. Thanks for taking a look in such good detail.

          I think the doc is currently unclear about the proposed solution described in 2.5.6 – the idea is not to use just the "lastPromisedEpoch" here to distinguish the JNs, but rather to attach the epoch number to each log segment, based on the epoch in which that segment was started. So, even though in your scenario NN1 sets J1.lastPromisedEpoch=2, the log segment will retain e=1. Once a segment's epoch is set, it is never changed (unless the segment is removed by a synchronization)

          Does that make sense? If so I will try to clarify the document.
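
          A hypothetical sketch of the "epoch recorded per segment" idea (illustrative names only, not the patch's actual code): the segment keeps the epoch in which it was started, and a later promise does not retag it:

          class SegmentEpochSketch {
            static final class Segment {
              final long firstTxId;
              final long startedInEpoch;   // fixed at startLogSegment time, never changed
              Segment(long firstTxId, long epoch) {
                this.firstTxId = firstTxId;
                this.startedInEpoch = epoch;
              }
            }

            private long lastPromisedEpoch;
            private Segment curSegment;

            void startLogSegment(long txid, long writerEpoch) {
              // The segment is tagged with the writer's epoch once, at creation.
              curSegment = new Segment(txid, writerEpoch);
            }

            void newEpoch(long proposedEpoch) {
              // Promising a newer epoch does NOT retag the in-progress segment; during
              // recovery, curSegment.startedInEpoch still identifies which writer's
              // data it contains.
              lastPromisedEpoch = proposedEpoch;
            }
          }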

          Suresh Srinivas added a comment -

          Thanks for posting the design. Now I understand your comment that there is a lot of common things between this one and the approach in HDFS-3092. Here are some high level comments:

          1. Terminology - JournalDaemon or JournalNode. I prefer JournalDaemon because my plan was to run them in the same process space as the namenode. A JournalDaemon could also be a stand-alone process.
          2. I like the idea of quorum writes and maintaining the queue. The 3092 design currently uses a timeout to declare a JD slow and fail it. We were planning to punt on it until we had a first implementation.
          3. newEpoch() is called fence() in HDFS-3092. My preference is to use the name fence(). I was using version # which is called epoch. I think the name epoch sounds better. The key difference is that version # is generated from znode in HDFS-3092. So two namenodes cannot use the same epoch number. I think there is a bug with the approach you have described, stemming from the fact that two namenodes can use the same epoch and step 3 in 2.4 can be completed independent of quorum. This is shown in Hari's example.
          4. I prefer to record the epoch in the startLogSegment filler record. The startLogSegment record was never part of the journal; we added it for structural reasons. So adding epoch info to it should not matter. The way I see it is: a journal belongs to a segment, and a segment has a single version # or epoch.
          5. In both proposals, the epoch or version # needs to be sent in all journal requests.

          We could certainly make a list of common work items and create jiras, so that many people can collaborate and wrap it up, like we did in HDFS-1623.

          Hari Mankude added a comment -

          Todd,

          The doc is excellent. I have a comment on a potential issue that could arise from the epoch number under certain failure scenarios. Specifically, I am talking about the scenario in section 2.5.6.

          J1 is at txid 153, J2 is at txid 150, and J3 is at txid 125. The epoch number on all the journals is 1. Both NN1 and NN2 are trying to become_active() at the same time. NN1 talks to J1 and J2 and sets the proposedEpoch to 2. NN2 talks to J2 and J3 and decides to set the proposedEpoch to 2.

          NN1 succeeds in setting newEpoch to 2 on J1 and fails on J2 and J3. NN1 dies since it does not have a quorum.
          NN2 succeeds in setting newEpoch to 2 on J2 and J3 and has a quorum. NN2 cannot talk to J1. Similar to the scenario in 2.5.6, NN2 writes 151, 152, 153 into J2 and J3 and then dies.

          So now the state is: the epoch number is 2 on all the journals, and J1, J2, and J3 are all at txid 153. We have a problem since it is not possible to distinguish between the log entries in J1 and those in J2 and J3.

          Todd Lipcon added a comment -

          Attached a design doc draft. Look forward to your comments.

          Todd Lipcon added a comment -

          erg, sorry: I meant HDFS-2185

          Tsz Wo Nicholas Sze added a comment -

          HDFS-3185 does not exist. Wrong number?

          Todd Lipcon added a comment -

          Todd, where is your design doc? It has been three weeks since your comment.

          I used up my design doc writing credits on the HDFS-3185 doc last week. Will write one here ASAP, sorry for the delay.

          Tsz Wo Nicholas Sze added a comment -

          > I plan to post a more thorough design doc in the next week or two, but ...

          Todd, where is your design doc? It has been three weeks since your comment.

          Todd Lipcon added a comment -

          Here's a WIP patch. The recovery synchronization bits (admittedly the trickier parts) are still in-flight.

          Suresh Srinivas added a comment -

          I have posted a design document to HDFS-3092. The solution uses multiple journal daemons and uses ZooKeeper for co-ordination. I believe it is much simpler than this proposal. Interested folks, please take a look at the document and provide your comments.

          Flavio Junqueira added a comment -

          True, though BK takes effort to interleave all of the ledgers into a single sequential stream, while writing an index file to allow de-interleaving upon read.

          We originally did this interleaving because performance was affected even with just a handful of log files (ledgers). The solution we currently have applies to anywhere from a few up to tens of thousands of ledgers.

          If you're concerned about read performance, we have not observed any significant reduction. As for de-interleaving, we don't rearrange anything, if that's what you have in mind.

          Todd Lipcon added a comment -

          BK also has a notion of Ledger to support multiple clients - we are re-inventing a fair part of BK. One of the arguments was that BK is a general system while the JD solution has only one client - the NN.

          True, though BK takes effort to interleave all of the ledgers into a single sequential stream, while writing an index file to allow de-interleaving upon read. This is to support hundreds or thousands of concurrent WALs. In contrast, I think even large HDFS installations would only run a few federated NNs with the current design. So it's a much simpler problem, IMO.

          Once our design is fleshed out, we should compare with BK in an objective way.

          Absolutely.

          Sanjay Radia added a comment -

          > I was planning to have the journal daemons store the logs in a directory per namespace ID.
          BK also has a notion of Ledger to support multiple clients - we are re-inventing a fair part of BK. One of the arguments was that BK is a general system while the JD solution has only one client - the NN. As soon as we support multiple NNs, the JournalDaemon solution is effectively becoming more general.

          Once our design is fleshed out, we should compare with BK in an objective way.

          Todd Lipcon added a comment -

          Makes sense. However, there will be many more Datanodes than metadata nodes, so finding new candidates to participate in a quorum protocol as others are lost or decommissioned would be less challenging given that larger pool.

          True. In the initial implementation, though, I don't plan to support online reconfiguration of the quorum participants. It would be a nice enhancement in the future.

          For each federated HDFS volume will we need 3 metadata nodes?

          I was planning to have the journal daemons store the logs in a directory per namespace ID. So, one set of nodes could handle the logs for several NNs (obviously at the expense of performance if there are lots and lots of them).

          Andrew Purtell added a comment -

          Re: JournalDaemons or Bookies on Datanodes (Slave nodes) vs "Master" nodes

          Makes sense. However, there will be many more Datanodes than metadata nodes, so finding new candidates to participate in a quorum protocol as others are lost or decommissioned would be less challenging given that larger pool. For each federated HDFS volume will we need 3 metadata nodes?

          Todd Lipcon added a comment -

          While this work will combine nicely with HDFS-3042 (auto failover) it is a separate project. A quorum-based edit log storage mechanism is useful even for manual failover – or even for non-HA environments where you want remote copies of the edit logs without deploying NFS.

          Bikas Saha added a comment -

          You might want to switch this jira to HDFS-3042 and take the discussion off trunk.

          Todd Lipcon added a comment -

          Thanks for your comments, Sanjay. I agree on all points above.

          Sorry not to have reported progress on this - I've been spending my time primarily on HDFS-3042 for the past two weeks. I hope to make more progress on this soon. The current status is that I have implemented the basic quorum protocol for writes, but I am still implementing recovery after a switch of the active node.
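          Since the basic quorum write protocol is mentioned as implemented, here is a minimal, hedged sketch of the "send the edit to every logger and wait for a majority of acks" idea, using plain java.util.concurrent. The class and method names are hypothetical and this is not the actual patch code; it only illustrates counting acks until a majority is reached or becomes impossible.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only (hypothetical names): issue the same edit to every journal
// daemon and block until a majority acks, or fail once a majority is impossible.
class QuorumWriterSketch {
  private final ExecutorService pool = Executors.newCachedThreadPool();

  boolean sendEditToQuorum(List<Callable<Void>> loggerCalls)
      throws InterruptedException {
    CompletionService<Void> cs = new ExecutorCompletionService<Void>(pool);
    for (Callable<Void> call : loggerCalls) {
      cs.submit(call);                  // one RPC per journal daemon
    }
    int majority = loggerCalls.size() / 2 + 1;
    int acks = 0;
    int failures = 0;
    while (acks < majority && failures <= loggerCalls.size() - majority) {
      try {
        cs.take().get();                // next response, in completion order
        acks++;
      } catch (ExecutionException e) {
        failures++;                     // that logger missed this edit
      }
    }
    return acks >= majority;            // a slow minority never blocks the writer
  }
}
```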

          Sanjay Radia added a comment -

          Also one more thing

          • The master nodes have a spare disk that can be dedicated to JournalDaemon or Bookie, while a datanode does not have a spare disk to dedicate.
          Sanjay Radia added a comment -

          JournalDaemons or Bookies on Datanodes (Slave nodes) vs "Master" nodes

          • Slave nodes such as Datanodes are decommissioned for various reasons. This is handled automatically and is
            a simple process, which keeps the operation of Hadoop simple. Hence, if we add a Journal Daemon (or thread) or Bookie to the slave nodes, it makes the system harder to manage.
          • It is preferable to run the Journal Daemons or Bookies on the master nodes - NN, ZK, JT etc.
          Todd Lipcon added a comment -

          The daemons can stop accepting writes when they realize that the active lock is no longer held by the writer. Clearly an advantage of an active daemon compared to using passive storage.

          Relying on ZK here is insufficient - the actual protocol itself needs fencing to guarantee that a quorum of loggers have seen the "lost lock" before the new writer starts writing.

          I agree with your later comments that rolling the edits is a helpful construct here, but you need to also make sure there's consensus on the "active writer" when beginning a new log segment.

          I'm about halfway done with a prototype implementation of this; I should have something to show by the middle of next week. At that point I'll also post a more thorough explanation of the design.
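          To make the fencing point concrete, here is a minimal sketch (hypothetical names, in-memory only) of the check a logger daemon could perform: accept a newEpoch() only if it is strictly greater than anything promised so far, and reject any journal or startLogSegment request stamped with an older epoch. A real implementation would also persist the promise durably before acknowledging it.

```java
// Sketch only: per-request epoch check on a journal daemon. Hypothetical
// names; the real protocol also persists lastPromisedEpoch to disk.
class JournalFenceSketch {
  private long lastPromisedEpoch = 0;

  // newEpoch()/fence(): promise to ignore writers with a smaller epoch.
  synchronized long newEpoch(long proposedEpoch) {
    if (proposedEpoch <= lastPromisedEpoch) {
      throw new IllegalStateException("Proposed epoch " + proposedEpoch
          + " is not greater than promised epoch " + lastPromisedEpoch);
    }
    lastPromisedEpoch = proposedEpoch;
    return lastPromisedEpoch;
  }

  // Every journal/startLogSegment RPC carries the writer's epoch.
  synchronized void checkWriteAllowed(long writerEpoch) {
    if (writerEpoch < lastPromisedEpoch) {
      throw new IllegalStateException("Fenced: writer epoch " + writerEpoch
          + " is older than promised epoch " + lastPromisedEpoch);
    }
  }
}
```

          Once a quorum of loggers has made this promise to the new writer, the old active can no longer get a majority of its writes accepted, which is the guarantee described above.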

          Suresh Srinivas added a comment -

          but like Einstein said, no simpler!

          It's all relative

          BTW, it would be good to write up a design for this. That avoids lengthy comments and keeps a summary of what is proposed in one place, instead of scattering it across multiple comments.

          This is mostly great – so long as you have an external fencing strategy which prevents the old active from attempting to continue to write after the new active is trying to read.

          External fencing is not needed, given that the active daemons have the ability to fence.

          it gets the loggers to promise not to accept edits from the old active

          The daemons can stop accepting writes when they realize that the active lock is no longer held by the writer. Clearly an advantage of an active daemon compared to using passive storage.

          But, we still have one more problem: given some txid N, we might have multiple actives that have tried to write the same transaction ID. Example scenario:

          The case of writes making it through only some daemons can also be solved. The writes that have made it through W daemons win. The others are marked as not in sync and need to sync up. Explanation to follow.

          The solution we are building is specific to namenode editlogs. There is only one active writer (as Ivan brought up earlier). Here is the outline I am thinking of.

          Let's start with the steady state with K of N journal daemons. When a journal daemon fails, we roll the edits. When a journal daemon joins, we roll the edits. A new journal daemon could start syncing the other finalized edits, while keeping track of edits in progress. We also keep track of the list of active daemons in ZooKeeper. Rolling gives a logical point for a newly joined daemon to sync up (sort of like a generation stamp).
          During failover, the new active gets, from the actively written journals, the point up to which it has to sync. It then also rolls the edits at that point. Rolling also gives you a way to discard, during failover, extra journal records that made it to fewer than W daemons. When there are overlapping records, say e1-105 and e100-200, you read 100-105 from the second edit log and discard that range from the first edit log.

          Again, there are scenarios missing here. I plan to post more details in a design doc.
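          As an illustration of the overlapping-records rule in the outline above (read the overlap from the later edit log and discard it from the earlier one), here is a small, hedged sketch. The types and method names are hypothetical and this is not code from either proposal; it only computes which txid range to read from which segment when finalized segments overlap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only (hypothetical types): given overlapping finalized segments such
// as [1,105] and [100,200], read 1-99 from the first and 100-200 from the second.
class SegmentOverlapSketch {
  static final class Range {
    final long firstTxId;
    final long lastTxId;
    Range(long firstTxId, long lastTxId) {
      this.firstTxId = firstTxId;
      this.lastTxId = lastTxId;
    }
    @Override public String toString() {
      return "[" + firstTxId + "," + lastTxId + "]";
    }
  }

  // Segments must be ordered by the roll in which they were written;
  // on overlap, the later segment wins the shared txids.
  static List<Range> readPlan(List<Range> segmentsInWriteOrder) {
    List<Range> plan = new ArrayList<Range>();
    for (int i = 0; i < segmentsInWriteOrder.size(); i++) {
      Range cur = segmentsInWriteOrder.get(i);
      long end = cur.lastTxId;
      if (i + 1 < segmentsInWriteOrder.size()) {
        // Stop just before the next (newer) segment begins.
        end = Math.min(end, segmentsInWriteOrder.get(i + 1).firstTxId - 1);
      }
      if (end >= cur.firstTxId) {
        plan.add(new Range(cur.firstTxId, end));
      }
    }
    return plan;
  }

  public static void main(String[] args) {
    // Prints [[1,99], [100,200]]
    System.out.println(readPlan(Arrays.asList(
        new Range(1, 105), new Range(100, 200))));
  }
}
```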

          Todd Lipcon added a comment -

          Hey Andrew, thanks for the ops perspective.

          The idea of embedding these logger daemons inside others is definitely something I'm considering. Embedding in DNs is one idea – the other direction is to actually have a quorum of NNs, so that when an edit is logged, it is also applied to the SBN's namespace. But for simplicity on a first cut, I think the plan is to go with external processes and then figure out where best to embed them.

          Andrew Purtell added a comment -

          From a user perspective.

          [Todd] I think a quorum commit is vastly superior for HA, especially given we'd like to collocate the log replicas on machines doing other work. When those machines have latency hiccups, or crash, we don't want the active NN to have to wait for long timeout periods before continuing.

          I think this is a promising direction. See next:

          [Eli] BK has two of the same main issues that we have depending on an HA filer: (1) many users don't want to admin a separate storage system (even if you "embed" BK it will be discrete, fail independently etc)

          Perhaps we can go so far as to suggest the loggers be an additional thread added to the DataNodes. Perhaps some subset of the DN pool is elected for the purpose. (Need we waste a whole disk just for the transaction log? Maybe the log can be shared with DN storage. Using an SSD device for this purpose seems reasonable, but the average user should not be expected to have nodes with such on hand.) On the one hand, this would increase the internal complexity of the DataNode implementation, even if the functionality can be pretty well partitioned – separate package, separate thread, etc. On the other hand, there would not be yet another moving part to consider when deploying components around the cluster: ZooKeeper quorum peers, NameNodes, DataNodes, the YARN AM, the YARN NMs, HBase Masters, HBase RegionServers, etc.

          This idea may go too far, but IMHO embedding BookKeeper goes enough in the other direction to give me heartburn thinking about HA cluster ops.

          Flavio Junqueira added a comment -

          Eli, we support having multiple options for writing and recovering edit logs. In fact, Ivan was a major contributor to the pluggable interface; check HDFS-1580.

          Todd, given the similarity, you could depart from where we are instead of starting from scratch, but sure, I respect your choice. Btw, BK does not implement multi-Paxos; Zab is closer to multi-Paxos.

          Todd Lipcon added a comment -

          >>We have a single writer, except for when we don't. During a failover, without a STONITH capability,
          >Without some sort of fencing, you're going to have to run agreement on every update. If this is acceptable, you could have just made the namenode a thin RPC layer on top of zookeeper, and you get fault tolerance for free.

          Yea, as described in my comment this morning, there is a fencing operation built into the logger daemons. Same as BK. So you only need a consensus about recovery. It's the same thing as BK, similar to multi-paxos, etc – the steady state is fast and you pay costs at leader switchover.

          Ivan Kelly added a comment -

          We have a single writer, except for when we don't. During a failover, without a STONITH capability,

          Without some sort of fencing, you're going to have to run agreement on every update. If this is acceptable, you could have just made the namenode a thin RPC layer on top of zookeeper, and you get fault tolerance for free.

          Benjamin Reed added a comment -

          just because zookeeper started in research does not mean that we intended it to be just a research project. bookkeeper was made specifically to address a production issue in HDFS. you are going to write a quorum system from scratch. it's a research project. (it's hard too!) the comparison will be interesting, although you don't have to write any code to see what the problems with the approaches are. and i agree in the end it will be pretty easy to objectively choose between the two, so it is useful to be a proof point to reference in the future.

          Todd Lipcon added a comment -

          Hi Flavio. The philosophy of having building blocks is a great one. I'm an ardent supporter of using ZooKeeper, and other building blocks which fit our needs. I think BookKeeper, though, is like using a chainsaw to get a haircut. It does lots of stuff we don't need, and will take a ton of work to move it to fit the other requirements we do have.

          Also, if these aspects are important for you, why don't you want to contribute them to the project?

          Moving BK over to Hadoop-based configuration, IPC, security, local edit log implementation, quorum based commit, etc, would leave almost nothing left of BK, except for (a) features we don't need, like striping and interleaved logs and (b) the core protocol, which you have described yourself as relatively simple.

          Finally, we have focused on the implementation of core protocols like zab and the quorum consensus of BookKeeper. Why not leverage this experience and focus?

          I absolutely am leveraging that experience. I spent some 10 hours this weekend studying your papers and presentations on BK and ZAB. BK is clearly a very similar system and I'm sure I will reference its design while working on this system.

          Regardless, as I said above, I intend to continue down this avenue. Please continue working on the BookKeeper one if you think it is a better avenue. If my proposed solution turns into a disaster because I screw up the quorum implementations, I'm sure no one will use it, and I'll be glad that you continued to work on an alternate. The joy of open source and pluggable implementations is that we can both prove our ideas in code and let the community vote with their feet.

          I don't think this argument is likely to be fruitful if we continue. So let's just agree to disagree and each of us can get to work on proving the other wrong with actual code.

          Eli Collins added a comment -

          Flavio, Ivan,

          BK has two of the same main issues that we have depending on an HA filer: (1) many users don't want to admin a separate storage system (even if you "embed" BK it will be discrete, fail independently etc) and (2) BK does a lot more than we need and is a non-trivial separate system to debug.

          If you guys want to push on BK, go for it, but the BK approach will not preclude alternative designs. That was a primary goal in HDFS-1623, to allow for more than one design.

          Thanks,
          Eli

          Flavio Junqueira added a comment -

          I'm not sure I understand why there is so much weight on not having a dependency in this discussion. It goes against one of the reasons why we have even considered doing projects like zookeeper or bookkeeper: they are building blocks. I understand that there is possibly a taste component here, but I believe that having such building blocks is important because it is difficult to get them right.

          Certainly, it's a "small matter of code" to add all of these things to BookKeeper. But given that BK is primarily a project maintained by a research organization, and none of the above are at all interesting from a research perspective, I don't think it's likely to happen any time soon.

          This is an incorrect assumption about the project. One major contributor and committer is not with a research organization. Also, if these aspects are important for you, why don't you want to contribute them to the project? It would certainly help to get more contributors and grow the community.

          I also haven't seen a discussion on the bookkeeper-dev list to understand the status of the project and its directions from HDFS folks. Perhaps we are heading towards the direction you're pointing to and you don't know. Honestly, I don't think we have planned to cover all features you mention, but at least some we have. For example, we have a jira open for SSL, which we have moved for a future release because it is not a requirement for the applications that currently use BookKeeper. Here is a chance to influence another Apache project.

          Finally, we have focused on the implementation of core protocols like zab and the quorum consensus of BookKeeper. Why not leverage this experience and focus?

          Todd Lipcon added a comment -

          The flip side of this though is that if a ZK issue were identified it could be fixed and redeployed w/o requiring a re-release of Hadoop. ZK blocker issues are typically turned around quite rapidly, while rare on the stable codeline it's usually on order of days to a week.

          That's true of ZK issues. But is the same true of BK, which is marked "contrib" and isn't deployed in production anywhere except for one org AFAIK? Looking at recent commits to the BK tree, it's an awful lot of bug fixes relating to the indexing code that allows multiple logs to be interleaved, etc – issues stemming from its generality.

          Patrick Hunt added a comment -

          If there is a bug discovered in this code, we can fix it with a new Hadoop release without having to wait on a new release of ZooKeeper. Since ZK and HDFS may be managed by different ops teams, this also simplifies upgrade.

          The flip side of this though is that if a ZK issue were identified it could be fixed and redeployed w/o requiring a re-release of Hadoop. ZK blocker issues are typically turned around quite rapidly, while rare on the stable codeline it's usually on order of days to a week.

          My experience at large companies running ZK is that they typically have special purpose installations of ZK to support things such as this. There would be no splitting of ops responsibilities.

          Todd Lipcon added a comment -

          Sequencing writes between different writers is the hard part. BookKeeper seems to do this by using ZK to enforce mutual exclusion, which means at its heart it too relies on a consensus protocol to cope with these tricky failure cases. This makes it a very legitimate point in the design space, but one that shares plenty with the proposal here.

          Yep – rather than fully implement ZAB, my initial implementation will also rely on ZK for writer sequencing. If we want to use this outside the context of ZK (for example with a different failure detector for the NNs) we could move on to implement the epoch-sequencing bit of ZAB.
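          As one hedged illustration of how ZK-based writer sequencing could hand out unique, monotonically increasing epoch numbers (similar to the znode-version idea mentioned for HDFS-3092), here is a minimal sketch against the plain ZooKeeper client API. The znode path, class name, and error handling are hypothetical, and a real implementation would also deal with session expiry and retries.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: derive an epoch from a znode's data version via a conditional
// setData. Only one contender can bump any given version, so the resulting
// version is unique to the writer that performed the update.
class ZkEpochAllocatorSketch {
  private final ZooKeeper zk;
  private final String path;   // e.g. a per-namespace epoch znode (hypothetical)

  ZkEpochAllocatorSketch(ZooKeeper zk, String path) {
    this.zk = zk;
    this.path = path;
  }

  long allocateEpoch() throws KeeperException, InterruptedException {
    if (zk.exists(path, false) == null) {
      try {
        zk.create(path, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      } catch (KeeperException.NodeExistsException e) {
        // Another writer created it first; that is fine.
      }
    }
    Stat stat = new Stat();
    zk.getData(path, false, stat);
    // Conditional update: fails with BadVersionException if another writer
    // raced us, in which case the caller should re-read and retry or give up.
    Stat updated = zk.setData(path, new byte[0], stat.getVersion());
    return updated.getVersion();
  }
}
```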

          Todd Lipcon added a comment -

          These arguments seem very much to be a case of NIH.

          No, they're an argument for uniformity of code base. Hadoop's already a large project. Briefly skimming the BK code, I see:

          • A new NIO server which we'll have to understand and probably bugfix (we've spent literally years working on our own NIO server for IPC)
          • A bunch of ad-hoc serialization code (e.g. in BookieServer.java). We just spent a long time making Hadoop wire-compatible using protobufs. We don't want to inherit more code which uses ad-hoc serialization.
          • No metrics subsystem at all - we want to continue to make use of the existing metrics implementation in Hadoop
          • No SASL or SSL implementation. On-the-wire encryption is a requirement we're hearing more and more in Hadoop. Hadoop IPC already gives us SASL-based encryption
          • Password-based authentication instead of Kerberos-based. One more password to configure
          • Its own on-disk format for logs. So if you take a backup from a bookie, you can't use tools like the OEV to view them
          • A different file format, etc

          Certainly, it's a "small matter of code" to add all of these things to BookKeeper. But given that BK is primarily a project maintained by a research organization, and none of the above are at all interesting from a research perspective, I don't think it's likely to happen any time soon.

          Then, there is a valid NIH concern – or really not-maintained-here. As I said above, if we have a bug in BK, we need to (a) convince someone on the BK team to fix it, (b) get it into ZK trunk, (c) get the ZK team to make a new release, (d) check Hadoop against any other new changes in that release, (e) convince an operations team which may be distinct from the Hadoop ops team to update the ZooKeeper installation. That's really painful. If BK were a mature project with tons of production users, I'd agree we should just depend on it, given the number of bugs we'd likely find would be very low.

          Anyway, this JIRA isn't to argue against BookKeeper. If you want to keep exploring it, please go ahead - the advantage of a pluggable interface here is that different implementations may coexist.

          Also, I don't think ZAB is the right tool for this in any case. You have a single writer, which can therefore act as a sequencer on the entries. You just need to broadcast to an ensemble, and wait for quorum responses, as I outlined above for BookKeeper.

          We have a single writer, except for when we don't. During a failover, without a STONITH capability, we may have overlapping writers. Please see the examples above for why we need sequencing of multiple writers.
