Hadoop HDFS / HDFS-3077

Quorum-based protocol for reading and writing edit logs

    Details

      Description

      Currently, one of the weak points of the HA design is that it relies on shared storage, such as an NFS filer, for the shared edit log. One proposed alternative is to depend on BookKeeper, a ZooKeeper subproject that provides a highly available replicated edit log on commodity hardware. This JIRA implements another alternative based on a quorum commit protocol, integrated more tightly into HDFS, with requirements driven only by HDFS's needs rather than by more generic use cases. More details to follow.
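
      To make the idea of a quorum commit concrete, here is a minimal sketch, not the code attached to this issue. It assumes a simple majority rule: an edit counts as committed once more than half of the journal replicas acknowledge it. The names `QuorumCommitSketch`, `Journal`, `InMemoryJournal`, `majority` and `commit` are hypothetical and exist only for illustration. The sketch writes synchronously and ignores recovery, fencing and epochs, all of which a real protocol would need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a toy majority-commit rule, not the HDFS-3077 implementation. */
public class QuorumCommitSketch {

    /** Hypothetical stand-in for one journal replica. */
    interface Journal {
        /** Returns true if the edit was stored by this replica. */
        boolean append(long txid, String edit);
    }

    /** In-memory replica that can be switched into a failed state. */
    static class InMemoryJournal implements Journal {
        final List<String> edits = new ArrayList<>();
        boolean healthy = true;

        @Override
        public boolean append(long txid, String edit) {
            if (!healthy) {
                return false; // simulate an unreachable or failed replica
            }
            edits.add(txid + ":" + edit);
            return true;
        }
    }

    /** Smallest number of replicas that forms a strict majority of n. */
    static int majority(int n) {
        return n / 2 + 1;
    }

    /** Sends the edit to every replica; reports success only if a majority acknowledges. */
    static boolean commit(List<? extends Journal> journals, long txid, String edit) {
        int acks = 0;
        for (Journal journal : journals) {
            if (journal.append(txid, edit)) {
                acks++;
            }
        }
        return acks >= majority(journals.size());
    }

    public static void main(String[] args) {
        InMemoryJournal a = new InMemoryJournal();
        InMemoryJournal b = new InMemoryJournal();
        InMemoryJournal c = new InMemoryJournal();
        List<InMemoryJournal> journals = Arrays.asList(a, b, c);

        System.out.println(commit(journals, 1, "mkdir /a")); // true: 3 of 3 acknowledge
        c.healthy = false;
        System.out.println(commit(journals, 2, "mkdir /b")); // true: 2 of 3 acknowledge
        b.healthy = false;
        System.out.println(commit(journals, 3, "mkdir /c")); // false: only 1 of 3 acknowledges
    }
}
```

      Note that in the last call replica `a` still stores the edit even though the commit fails. Reconciling that kind of divergence between replicas is the job of a recovery step, which this sketch leaves out.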

      Attachments

      1. hdfs-3077.txt
        239 kB
        Todd Lipcon
      2. hdfs-3077.txt
        239 kB
        Todd Lipcon
      3. hdfs-3077.txt
        239 kB
        Todd Lipcon
      4. hdfs-3077.txt
        239 kB
        Todd Lipcon
      5. hdfs-3077.txt
        231 kB
        Todd Lipcon
      6. hdfs-3077.txt
        214 kB
        Todd Lipcon
      7. hdfs-3077.txt
        209 kB
        Todd Lipcon
      8. hdfs-3077-branch-2.txt
        527 kB
        Todd Lipcon
      9. hdfs-3077-partial.txt
        110 kB
        Todd Lipcon
      10. hdfs-3077-test-merge.txt
        525 kB
        Todd Lipcon
      11. qjournal-design.pdf
        293 kB
        Todd Lipcon
      12. qjournal-design.pdf
        287 kB
        Todd Lipcon
      13. qjournal-design.pdf
        285 kB
        Todd Lipcon
      14. qjournal-design.pdf
        275 kB
        Todd Lipcon
      15. qjournal-design.pdf
        251 kB
        Todd Lipcon
      16. qjournal-design.pdf
        229 kB
        Todd Lipcon
      17. qjournal-design.tex
        48 kB
        Todd Lipcon
      18. qjournal-design.tex
        43 kB
        Todd Lipcon

        Issue Links

        1. Upgrade guava to 11.0.2 Sub-task Resolved Todd Lipcon
        2. Add infrastructure for waiting for a quorum of ListenableFutures to respond Sub-task Resolved Todd Lipcon
        3. Add preliminary QJournalProtocol interface, translators Sub-task Resolved Todd Lipcon
        4. Simple refactors in existing NN code to assist QuorumJournalManager extension Sub-task Closed Todd Lipcon
        5. Allow EditLogFileInputStream to read from a remote URL Sub-task Closed Todd Lipcon
        6. Supply NamespaceInfo when instantiating JournalManagers Sub-task Closed Todd Lipcon
        7. Active NN should exit when it cannot write to quorum number of Journal Daemons Sub-task Resolved Unassigned
        8. Add class to manage JournalList Sub-task Resolved Unassigned
        9. QJM: support purgeEditLogs() call to remotely purge logs Sub-task Resolved Todd Lipcon
        10. QJM: JNStorage should read its storage info even before a writer becomes active Sub-task Resolved Todd Lipcon
        11. QJM: Fix getEditLogManifest to fetch httpPort if necessary Sub-task Resolved Todd Lipcon
        12. Genericize format() to non-file JournalManagers Sub-task Closed Todd Lipcon
        13. Fix QJM startup when individual JNs have gaps Sub-task Resolved Todd Lipcon
        14. QJM: if a logger misses an RPC, don't retry that logger until next segment Sub-task Resolved Todd Lipcon
        15. QJM: exhaustive failure injection test for skipped RPCs Sub-task Resolved Todd Lipcon
        16. QJM: improve formatting behavior for JNs Sub-task Open Todd Lipcon
        17. JournalManager#format() should be able to throw IOException Sub-task Closed Ivan Kelly
        18. Implement genericized format() in QJM Sub-task Resolved Todd Lipcon
        19. QJM: validate journal dir at startup Sub-task Resolved Todd Lipcon
        20. QJM: add segment txid as a parameter to journal() RPC Sub-task Resolved Todd Lipcon
        21. Avoid throwing NPE when finalizeSegment() is called on invalid segment Sub-task Resolved Todd Lipcon
        22. QJM: handle empty log segments during recovery Sub-task Resolved Todd Lipcon
        23. QJM: improvements to QJM fault testing Sub-task Resolved Todd Lipcon
        24. QJM: hadoop-daemon.sh should be updated to accept "journalnode" Sub-task Resolved Eli Collins
        25. Fixes for edge cases in QJM recovery protocol Sub-task Resolved Todd Lipcon
        26. QJM: implement md5sum verification Sub-task Open Todd Lipcon
        27. QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics Sub-task Open Unassigned
        28. QJM: track last "committed" txid Sub-task Resolved Todd Lipcon
        29. QJM: Support rolling restart of JNs Sub-task Open Todd Lipcon
        30. QJM: expose non-file journal manager details in web UI Sub-task Resolved Todd Lipcon
        31. QJM: add metrics to JournalNode Sub-task Resolved Todd Lipcon
        32. QJM: Provide defaults for dfs.journalnode.*address Sub-task Resolved Eli Collins
        33. QJM: Journal format() should reset cached values Sub-task Resolved Todd Lipcon
        34. QJM: optimize log sync when JN is lagging behind Sub-task Resolved Todd Lipcon
        35. QJM: SBN fails if selectInputStreams throws RTE Sub-task Resolved Todd Lipcon
        36. QJM: Make QJM work with security enabled Sub-task Resolved Aaron T. Myers
        37. QJM: testRecoverAfterDoubleFailures can be flaky due to IPC client caching Sub-task Resolved Todd Lipcon
        38. QJM: enable TCP_NODELAY for IPC Sub-task Resolved Todd Lipcon
        39. QJM: Writer-side metrics Sub-task Resolved Todd Lipcon
        40. QJM: avoid validating log segments on log rolls Sub-task Resolved Todd Lipcon
        41. QJM: send 'heartbeat' messages to JNs even when they are out-of-sync Sub-task Resolved Todd Lipcon
        42. QJM: journalnode does not die/log ERROR when keytab is not found in secure mode Sub-task Resolved Unassigned
        43. QJM: quorum timeout on failover with large log segment Sub-task Resolved Todd Lipcon
        44. QJM: acceptRecovery should abort current segment Sub-task Resolved Todd Lipcon
        45. QJM: Failover fails with auth error in secure cluster Sub-task Resolved Todd Lipcon
        46. JournalNodes log JournalNotFormattedException backtrace error before being formatted Sub-task Resolved Todd Lipcon
        47. QJM: Add user documentation for QJM Sub-task Resolved Aaron T. Myers
        48. QJM: Add JournalNode to the start / stop scripts Sub-task Closed Andy Isaacson
        49. QJM: remove currently unused "md5sum" field. Sub-task Resolved Todd Lipcon
        50. QJM: misc TODO cleanup, improved log messages, etc Sub-task Resolved Todd Lipcon
        51. QJM: Make acceptRecovery() atomic Sub-task Resolved Todd Lipcon
        52. QJM: purge temporary files when no longer within retention period Sub-task Resolved Todd Lipcon
        53. TestJournalNode#testJournal fails because of test case execution order Sub-task Resolved Chao Shi
        54. Unclosed FileInputStream in GetJournalEditServlet Sub-task Resolved Chao Shi
        55. QJM: Synchronize past log segments to JNs that missed them Sub-task Open Todd Lipcon
        56. QJM: Merge newEpoch and prepareRecovery Sub-task Open Suresh Srinivas
         

            People

            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 82
