Hadoop HDFS / HDFS-3077

Quorum-based protocol for reading and writing edit logs

    Details

      Description

      One of the weak points of the current HA design is its reliance on shared storage, such as an NFS filer, for the shared edit log. One proposed alternative is BookKeeper, a ZooKeeper subproject that provides a highly available replicated edit log on commodity hardware. This JIRA implements another alternative: a quorum commit protocol that is integrated more tightly with HDFS and whose requirements are driven only by HDFS's needs rather than by more generic use cases. More details to follow.

        Attachments

        1. hdfs-3077.txt
          239 kB
          Todd Lipcon
        2. hdfs-3077.txt
          239 kB
          Todd Lipcon
        3. hdfs-3077.txt
          239 kB
          Todd Lipcon
        4. hdfs-3077.txt
          239 kB
          Todd Lipcon
        5. hdfs-3077.txt
          231 kB
          Todd Lipcon
        6. hdfs-3077.txt
          214 kB
          Todd Lipcon
        7. hdfs-3077.txt
          209 kB
          Todd Lipcon
        8. hdfs-3077-branch-2.txt
          527 kB
          Todd Lipcon
        9. hdfs-3077-partial.txt
          110 kB
          Todd Lipcon
        10. hdfs-3077-test-merge.txt
          525 kB
          Todd Lipcon
        11. qjournal-design.pdf
          293 kB
          Todd Lipcon
        12. qjournal-design.pdf
          287 kB
          Todd Lipcon
        13. qjournal-design.pdf
          285 kB
          Todd Lipcon
        14. qjournal-design.pdf
          275 kB
          Todd Lipcon
        15. qjournal-design.pdf
          251 kB
          Todd Lipcon
        16. qjournal-design.pdf
          229 kB
          Todd Lipcon
        17. qjournal-design.tex
          48 kB
          Todd Lipcon
        18. qjournal-design.tex
          43 kB
          Todd Lipcon

          Issue Links

          1. Upgrade guava to 11.0.2 (Sub-task, Closed, Todd Lipcon)
          2. Add infrastructure for waiting for a quorum of ListenableFutures to respond (Sub-task, Resolved, Todd Lipcon)
          3. Add preliminary QJournalProtocol interface, translators (Sub-task, Resolved, Todd Lipcon)
          4. Simple refactors in existing NN code to assist QuorumJournalManager extension (Sub-task, Closed, Todd Lipcon)
          5. Allow EditLogFileInputStream to read from a remote URL (Sub-task, Closed, Todd Lipcon)
          6. Supply NamespaceInfo when instantiating JournalManagers (Sub-task, Closed, Todd Lipcon)
          7. Active NN should exit when it cannot write to quorum number of Journal Daemons (Sub-task, Resolved, Unassigned)
          8. Add class to manage JournalList (Sub-task, Resolved, Unassigned)
          9. QJM: support purgeEditLogs() call to remotely purge logs (Sub-task, Resolved, Todd Lipcon)
          10. QJM: JNStorage should read its storage info even before a writer becomes active (Sub-task, Resolved, Todd Lipcon)
          11. QJM: Fix getEditLogManifest to fetch httpPort if necessary (Sub-task, Resolved, Todd Lipcon)
          12. Genericize format() to non-file JournalManagers (Sub-task, Closed, Todd Lipcon)
          13. Fix QJM startup when individual JNs have gaps (Sub-task, Resolved, Todd Lipcon)
          14. QJM: if a logger misses an RPC, don't retry that logger until next segment (Sub-task, Resolved, Todd Lipcon)
          15. QJM: exhaustive failure injection test for skipped RPCs (Sub-task, Resolved, Todd Lipcon)
          16. QJM: improve formatting behavior for JNs (Sub-task, Open, Unassigned)
          17. JournalManager#format() should be able to throw IOException (Sub-task, Closed, Ivan Kelly)
          18. Implement genericized format() in QJM (Sub-task, Resolved, Todd Lipcon)
          19. QJM: validate journal dir at startup (Sub-task, Resolved, Todd Lipcon)
          20. QJM: add segment txid as a parameter to journal() RPC (Sub-task, Resolved, Todd Lipcon)
          21. Avoid throwing NPE when finalizeSegment() is called on invalid segment (Sub-task, Resolved, Todd Lipcon)
          22. QJM: handle empty log segments during recovery (Sub-task, Resolved, Todd Lipcon)
          23. QJM: improvements to QJM fault testing (Sub-task, Resolved, Todd Lipcon)
          24. QJM: hadoop-daemon.sh should be updated to accept "journalnode" (Sub-task, Resolved, Eli Collins)
          25. Fixes for edge cases in QJM recovery protocol (Sub-task, Resolved, Todd Lipcon)
          26. QJM: implement md5sum verification (Sub-task, Open, Todd Lipcon)
          27. QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics (Sub-task, Patch Available, Yi Liu)
          28. QJM: track last "committed" txid (Sub-task, Resolved, Todd Lipcon)
          29. QJM: Support rolling restart of JNs (Sub-task, Open, Todd Lipcon)
          30. QJM: expose non-file journal manager details in web UI (Sub-task, Resolved, Todd Lipcon)
          31. QJM: add metrics to JournalNode (Sub-task, Resolved, Todd Lipcon)
          32. QJM: Provide defaults for dfs.journalnode.*address (Sub-task, Resolved, Eli Collins)
          33. QJM: Journal format() should reset cached values (Sub-task, Resolved, Todd Lipcon)
          34. QJM: optimize log sync when JN is lagging behind (Sub-task, Resolved, Todd Lipcon)
          35. QJM: SBN fails if selectInputStreams throws RTE (Sub-task, Resolved, Todd Lipcon)
          36. QJM: Make QJM work with security enabled (Sub-task, Resolved, Aaron T. Myers)
          37. QJM: testRecoverAfterDoubleFailures can be flaky due to IPC client caching (Sub-task, Resolved, Todd Lipcon)
          38. QJM: enable TCP_NODELAY for IPC (Sub-task, Resolved, Todd Lipcon)
          39. QJM: Writer-side metrics (Sub-task, Resolved, Todd Lipcon)
          40. QJM: avoid validating log segments on log rolls (Sub-task, Resolved, Todd Lipcon)
          41. QJM: send 'heartbeat' messages to JNs even when they are out-of-sync (Sub-task, Resolved, Todd Lipcon)
          42. QJM: journalnode does not die/log ERROR when keytab is not found in secure mode (Sub-task, Resolved, Unassigned)
          43. QJM: quorum timeout on failover with large log segment (Sub-task, Resolved, Todd Lipcon)
          44. QJM: acceptRecovery should abort current segment (Sub-task, Resolved, Todd Lipcon)
          45. QJM: Failover fails with auth error in secure cluster (Sub-task, Resolved, Todd Lipcon)
          46. JournalNodes log JournalNotFormattedException backtrace error before being formatted (Sub-task, Resolved, Todd Lipcon)
          47. QJM: Add user documentation for QJM (Sub-task, Resolved, Aaron T. Myers)
          48. QJM: Add JournalNode to the start / stop scripts (Sub-task, Closed, Andy Isaacson)
          49. QJM: remove currently unused "md5sum" field. (Sub-task, Resolved, Todd Lipcon)
          50. QJM: misc TODO cleanup, improved log messages, etc (Sub-task, Resolved, Todd Lipcon)
          51. QJM: Make acceptRecovery() atomic (Sub-task, Resolved, Todd Lipcon)
          52. QJM: purge temporary files when no longer within retention period (Sub-task, Resolved, Todd Lipcon)
          53. TestJournalNode#testJournal fails because of test case execution order (Sub-task, Resolved, Chao Shi)
          54. Unclosed FileInputStream in GetJournalEditServlet (Sub-task, Resolved, Chao Shi)
          55. QJM: Synchronize past log segments to JNs that missed them (Sub-task, Resolved, Hanisha Koneru)
          56. QJM: Merge newEpoch and prepareRecovery (Sub-task, Open, Suresh Srinivas)

              People

              • Assignee: Todd Lipcon (tlipcon)
              • Reporter: Todd Lipcon (tlipcon)
              • Votes: 0
              • Watchers: 87
