  1. Hadoop HDFS
  2. HDFS-1623

High Availability Framework for HDFS NN

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-alpha
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Attachments

      1. dfsio-results.tsv
        7 kB
        Todd Lipcon
      2. ha-testplan.pdf
        138 kB
        Todd Lipcon
      3. ha-testplan.tex
        17 kB
        Todd Lipcon
      4. HA-tests.pdf
        81 kB
        Hari Mankude
      5. HDFS-1623.rel23.patch
        1.35 MB
        Suresh Srinivas
      6. HDFS-1623.trunk.patch
        1.24 MB
        Jitendra Nath Pandey
      7. HDFS-High-Availability.pdf
        163 kB
        Eli Collins
      8. NameNode HA_v2_1.pdf
        435 kB
        Sanjay Radia
      9. NameNode HA_v2.pdf
        411 kB
        Sanjay Radia
      10. Namenode HA Framework.pdf
        96 kB
        Sanjay Radia

        Issue Links

        1.
        HA: Send block report from datanode to both active and standby namenodes Sub-task Resolved Sanjay Radia
        2.
        HA: Datanode fencing mechanism Sub-task Resolved Todd Lipcon
        3.
        HA: HDFS clients must handle namenode failover and switch over to the new active namenode. Sub-task Resolved Aaron T. Myers
        4.
        HA: Introduce active and standby states to the namenode Sub-task Resolved Suresh Srinivas
        5.
        HA: Support for sharing the namenode state from active to standby. Sub-task Resolved Jitendra Nath Pandey
        6.
        Client should not time out during a NN start - especially useful for HA NN Sub-task Resolved Unassigned
        7.
        Remove NameNode roles Active and Standby (they become states) Sub-task Closed Suresh Srinivas
        8.
        Merge NameNode roles into NodeType. Sub-task Resolved Suresh Srinivas
        9.
        Configuration changes for HA namenode Sub-task Resolved Suresh Srinivas
        10.
        Log newly allocated blocks Sub-task Closed Todd Lipcon
        11.
        HA: Checkpointing in an HA setup Sub-task Resolved Todd Lipcon
        12.
        Start/stop appropriate namenode internal services during transition to active and standby Sub-task Resolved Suresh Srinivas
        13.
        Generalize the HAServiceProtocol interface Sub-task Resolved Justin Joseph
        14.
        HA: Enable the configuration of multiple HA cluster addresses Sub-task Resolved Aaron T. Myers
        15.
        Mark appropriate methods of ClientProtocol with the idempotent annotation Sub-task Resolved Aaron T. Myers
        16.
        Add tests for Namenode active standby states Sub-task Resolved Suresh Srinivas
        17.
        getServerDefaults and getStats don't check operation category Sub-task Resolved Aaron T. Myers
        18.
        Change ConfiguredFailoverProxyProvider to take advantage of HDFS-2231 Sub-task Resolved Aaron T. Myers
        19.
        Add HA-related metrics Sub-task Resolved Aaron T. Myers
        20.
        NameNode needs to add HAServiceProtocol to its RPC Server Sub-task Resolved Todd Lipcon
        21.
        HA: NN fails to start since it tries to start secret manager in safemode Sub-task Resolved Todd Lipcon
        22.
        Scope dfs.ha.namenodes config by nameservice Sub-task Resolved Todd Lipcon
        23.
        Add protobuf service and implementation for HAServiceProtocol Sub-task Resolved Suresh Srinivas
        24.
        HA: MiniDFSCluster support to mix and match federation with HA Sub-task Resolved Todd Lipcon
        25.
        HA: Balancer support for HA namenodes Sub-task Resolved Uma Maheswara Rao G
        26.
        NN should log newly-allocated blocks without losing BlockInfo Sub-task Resolved Aaron T. Myers
        27.
        HA: don't initialize replication queues until entering Active mode Sub-task Resolved Todd Lipcon
        28.
        HA: handle refreshNameNodes in federated HA clusters Sub-task Resolved Todd Lipcon
        29.
        Change DatanodeProtocol#sendHeartbeat to return HeartbeatResponse Sub-task Resolved Suresh Srinivas
        30.
        HA: fix TestDFSUpgrade on HA branch Sub-task Resolved Todd Lipcon
        31.
        HA: Add test case for hot standby capability Sub-task Resolved Todd Lipcon
        32.
        HA: ConfiguredFailoverProxyProvider doesn't correctly stop ProtocolTranslators Sub-task Resolved Todd Lipcon
        33.
        HA: TestDfsOverAvroRpc failing after introduction of HeartbeatResponse type Sub-task Resolved Todd Lipcon
        34.
        HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized Sub-task Resolved Todd Lipcon
        35.
        HA: determine DN's view of which NN is active based on heartbeat responses Sub-task Resolved Todd Lipcon
        36.
        Standby needs to ingest latest edit logs before transitioning to active Sub-task Resolved Todd Lipcon
        37.
        HA: Fix NN Active->Standby transition Sub-task Resolved Todd Lipcon
        38.
        HA: NN should throw StandbyException in response to RPCs in STANDBY state Sub-task Resolved Todd Lipcon
        39.
        HA: Web UI should indicate the NN state Sub-task Resolved Eli Collins
        40.
        HA: When a FailoverProxyProvider is used, DFSClient should not retry connection ten times before failing over Sub-task Resolved Aaron T. Myers
        41.
        Add interface to query current state to HAServiceProtocol Sub-task Resolved Eli Collins
        42.
        DFSClient should construct failover proxy with exponential backoff Sub-task Resolved Todd Lipcon
        43.
        HA: When a FailoverProxyProvider is used, Client should not retry 45 times (hard-coded value) if it is timing out while connecting to the server. Sub-task Resolved Uma Maheswara Rao G
        44.
        Authority-based lookup of proxy provider fails if path becomes canonicalized Sub-task Resolved Todd Lipcon
        45.
        Fix up some failing unit tests on HA branch Sub-task Resolved Todd Lipcon
        46.
        HA: write tests for quota tracking and HA Sub-task Resolved Todd Lipcon
        47.
        HA: BookKeeperEditLogInputStream doesn't implement isInProgress() Sub-task Resolved Aaron T. Myers
        48.
        HA: Tests and fixes for pipeline targets and replica recovery Sub-task Resolved Todd Lipcon
        49.
        HA: Bugs related to failover from/into safe-mode Sub-task Resolved Todd Lipcon
        50.
        Synchronization issues around state transition Sub-task Resolved Todd Lipcon
        51.
        HA: Appropriately handle error conditions in EditLogTailer Sub-task Resolved Aaron T. Myers
        52.
        HA: An alternative approach to clients handling Namenode failover. Sub-task Resolved Uma Maheswara Rao G
        53.
        HA: Fix test cases which use standalone FSNamesystems Sub-task Resolved Todd Lipcon
        54.
        HA: Configuration needs to allow different dfs.http.addresses for each HA NN Sub-task Resolved Todd Lipcon
        55.
        HA: TestStandbyIsHot is failing while copying in_use.lock file from NN1 nameSpaceDirs to NN2 nameSpaceDirs Sub-task Resolved Uma Maheswara Rao G
        56.
        NN web UI can throw NPE after startup, before standby state is entered Sub-task Resolved Todd Lipcon
        57.
        HA: Refactor shared HA-related test code into HATestUtils class Sub-task Resolved Todd Lipcon
        58.
        Add support for the standby in the bin scripts Sub-task Resolved Eli Collins
        59.
        Document HA configuration and CLI Sub-task Resolved Aaron T. Myers
        60.
        HA: add tests for multiple shared edits dirs Sub-task Resolved Unassigned
        61.
        HA: support 2NN with SBN Sub-task Resolved Unassigned
        62.
        HA: Automatically trigger log rolls periodically on the active NN Sub-task Resolved Aaron T. Myers
        63.
        FSEditLog.selectinputStreams is reading through in-progress streams even when non-in-progress are requested Sub-task Resolved Aaron T. Myers
        64.
        HA: observed dataloss in replication stress test Sub-task Resolved Todd Lipcon
        65.
        HA: entering safe mode after starting SBN can NPE Sub-task Resolved Uma Maheswara Rao G
        66.
        HA: exit if multiple shared dirs are configured Sub-task Resolved Eli Collins
        67.
        Warm standby does not read the in_progress edit log Sub-task Resolved Unassigned
        68.
        TestCheckpoint is timing out Sub-task Resolved Uma Maheswara Rao G
        69.
        Standby namenode stuck in safemode during a failover Sub-task Resolved Hari Mankude
        70.
        HA: test for case where standby partially reads log and then performs checkpoint Sub-task Resolved Aaron T. Myers
        71.
        HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol Sub-task Resolved Uma Maheswara Rao G
        72.
        HA: When HA is enabled with a shared edits dir, that dir should be marked required Sub-task Resolved Aaron T. Myers
        73.
        HA: On transition to active, standby should not swallow ELIE Sub-task Resolved Aaron T. Myers
        74.
        HA: reading edit logs from an earlier version leaves blocks in under-construction state Sub-task Resolved Todd Lipcon
        75.
        HA: TestStandbyCheckpoints.testBothNodesInStandbyState fails intermittently Sub-task Resolved Todd Lipcon
        76.
        HA: Add lease recovery handling to HA Sub-task Resolved Suresh Srinivas
        77.
        TestHAAdmin.testFailover is failing Sub-task Resolved Eli Collins
        78.
        HA: Make fsck work Sub-task Resolved Aaron T. Myers
        79.
        HA: Active NN may purge edit log files before standby NN has a chance to read them Sub-task Resolved Todd Lipcon
        80.
        HA: Standby NN takes a long time to recover from a dead DN starting up Sub-task Resolved Todd Lipcon
        81.
        TestDFSClientFailover should use simpleHATopology. Sub-task Resolved Uma Maheswara Rao G
        82.
        SBN should not mark blocks under-replicated when exiting safemode Sub-task Resolved Todd Lipcon
        83.
        HA: Add a test for a federated cluster with HA NNs Sub-task Resolved Brandon Li
        84.
        ConfiguredFailoverProxyProvider should be moved to common. Sub-task Resolved Bikas Saha
        85.
        Service level authorization for HAServiceProtocol Sub-task Resolved Jitendra Nath Pandey
        86.
        HA: haadmin should use namenode ids Sub-task Resolved Eli Collins
        87.
        Add test to verify that delegation tokens are honored after failover. Sub-task Resolved Jitendra Nath Pandey
        88.
        HA: Client should fail if a failover occurs which switches block pool ID Sub-task Resolved Brandon Li
        89.
        When becoming active, NN should treat all leases as freshly renewed Sub-task Resolved Todd Lipcon
        90.
        Document new HA-related configs in hdfs-default.xml Sub-task Resolved Eli Collins
        91.
        Add a simple sanity check for HA config Sub-task Resolved Todd Lipcon
        92.
        HA: Transition to active can cause NN deadlock Sub-task Resolved Aaron T. Myers
        93.
        HA: failover does not succeed if prior NN died just after creating an edit log segment Sub-task Resolved Aaron T. Myers
        94.
        HA: Fix ConfiguredFailoverProxyProvider for federation Sub-task Resolved Aaron T. Myers
        95.
        HA: HAAdmin does not work if security is enabled Sub-task Resolved Aaron T. Myers
        96.
        HA: TestSafeMode#testNoExtensionIfNoBlocks is failing Sub-task Resolved Uma Maheswara Rao G
        97.
        SBN should not allow browsing of the file system via web UI Sub-task Resolved Bikas Saha
        98.
        HA: NN fails to start if the shared edits dir is marked required Sub-task Resolved Aaron T. Myers
        99.
        DFSUtil.getSuffixIDs silently ignores exception in NetUtils.createSocketAddr Sub-task Resolved Bikas Saha
        100.
        LOCAL_ADDRESS_MATCHER.match has NPE when called from DFSUtil.getSuffixIDs when the host is incorrect Sub-task Resolved Bikas Saha
        101.
        HA: TestDFSRollback#testRollback is failing Sub-task Resolved Aaron T. Myers
        102.
        HA: checkpointing should verify that the dfs.http.address has been configured to a non-loopback for peer NN Sub-task Resolved Todd Lipcon
        103.
        Failures observed if dfs.edits.dir and shared.edits.dir have same directories. Sub-task Resolved Bikas Saha
        104.
        Standby does not start up due to a gap in transaction id Sub-task Resolved Hari Mankude
        105.
        Standby namenode gets a "cannot lock storage" exception during startup Sub-task Resolved Hari Mankude
        106.
        HA: Remove some INFO level logging accidentally left around Sub-task Resolved Unassigned
        107.
        HA: edit log should log to shared dirs before local dirs Sub-task Resolved Todd Lipcon
        108.
        HA: DFSUtil#getSuffixIDs should skip unset configurations Sub-task Resolved Aaron T. Myers
        109.
        HA: automatically determine the nameservice Id if only one nameservice is configured Sub-task Resolved Eli Collins
        110.
        Starting delegation token manager during safemode fails Sub-task Resolved Todd Lipcon
        111.
        HA: Improvements for SBN web UI - not show under-replicated/missing blocks Sub-task Resolved Brandon Li
        112.
        HA: Client support for getting delegation tokens to an HA cluster Sub-task Resolved Todd Lipcon
        113.
        HA: Standby NN NPE when shared edits dir is deleted Sub-task Resolved Bikas Saha
        114.
        HA: NPE if shared edits directory is not available during failover Sub-task Resolved Hari Mankude
        115.
        HA: Inaccessible shared edits dir not getting removed from FSImage storage dirs upon error Sub-task Resolved Bikas Saha
        116.
        HA: Namenode not shutting down when shared edits dir is inaccessible Sub-task Resolved Bikas Saha
        117.
        HA: TestFailureOfSharedDir.testFailureOfSharedDir() has race condition Sub-task Resolved Bikas Saha
        118.
        HA: haadmin should not work if run by regular user Sub-task Resolved Eli Collins
        119.
        HA: fix remaining TODO items Sub-task Resolved Aaron T. Myers
        120.
        HA: close out operation categories Sub-task Resolved Eli Collins
        121.
        HA: Standby checkpointing fails to authenticate in secure cluster Sub-task Resolved Todd Lipcon
        122.
        HA: enable hadoop security authorization for haadmin / protocols Sub-task Resolved Unassigned
        123.
        HA: ConfiguredFailoverProxyProvider should not create a NameNode proxy with an underlying retry proxy Sub-task Resolved Uma Maheswara Rao G
        124.
        HA: stress test and fixes for block synchronization Sub-task Resolved Todd Lipcon
        125.
        HA: Allow configs to be scoped to all NNs in the nameservice Sub-task Resolved Todd Lipcon
        126.
        HA: Shared edits dir property should be suffixed with nameservice and namenodeID Sub-task Resolved Todd Lipcon
        127.
        HA: TestDFSHAAdmin needs tests with MiniDFSCluster Sub-task Resolved Brandon Li
        128.
        HA: TestHAStateTransitions fails on Windows Sub-task Resolved Uma Maheswara Rao G
        129.
        HA: NullPointerException while formatting NameNode(After Configuring HA) Sub-task Resolved Uma Maheswara Rao G
        130.
        HA: TestActiveStandbyElectorRealZK fails if build dir does not exist Sub-task Resolved Aaron T. Myers
        131.
        HA: On startup NN throws an NPE in the metrics system Sub-task Resolved Aaron T. Myers
        132.
        HA: NN throws NPE during shutdown if it fails to startup Sub-task Resolved Todd Lipcon
        133.
        HA: NN should not start with upgrade option or with a pending unfinalized upgrade Sub-task Resolved Aaron T. Myers
        134.
        HA: IllegalStateException during standby startup in getCurSegmentTxId Sub-task Resolved Hari Mankude
        135.
        HA: Sweep for remaining proxy construction which doesn't go through failover path Sub-task Resolved Aaron T. Myers
        136.
        HA: TestDFSUtil is failing Sub-task Resolved Uma Maheswara Rao G
        137.
        HA: small optimization building incremental block report Sub-task Resolved Todd Lipcon
        138.
        HA: re-enable NO_ACK optimization for block deletion Sub-task Resolved Todd Lipcon
        139.
        HA: MiniDFSCluster does not delete standby NN name dirs during format Sub-task Resolved Aaron T. Myers
        140.
        HA: TestBalancerWithHANameNodes is failing Sub-task Resolved Aaron T. Myers
        141.
        HA: Balancer should use logical uri for creating failover proxy with HA enabled. Sub-task Resolved Aaron T. Myers
        142.
        HA: BackupNode#checkOperation should permit CHECKPOINT operations Sub-task Resolved Eli Collins
        143.
        Get performance on HA branch to match trunk Sub-task Resolved Todd Lipcon
        144.
        HA: NameNode format doesn't pick up dfs.namenode.name.dir.NameServiceId configuration Sub-task Resolved Mingjie Lai
        145.
        HA: Implement a simple NN health check Sub-task Resolved Aaron T. Myers
        146.
        HA: fix failure of TestFileAppendRestart due to OP_UPDATE_BLOCKS Sub-task Resolved Todd Lipcon
        147.
        HA: Address findbugs and javadoc warnings on branch Sub-task Resolved Todd Lipcon
        148.
        HA: NPE in FSNamesystem when in safe mode Sub-task Resolved Gregory Chanan

            People

            • Assignee: Unassigned
            • Reporter: Sanjay Radia (sanjay.radia)
            • Votes: 0
            • Watchers: 110
