Hadoop HDFS
HDFS-1623

High Availability Framework for HDFS NN

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-alpha
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    1. dfsio-results.tsv
      7 kB
      Todd Lipcon
    2. ha-testplan.pdf
      138 kB
      Todd Lipcon
    3. ha-testplan.tex
      17 kB
      Todd Lipcon
    4. HA-tests.pdf
      81 kB
      Hari Mankude
    5. HDFS-1623.rel23.patch
      1.35 MB
      Suresh Srinivas
    6. HDFS-1623.trunk.patch
      1.24 MB
      Jitendra Nath Pandey
    7. HDFS-High-Availability.pdf
      163 kB
      Eli Collins
    8. NameNode HA_v2_1.pdf
      435 kB
      Sanjay Radia
    9. NameNode HA_v2.pdf
      411 kB
      Sanjay Radia
    10. Namenode HA Framework.pdf
      96 kB
      Sanjay Radia

      Issue Links

      1.
      HA: Send block report from datanode to both active and standby namenodes Sub-task Resolved Sanjay Radia
       
      2.
      HA: Datanode fencing mechanism Sub-task Resolved Todd Lipcon
       
      3.
      HA: HDFS clients must handle namenode failover and switch over to the new active namenode. Sub-task Resolved Aaron T. Myers
       
      4.
      HA: Introduce active and standby states to the namenode Sub-task Resolved Suresh Srinivas
       
      5.
      HA: Support for sharing the namenode state from active to standby. Sub-task Resolved Jitendra Nath Pandey
       
      6.
      Client should not timeout during a NN start - esp useful for HA NN Sub-task Resolved Unassigned
       
      7.
      Remove NameNode roles Active and Standby (they become states) Sub-task Closed Suresh Srinivas
       
      8.
      Merge NameNode roles into NodeType. Sub-task Resolved Suresh Srinivas
       
      9.
      Configuration changes for HA namenode Sub-task Resolved Suresh Srinivas
       
      10.
      Log newly allocated blocks Sub-task Closed Todd Lipcon
       
      11.
      HA: Checkpointing in an HA setup Sub-task Resolved Todd Lipcon
       
      12.
      Start/stop appropriate namenode internal services during transition to active and standby Sub-task Resolved Suresh Srinivas
       
      13.
      Generalize the HAServiceProtocol interface Sub-task Resolved Justin Joseph
       
      14.
      HA: Enable the configuration of multiple HA cluster addresses Sub-task Resolved Aaron T. Myers
       
      15.
      Mark appropriate methods of ClientProtocol with the idempotent annotation Sub-task Resolved Aaron T. Myers
       
      16.
      Add tests for Namenode active standby states Sub-task Resolved Suresh Srinivas
       
      17.
      getServerDefaults and getStats don't check operation category Sub-task Resolved Aaron T. Myers
       
      18.
      Change ConfiguredFailoverProxyProvider to take advantage of HDFS-2231 Sub-task Resolved Aaron T. Myers
       
      19.
      Add HA-related metrics Sub-task Resolved Aaron T. Myers
       
      20.
      NameNode needs to add HAServiceProtocol to its RPC Server Sub-task Resolved Todd Lipcon
       
      21.
      HA: NN fails to start since it tries to start secret manager in safemode Sub-task Resolved Todd Lipcon
       
      22.
      Scope dfs.ha.namenodes config by nameservice Sub-task Resolved Todd Lipcon
       
      23.
      Add protobuf service and implementation for HAServiceProtocol Sub-task Resolved Suresh Srinivas
       
      24.
      HA: MiniDFSCluster support to mix and match federation with HA Sub-task Resolved Todd Lipcon
       
      25.
      HA: Balancer support for HA namenodes Sub-task Resolved Uma Maheswara Rao G
       
      26.
      NN should log newly-allocated blocks without losing BlockInfo Sub-task Resolved Aaron T. Myers
       
      27.
      HA: don't initialize replication queues until entering Active mode Sub-task Resolved Todd Lipcon
       
      28.
      HA: handle refreshNameNodes in federated HA clusters Sub-task Resolved Todd Lipcon
       
      29.
      Change DatanodeProtocol#sendHeartbeat to return HeartbeatResponse Sub-task Resolved Suresh Srinivas
       
      30.
      HA: fix TestDFSUpgrade on HA branch Sub-task Resolved Todd Lipcon
       
      31.
      HA: Add test case for hot standby capability Sub-task Resolved Todd Lipcon
       
      32.
      HA: ConfiguredFailoverProxyProvider doesn't correctly stop ProtocolTranslators Sub-task Resolved Todd Lipcon
       
      33.
      HA: TestDfsOverAvroRpc failing after introduction of HeartbeatResponse type Sub-task Resolved Todd Lipcon
       
      34.
      HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized Sub-task Resolved Todd Lipcon
       
      35.
      HA: determine DN's view of which NN is active based on heartbeat responses Sub-task Resolved Todd Lipcon
       
      36.
      Standby needs to ingest latest edit logs before transitioning to active Sub-task Resolved Todd Lipcon
       
      37.
      HA: Fix NN Active->Standby transition Sub-task Resolved Todd Lipcon
       
      38.
      HA: NN should throw StandbyException in response to RPCs in STANDBY state Sub-task Resolved Todd Lipcon
       
      39.
      HA: Web UI should indicate the NN state Sub-task Resolved Eli Collins
       
      40.
      HA: When a FailoverProxyProvider is used, DFSClient should not retry connection ten times before failing over Sub-task Resolved Aaron T. Myers
       
      41.
      Add interface to query current state to HAServiceProtocol Sub-task Resolved Eli Collins
       
      42.
      DFSClient should construct failover proxy with exponential backoff Sub-task Resolved Todd Lipcon
       
      43.
      HA: When a FailoverProxyProvider is used, Client should not retry 45 times (hard-coded value) if it is timing out connecting to the server. Sub-task Resolved Uma Maheswara Rao G
       
      44.
      Authority-based lookup of proxy provider fails if path becomes canonicalized Sub-task Resolved Todd Lipcon
       
      45.
      Fix up some failing unit tests on HA branch Sub-task Resolved Todd Lipcon
       
      46.
      HA: write tests for quota tracking and HA Sub-task Resolved Todd Lipcon
       
      47.
      HA: BookKeeperEditLogInputStream doesn't implement isInProgress() Sub-task Resolved Aaron T. Myers
       
      48.
      HA: Tests and fixes for pipeline targets and replica recovery Sub-task Resolved Todd Lipcon
       
      49.
      HA: Bugs related to failover from/into safe-mode Sub-task Resolved Todd Lipcon
       
      50.
      Synchronization issues around state transition Sub-task Resolved Todd Lipcon
       
      51.
      HA: Appropriately handle error conditions in EditLogTailer Sub-task Resolved Aaron T. Myers
       
      52.
      HA: An alternative approach to clients handling Namenode failover. Sub-task Resolved Uma Maheswara Rao G
       
      53.
      HA: Fix test cases which use standalone FSNamesystems Sub-task Resolved Todd Lipcon
       
      54.
      HA: Configuration needs to allow different dfs.http.addresses for each HA NN Sub-task Resolved Todd Lipcon
       
      55.
      HA: TestStandbyIsHot is failing while copying in_use.lock file from NN1 nameSpaceDirs to NN2 nameSpaceDirs Sub-task Resolved Uma Maheswara Rao G
       
      56.
      NN web UI can throw NPE after startup, before standby state is entered Sub-task Resolved Todd Lipcon
       
      57.
      HA: Refactor shared HA-related test code into HATestUtils class Sub-task Resolved Todd Lipcon
       
      58.
      Add support for the standby in the bin scripts Sub-task Resolved Eli Collins
       
      59.
      Document HA configuration and CLI Sub-task Resolved Aaron T. Myers
       
      60.
      HA: add tests for multiple shared edits dirs Sub-task Resolved Unassigned
       
      61.
      HA: support 2NN with SBN Sub-task Resolved Unassigned
       
      62.
      HA: Automatically trigger log rolls periodically on the active NN Sub-task Resolved Aaron T. Myers
       
      63.
      FSEditLog.selectinputStreams is reading through in-progress streams even when non-in-progress are requested Sub-task Resolved Aaron T. Myers
       
      64.
      HA: observed dataloss in replication stress test Sub-task Resolved Todd Lipcon
       
      65.
      HA: entering safe mode after starting SBN can NPE Sub-task Resolved Uma Maheswara Rao G
       
      66.
      HA: exit if multiple shared dirs are configured Sub-task Resolved Eli Collins
       
      67.
      Warm standby does not read the in_progress edit log Sub-task Resolved Unassigned
       
      68.
      TestCheckpoint is timing out Sub-task Resolved Uma Maheswara Rao G
       
      69.
      Standby namenode stuck in safemode during a failover Sub-task Resolved Hari Mankude
       
      70.
      HA: test for case where standby partially reads log and then performs checkpoint Sub-task Resolved Aaron T. Myers
       
      71.
      HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol Sub-task Resolved Uma Maheswara Rao G
       
      72.
      HA: When HA is enabled with a shared edits dir, that dir should be marked required Sub-task Resolved Aaron T. Myers
       
      73.
      HA: On transition to active, standby should not swallow ELIE Sub-task Resolved Aaron T. Myers
       
      74.
      HA: reading edit logs from an earlier version leaves blocks in under-construction state Sub-task Resolved Todd Lipcon
       
      75.
      HA: TestStandbyCheckpoints.testBothNodesInStandbyState fails intermittently Sub-task Resolved Todd Lipcon
       
      76.
      HA: Add lease recovery handling to HA Sub-task Resolved Suresh Srinivas
       
      77.
      TestHAAdmin.testFailover is failing Sub-task Resolved Eli Collins
       
      78.
      HA: Make fsck work Sub-task Resolved Aaron T. Myers
       
      79.
      HA: Active NN may purge edit log files before standby NN has a chance to read them Sub-task Resolved Todd Lipcon
       
      80.
      HA: Standby NN takes a long time to recover from a dead DN starting up Sub-task Resolved Todd Lipcon
       
      81.
      TestDFSClientFailover should use simpleHATopology. Sub-task Resolved Uma Maheswara Rao G
       
      82.
      SBN should not mark blocks under-replicated when exiting safemode Sub-task Resolved Todd Lipcon
       
      83.
      HA: Add a test for a federated cluster with HA NNs Sub-task Resolved Brandon Li
       
      84.
      ConfiguredFailoverProxyProvider should be moved to common. Sub-task Resolved Bikas Saha
       
      85.
      Service level authorization for HAServiceProtocol Sub-task Resolved Jitendra Nath Pandey
       
      86.
      HA: haadmin should use namenode ids Sub-task Resolved Eli Collins
       
      87.
      Add test to verify that delegation tokens are honored after failover. Sub-task Resolved Jitendra Nath Pandey
       
      88.
      HA: Client should fail if a failover occurs which switches block pool ID Sub-task Resolved Brandon Li
       
      89.
      When becoming active, NN should treat all leases as freshly renewed Sub-task Resolved Todd Lipcon
       
      90.
      Document new HA-related configs in hdfs-default.xml Sub-task Resolved Eli Collins
       
      91.
      Add a simple sanity check for HA config Sub-task Resolved Todd Lipcon
       
      92.
      HA: Transition to active can cause NN deadlock Sub-task Resolved Aaron T. Myers
       
      93.
      HA: failover does not succeed if prior NN died just after creating an edit log segment Sub-task Resolved Aaron T. Myers
       
      94.
      HA: Fix ConfiguredFailoverProxyProvider for federation Sub-task Resolved Aaron T. Myers
       
      95.
      HA: HAAdmin does not work if security is enabled Sub-task Resolved Aaron T. Myers
       
      96.
      HA: TestSafeMode#testNoExtensionIfNoBlocks is failing Sub-task Resolved Uma Maheswara Rao G
       
      97.
      SBN should not allow browsing of the file system via web UI Sub-task Resolved Bikas Saha
       
      98.
      HA: NN fails to start if the shared edits dir is marked required Sub-task Resolved Aaron T. Myers
       
      99.
      DFSUtil.getSuffixIDs silently ignores exception in NetUtils.createSocketAddr Sub-task Resolved Bikas Saha
       
      100.
      LOCAL_ADDRESS_MATCHER.match has NPE when called from DFSUtil.getSuffixIDs when the host is incorrect Sub-task Resolved Bikas Saha
       
      101.
      HA: TestDFSRollback#testRollback is failing Sub-task Resolved Aaron T. Myers
       
      102.
      HA: checkpointing should verify that the dfs.http.address has been configured to a non-loopback for peer NN Sub-task Resolved Todd Lipcon
       
      103.
      Failures observed if dfs.edits.dir and shared.edits.dir have same directories. Sub-task Resolved Bikas Saha
       
      104.
      Standby does not start up due to a gap in transaction id Sub-task Resolved Hari Mankude
       
      105.
      Standby namenode gets a "cannot lock storage" exception during startup Sub-task Resolved Hari Mankude
       
      106.
      HA: Remove some INFO level logging accidentally left around Sub-task Resolved Unassigned
       
      107.
      HA: edit log should log to shared dirs before local dirs Sub-task Resolved Todd Lipcon
       
      108.
      HA: DFSUtil#getSuffixIDs should skip unset configurations Sub-task Resolved Aaron T. Myers
       
      109.
      HA: automatically determine the nameservice Id if only one nameservice is configured Sub-task Resolved Eli Collins
       
      110.
      Starting delegation token manager during safemode fails Sub-task Resolved Todd Lipcon
       
      111.
      HA: Improvements for SBN web UI - not show under-replicated/missing blocks Sub-task Resolved Brandon Li
       
      112.
      HA: Client support for getting delegation tokens to an HA cluster Sub-task Resolved Todd Lipcon
       
      113.
      HA: Standby NN NPE when shared edits dir is deleted Sub-task Resolved Bikas Saha
       
      114.
      HA: NPE if shared edits directory is not available during failover Sub-task Resolved Hari Mankude
       
      115.
      HA: Inaccessible shared edits dir not getting removed from FSImage storage dirs upon error Sub-task Resolved Bikas Saha
       
      116.
      HA: Namenode not shutting down when shared edits dir is inaccessible Sub-task Resolved Bikas Saha
       
      117.
      HA: TestFailureOfSharedDir.testFailureOfSharedDir() has race condition Sub-task Resolved Bikas Saha
       
      118.
      HA: haadmin should not work if run by regular user Sub-task Resolved Eli Collins
       
      119.
      HA: fix remaining TODO items Sub-task Resolved Aaron T. Myers
       
      120.
      HA: close out operation categories Sub-task Resolved Eli Collins
       
      121.
      HA: Standby checkpointing fails to authenticate in secure cluster Sub-task Resolved Todd Lipcon
       
      122.
      HA: enable hadoop security authorization for haadmin / protocols Sub-task Resolved Unassigned
       
      123.
      HA: ConfiguredFailoverProxyProvider should not create a NameNode proxy with an underlying retry proxy Sub-task Resolved Uma Maheswara Rao G
       
      124.
      HA: stress test and fixes for block synchronization Sub-task Resolved Todd Lipcon
       
      125.
      HA: Allow configs to be scoped to all NNs in the nameservice Sub-task Resolved Todd Lipcon
       
      126.
      HA: Shared edits dir property should be suffixed with nameservice and namenodeID Sub-task Resolved Todd Lipcon
       
      127.
      HA: TestDFSHAAdmin needs tests with MiniDFSCluster Sub-task Resolved Brandon Li
       
      128.
      HA: TestHAStateTransitions fails on Windows Sub-task Resolved Uma Maheswara Rao G
       
      129.
      HA: NullPointerException while formatting NameNode (after configuring HA) Sub-task Resolved Uma Maheswara Rao G
       
      130.
      HA: TestActiveStandbyElectorRealZK fails if build dir does not exist Sub-task Resolved Aaron T. Myers
       
      131.
      HA: On startup NN throws an NPE in the metrics system Sub-task Resolved Aaron T. Myers
       
      132.
      HA: NN throws NPE during shutdown if it fails to startup Sub-task Resolved Todd Lipcon
       
      133.
      HA: NN should not start with upgrade option or with a pending unfinalized upgrade Sub-task Resolved Aaron T. Myers
       
      134.
      HA: IllegalStateException during standby startup in getCurSegmentTxId Sub-task Resolved Hari Mankude
       
      135.
      HA: Sweep for remaining proxy construction which doesn't go through failover path Sub-task Resolved Aaron T. Myers
       
      136.
      HA: TestDFSUtil is failing Sub-task Resolved Uma Maheswara Rao G
       
      137.
      HA: small optimization building incremental block report Sub-task Resolved Todd Lipcon
       
      138.
      HA: re-enable NO_ACK optimization for block deletion Sub-task Resolved Todd Lipcon
       
      139.
      HA: MiniDFSCluster does not delete standby NN name dirs during format Sub-task Resolved Aaron T. Myers
       
      140.
      HA: TestBalancerWithHANameNodes is failing Sub-task Resolved Aaron T. Myers
       
      141.
      HA: Balancer should use logical uri for creating failover proxy with HA enabled. Sub-task Resolved Aaron T. Myers
       
      142.
      HA: BackupNode#checkOperation should permit CHECKPOINT operations Sub-task Resolved Eli Collins
       
      143.
      Get performance on HA branch to match trunk Sub-task Resolved Todd Lipcon
       
      144.
      HA: NameNode format doesn't pick up dfs.namenode.name.dir.NameServiceId configuration Sub-task Resolved Mingjie Lai
       
      145.
      HA: Implement a simple NN health check Sub-task Resolved Aaron T. Myers
       
      146.
      HA: fix failure of TestFileAppendRestart due to OP_UPDATE_BLOCKS Sub-task Resolved Todd Lipcon
       
      147.
      HA: Address findbugs and javadoc warnings on branch Sub-task Resolved Todd Lipcon
       
      148.
      HA: NPE in FSNamesystem when in safe mode Sub-task Resolved Gregory Chanan
       


          People

          • Assignee: Unassigned
          • Reporter: Sanjay Radia
          • Votes: 0
          • Watchers: 106
