Apache Ozone / HDDS-2823

SCM HA Support

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SCM HA
    • Target Version/s:

      Description

      OM HA is now close to feature complete. It is time to support SCM HA so that the system has no single point of failure (SPoF).

       

      Design doc: https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing

          Sub-Tasks

          1.
          Standalone SCM RatisServer Sub-task Resolved Li Cheng

          100%

          Original Estimate - Not Specified
          Time Spent - 20m
          2.
          SCM StateMachine Sub-task Resolved Li Cheng

          100%

          Original Estimate - Not Specified
          Time Spent - 20m
          3.
          Introduce generic SCMRatisRequest and SCMRatisResponse Sub-task Resolved Nanda kumar  
          4.
          SCM Invoke Handler for Ratis calls Sub-task Resolved Nanda kumar  
          5.
          Refactor configuration in SCMRatisServer to Java-based configuration Sub-task Resolved Li Cheng  
          6.
          Handle AllocateContainer operation for HA Sub-task Resolved Nanda kumar  
          7.
          New PipelineManager interface to persist to RatisServer Sub-task Resolved Li Cheng  
          8.
          Switch to PipelineStateManagerV2 and put PipelineFactory in PipelineManager Sub-task Resolved Li Cheng  
          9.
          Introduce SCMStateMachineHandler marker interface Sub-task Resolved Nanda kumar  
          10.
          Add unit tests for new PipelineManager interface Sub-task Resolved Li Cheng  
          11.
          Add unit test for SCMRatisResponse Sub-task Resolved Li Cheng  
          12.
          Add unit test for SCMRatisRequest Sub-task Resolved Li Cheng  
          13.
          Handle inner classes in SCMRatisRequest and SCMRatisResponse Sub-task Resolved Nanda kumar  
          14.
          decouple finalize and destroy pipeline Sub-task Resolved Li Cheng  
          15.
          Implement container related operations in ContainerManagerImpl Sub-task Resolved Nanda kumar  
          16.
          Switch current pipeline interface to the new Replication based interface to write to Ratis Sub-task Resolved Glen Geng  
          17.
          Add isLeader check for SCM state updates Sub-task Resolved Li Cheng  
          18.
          remove the 1st edition of RatisServer of SCM HA which is copied from OM HA Sub-task Resolved Glen Geng  
          19.
          update RATIS version from 1.0.0 to 1.1.0-85281b2-SNAPSHOT Sub-task Resolved Glen Geng  
          20.
          RATIS ONE Pipeline is closed but not removed when a datanode goes stale Sub-task Resolved Glen Geng  
          21.
          Pipeline is not removed when a datanode goes stale Sub-task Resolved Glen Geng  
          22.
          Add failover proxy to SCM block protocol Sub-task Resolved Li Cheng  
          23.
          enable SCM Raft Group based on config ozone.scm.names Sub-task Resolved Glen Geng  
          24.
          CLI command to show current SCM leader and follower status Sub-task Resolved Rui Wang  
          25.
          Switch to ContainerManagerV2 Sub-task Resolved Li Cheng  
          26.
          SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine Sub-task Resolved Glen Geng  
          27.
          Handle PipelineAction and OpenPipline from DN to SCM Sub-task Resolved Unassigned  
          28.
          Make sure AllocateBlock can only be executed on leader SCM Sub-task Resolved Unassigned  
          29.
          Handle NodeReport from DN to SCMs Sub-task Resolved Unassigned  
          30.
          Handle events fired from PipelineManager to close container Sub-task Resolved Unassigned  
          31.
          Handle ContainerReport and IncrementalContainerReport Sub-task Resolved Unassigned  
          32.
          Replication can only be executed on leader Sub-task Resolved Unassigned  
          33.
          Use new ContainerManager in SCM Sub-task Resolved Nanda kumar  
          34.
          Add failover proxy for SCM container client Sub-task Resolved Li Cheng  
          35.
          DN can distinguish SCMCommand from stale leader SCM Sub-task Resolved Glen Geng  
          36.
          Fix CI and test failures after force push on 2020/10/26 Sub-task Resolved Nanda kumar  
          37.
          Fix TestMiniOzoneHACluster.testGetOMLeader() Sub-task Resolved Rui Wang  
          38.
          Add ReadWriteLock into PipelineStateManagerV2Impl to protect contentions between RaftServer and PipelineManager Sub-task Resolved Glen Geng  
          39.
          Need throw exception to trigger FailoverProxyProvider of SCM client to work Sub-task Resolved Glen Geng  
          40.
          Remove checkLeader in PipelineManager. Sub-task Resolved Glen Geng  
          41.
          Add tests for replication annotation Sub-task Resolved Rui Wang  
          42.
          SCM ServiceManager Sub-task Resolved Glen Geng  
          43.
          Use getRoleInfoProto() in isLeader check Sub-task Resolved Glen Geng  
          44.
          Handle stale leader issue Sub-task Resolved Unassigned  
          45.
          Add Snapshot into new SCMRatisServer and SCMStateMachine Sub-task Resolved Rui Wang  
          46.
          SCM needs to replay RaftLog for recovery Sub-task Resolved Rui Wang  
          47.
          BackgroundPipelineCreator can only serve leader Sub-task Resolved Unassigned  
          48.
          Implement Ratis Snapshots on SCM Sub-task Resolved Rui Wang  
          49.
          DeleteBlock via Ratis in SCM HA Sub-task Resolved runzhiwang  
          50.
          Load Snapshot info upon SCM Ratis starts Sub-task Resolved Rui Wang  
          51.
          Allow Enabling Purge SCM Ratis log Sub-task Resolved Rui Wang  
          52.
          Stop BackgroundPipelineCreator when PipelineManager is closed Sub-task Resolved Rui Wang  
          53.
          SCMStateMachine::applyTransaction() should not invoke TransactionContext.getClientRequest() Sub-task Resolved Glen Geng  
          54.
          Fix SCMHAManager#getPeerIdFromRoleInfo Sub-task Resolved Glen Geng  
          55.
          Update pipeline db when pipeline state is changed Sub-task Resolved Shashikant Banerjee  
          56.
          Avoid rewriting pipeline information during PipelineStateManagerV2Impl initialization Sub-task Resolved Rui Wang  
          57.
          SCMContext Phase 1 - Raft Related Info Sub-task Resolved Glen Geng  
          58.
          SCMContext Sub-task Resolved Glen Geng  
          59.
          Handle potential data loss during ReplicationManager.handleOverReplicatedContainer() Sub-task Resolved Glen Geng  
          60.
          Refactor SCMHAManager and SCMRatisServer with RaftServer.Division Sub-task Resolved Glen Geng  
          61.
          Use OM style Configuration to initialize SCM HA Sub-task Resolved Rui Wang  
          62.
          Support SCM HA in MiniOzoneHACluster Sub-task Open Rui Wang  
          63.
          PipelineStateManagerV2Impl#removePipeline will remove pipeline from db in case of failure Sub-task Open Unassigned  
          64.
          Backport updates from ContainerManager(V1) Sub-task Open Unassigned  
          65.
          Backport updates from PipelineManager(V1) Sub-task Open Unassigned  
          66.
          Use suggestedLeader for SCM failover proxy performing failover Sub-task Resolved Unassigned  
          67.
          Add unit test for SCMHAInvocationHandler Sub-task Open Nanda kumar  
          68.
          Handle pipeline reports Sub-task Open Unassigned  
          69.
          Handle ContainerAction and CloseContainer Sub-task Open Unassigned  
          70.
          Arrange Util classes for SCM HA Sub-task Open Nanda kumar  
          71.
          SCM CLI command towards certain IP Sub-task Open Unassigned  
          72.
          Update javadoc in SCMHA related classes Sub-task Open Nanda kumar  
          73.
          Revisit SCM client retry and failover when SCM leader changes Sub-task Open Shashikant Banerjee  
          74.
          Design for SCM HA configuration Sub-task Open Unassigned  
          75.
          Provide docker-compose for SCM HA Sub-task Open Unassigned  
          76.
          Refactor out Ratis logic chain Sub-task Open Unassigned  
          77.
          SafeMode exit rule for all SCMs Sub-task Open Unassigned  
          78.
          Decommission can be only executed on leader Sub-task Open Rui Wang  
          79.
          Bootstrap SCM HA Security Sub-task Resolved Bharat Viswanadham  
          80.
          CLI for SCMs info Sub-task Open Unassigned  
          81.
          Design for Error/Exception handling in state update for container/pipeline V2 Sub-task Open Glen Geng  
          82.
          acceptance test for SCM HA Sub-task Resolved Bharat Viswanadham  
          83.
          Add unit test for container operation in ContainerManagerImpl Sub-task Open Nanda kumar  
          84.
          replace scmID with clusterID for container and volume at Datanode side Sub-task Open Glen Geng  
          85.
          In ContainerStateManagerV2, modification of RocksDB should be consistent with that of memory state. Sub-task Open Glen Geng  
          86.
          Fix Recon after HDDS-4133 Sub-task Patch Available Nanda kumar  
          87.
          TestSCMStateMachine Sub-task Open Unassigned  
          88.
          SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException Sub-task Open Rui Wang  
          89.
          Testing Infrastructure Random Failures Sub-task Open Unassigned  
          90.
          SCM HA needs handle the generation of clusterID and scmUuid in a decent way. Sub-task Open Unassigned  
          91.
          Add unit test to prove that datanode can handle term in SCMCommand properly Sub-task Open Unassigned  
          92.
          FailoverProxyProvider of SCM client should support leaderHint. Sub-task Open Rui Wang  
          93.
          Handle inflight delete/add actions in ReplicationManager properly. Sub-task Open YI-CHEN WANG  
          94.
          Use single server raft cluster in MiniOzoneCluster. Sub-task Resolved Glen Geng
          95.
          Fix set configs in SCMHAConfiguration Sub-task Resolved Rui Wang
          96.
          min/max election timeout of SCMRatisServer is not set properly. Sub-task Resolved Glen Geng  
          97.
          Solve deadlock triggered by PipelineActionHandler. Sub-task Resolved Glen Geng  
          98.
          Add term into SetNodeOperationalStateCommand. Sub-task Resolved Glen Geng  
          99.
          Fix SCMHAManagerImpl#isLeader after RATIS-1227 Sub-task Resolved Unassigned  
          100.
          Implement DB buffer in MockHAManager Sub-task Resolved Rui Wang  
          101.
          Handle backward compatibility when upgrading from non HA to HA Sub-task Open Rui Wang
          102.
          Change default SCM snapshot frequency to a lower value Sub-task Resolved Rui Wang  
          103.
          Ratis Snapshot should be loaded from the config Sub-task Resolved Rui Wang
          104.
          Implement Distributed Sequence ID Generator Sub-task Closed Glen Geng  
          105.
          Should disallow log purge before installSnapshot is implemented Sub-task Resolved Rui Wang  
          106.
          Implement InstallSnapshot for SCM HA Sub-task Open Shashikant Banerjee  
          107.
          Merge OMTransactionInfo with SCMTransactionInfo Sub-task Resolved Shashikant Banerjee  
          108.
          Disallow committing to DB by getCurrentBatchOperation() Sub-task Open Unassigned  
          109.
          Use applyTransactionSerial instead of applyTransaction Sub-task Resolved Rui Wang  
          110.
          Support encode and decode ArrayList and Long Sub-task Resolved runzhiwang  
          111.
          Replace UniqueID by the Distributed Sequence ID Generator Sub-task Resolved Rui Wang  
          112.
          Bootstrap new SCM node Sub-task Resolved Shashikant Banerjee  
          113.
          Add ratis snapshot retention policy for SCM HA Sub-task Open Shashikant Banerjee  
          114.
          Better handle the case that setting a trx that is earlier than latest trx in SCMDBTransactionBuffer Sub-task Open Rui Wang  
          115.
          Admin command should take effect on all SCM instance Sub-task Resolved Glen Geng  
          116.
          Add STOP state to SCMService. Sub-task Resolved Unassigned  
          117.
          activatePipeline/deactivatePipeline in PipelineManagerV2Impl should acquire lock before calling StateManager#updatePipelineState. Sub-task Resolved Xu Shao Hong  
          118.
          Implement increment count optimization in DeletedBlockLog V2 Sub-task Resolved Rui Wang  
          119.
          Add functionality to transfer Rocks db checkpoint from leader to follower Sub-task Resolved Shashikant Banerjee  
          120.
          Add transactionId into deletingTxIDs when remove it from DB Sub-task Resolved runzhiwang  
          121.
          Temporarily ignore failing Recon tests Sub-task Open Nanda kumar  
          122.
          Merge SCMRatisSnapshotInfo and OMRatisSnapshotInfo into a single class Sub-task Resolved Shashikant Banerjee  
          123.
          Disable Prevote in Ratis in SCM HA by default Sub-task Resolved Rui Wang  
          124.
          Fix findbugs issues after HDDS-2195 Sub-task Resolved Glen Geng  
          125.
          Fix TestContainerEndpoint after merging master to HDDS-2823. Sub-task Resolved Glen Geng  
          126.
          Fix TestReconContainerManager after merge master to HDDS-2823 Sub-task Resolved Glen Geng  
          127.
          Fix misc acceptance test: List pipelines on unknown host Sub-task Resolved Glen Geng  
          128.
          Add install checkpoint in SCMStateMachine Sub-task Resolved Shashikant Banerjee  
          129.
          Integrate DeleteBlockLog with PartialTableCache Sub-task Resolved Unassigned  
          130.
          Use OM style config to construct RaftGroup and initialize Raft Servers Sub-task Open Rui Wang  
          131.
          Add multiple SCM nodes to MiniOzoneCluster Sub-task Resolved Shashikant Banerjee  
          132.
          [SCM HA Security] Implement generate SCM certificate Sub-task Resolved Bharat Viswanadham  
          133.
          Use SCM service ID in SCMBlockClient and SCM Client Sub-task Resolved Bharat Viswanadham  
          134.
          Implement scm --bootstrap command Sub-task Resolved Shashikant Banerjee  
          135.
          Make SCM Generic config support HA Style Sub-task Resolved Bharat Viswanadham  
          136.
          Move Ratis group creation to scm --init phase Sub-task Resolved Shashikant Banerjee  
          137.
          Rename MiniOzoneHACluster to MiniOzoneOMHACluster Sub-task Resolved Mukul Kumar Singh  
          138.
          Use SCM service ID in finding SCM Datanode address. Sub-task Resolved Bharat Viswanadham  
          139.
          Make changes required for SCM admin commands to work with SCM HA Sub-task Resolved Bharat Viswanadham  
          140.
          Merge SCM HA Configuration Sub-task Open Bharat Viswanadham  
          141.
          Reopen replication/wait.robot added by HDDS-4834 Sub-task Resolved Glen Geng  
          142.
          Handle NotLeaderException with Event Queue Handlers Sub-task Open Unassigned  
          143.
          Provide docker-compose for SCM HA Sub-task Resolved Attila Doroszlai  
          144.
          Datanode with scmID format should work with clusterID directory format Sub-task Resolved Mukul Kumar Singh  
          145.
          [SCM HA Security] Implement listCertificates based on role Sub-task Resolved Bharat Viswanadham  
          146.
          [SCM HA Security] Add failover proxy to SCM Security Server Protocol Sub-task Resolved Bharat Viswanadham  
          147.
          Make SCM ratis server spin up time during initialization configurable Sub-task Resolved Jackson Yao  
          148.
          Retry policy for SCM requests over ratis Sub-task Open Shashikant Banerjee  
          149.
          Fix removing local SCM when submitting request to other SCM. Sub-task Resolved Bharat Viswanadham  
          150.
          Fix and enable TestReconTasks Sub-task Resolved Mukul Kumar Singh  
          151.
          Fix and enable TestEndpoints.java Sub-task Resolved Mukul Kumar Singh  
          152.
          SCM Ratis enable/disable switch Sub-task Resolved Shashikant Banerjee  
          153.
          Use PipelineManagerV2Impl in Recon and enable ignored Recon test cases. Sub-task Resolved Glen Geng  
          154.
          Add integration test for SequenceIdGen Sub-task Open Unassigned  
          155.
          Need a tool to upgrade current non-HA SCM node to single node HA cluster Sub-task Resolved Shashikant Banerjee  
          156.
          [SCM HA Security] Create SCM Cert Client and change DefaultCA to allow self signed and intermediary Sub-task Resolved Bharat Viswanadham  
          157.
          [SCM HA Security] Ozone services should be disabled in SCM HA enabled and security enabled cluster Sub-task Resolved Bharat Viswanadham  
          158.
          Add SCM HA to Chaos tests Sub-task Resolved Mukul Kumar Singh  
          159.
          Add SCM to Ratis Log Parser Sub-task Open Mukul Kumar Singh  
          160.
          Support inline upgrade from containerId, delTxnId, localId to SequenceIdGenerator. Sub-task Resolved Glen Geng  
          161.
          [SCM HA Security] Integrate CertClient Sub-task Resolved Bharat Viswanadham  
          162.
          refactor code in SCMStateMachine. Sub-task Resolved Glen Geng  
          163.
          NullPointerException during SCM init Sub-task Resolved Bharat Viswanadham  
          164.
          [SCM HA Security] When Ratis is enabled, SCM secure cluster is not working Sub-task Resolved Bharat Viswanadham  
          165.
          Provide example k8s files to run full HA Ozone Sub-task Resolved Marton Elek  
          166.
          Return with exit code 0 in case of optional scm bootstrap/init Sub-task Resolved Marton Elek  
          167.
          [SCM HA Security] Implement listCAs and getRootCA API Sub-task Resolved Bharat Viswanadham  
          168.
          [SCM HA Security] Make CertStore DB updates for StoreValidateCertificate go via Ratis Sub-task Resolved Bharat Viswanadham  
          169.
          [SCM HA Security] Handle leader changes during bootstrap Sub-task Resolved Bharat Viswanadham  
          170.
          Fix flaky test TestSCMInstallSnapshotWithHA#testInstallCorruptedCheckpointFailure Sub-task Resolved Shashikant Banerjee  
          171.
          Adapt admincli tests for SCM HA Sub-task Resolved Attila Doroszlai  
          172.
          Back-port HDDS-4911 (List container by container state) to ContainerManagerV2 Sub-task Resolved Jackson Yao  
          173.
          Solve IntelliJ warnings on DBTransactionBuffer. Sub-task Resolved Xu Shao Hong
          174.
          Remove SequenceIdGenerator#StateManagerImpl Sub-task Resolved Jackson Yao  
          175.
          [SCM HA Security] Make storeValidCertificate method idempotent Sub-task Resolved Bharat Viswanadham  
          176.
          [SCM HA Security] generate certserialID in distributed sequence Sub-task Open Unassigned  
          177.
          [SCM HA Security] Make changes required for ratis enabled with new model of RootCA/subCA Sub-task Resolved Bharat Viswanadham  
          178.
          [Doc] Add SCM HA Setup Doc Sub-task Resolved Marton Elek  
          179.
          localId is not consistent across SCMs when setting up a multi-node SCM HA cluster. Sub-task Resolved Glen Geng
          180.
          During bootstrap, always download checkpoint from leader SCM. Sub-task Open Unassigned  
          181.
          SCM get roles command should provide Ratis Leader/Follower information. Sub-task Resolved George Huang  
          182.
          [SCM HA Security] Make upgraded cluster to ratis enabled single node cluster Sub-task Open Bharat Viswanadham  
          183.
          SCM may not be able to know full port list of Datanode after Datanode is started. Sub-task Resolved Glen Geng  
          184.
          Merge SCM HA configs to ScmConfigKeys Sub-task Open Unassigned  
          185.
          [SCM HA Security] Handle leader changes between SCMInfo and getSCMSigned Cert in OM Sub-task Resolved Bharat Viswanadham  
          186.
          [SCM HA Security] Fix duration of sub-ca certs Sub-task Resolved Bharat Viswanadham  
          187.
          [SCM HA Security] Make InterSCM grpc channel secure Sub-task Resolved Bharat Viswanadham  
          188.
          [SCM HA Security] Remove code of not starting ozone services when Security is enabled on SCM HA cluster Sub-task Resolved Bharat Viswanadham  
          189.
          [SCM HA Security] NPE during secure SCM initialization with HA code updated to an already existing cluster Sub-task Resolved Bharat Viswanadham  
          190.
          Ensure failover to suggested leader if any for NotLeaderException Sub-task Resolved Shashikant Banerjee  
          191.
          [SCM HA Security] Enable s3 test suite for ozone-secure-ha Sub-task Resolved Bharat Viswanadham  
          192.
          make Decommission work under SCM HA. Sub-task Resolved Glen Geng  
          193.
          Use MiniOzoneHAClusterImpl in TestDecommissionAndMaintenance. Sub-task Open Glen Geng  
          194.
          [SCM HA Security] Handle bootstrap of SCM when primary SCM is down Sub-task Open Unassigned  
          195.
          Fix Install Snapshot Mechanism in SCMStateMachine Sub-task Resolved Shashikant Banerjee  
          196.
          Prioritising SCM's in SCM HA setup Sub-task Open Shashikant Banerjee  
          197.
          Add more tests for SCM Failover scenarios Sub-task Open Shashikant Banerjee  
          198.
          Divide snapshot related work into notifyInstallSnapshotFromLeader and reinitialize for SCMStateMachine. Sub-task Resolved Glen Geng  
          199.
          If primordial SCM id is set, a non-HA cluster can not be initialized. Sub-task Resolved Mukul Kumar Singh  
          200.
          Use scm#checkLeader before processing client requests Sub-task Resolved Bharat Viswanadham  
          201.
          Fix scm roles command if one of the hosts is unresolvable Sub-task Resolved Bharat Viswanadham
          202.
          For AccessControlException do not perform failover Sub-task Resolved Bharat Viswanadham  
          203.
          ozone freon randomkeys failed after leader SCM node is down Sub-task Resolved Bharat Viswanadham  
          204.
          Change default grpc and ratis ports for scm ha Sub-task Resolved Sadanand Shenoy  
          205.
          Make admin check work for SCM HA cluster Sub-task Resolved Bharat Viswanadham  
          206.
          SCM subsequent init failed when previous scm init failed Sub-task Resolved Bharat Viswanadham  
          207.
          SCM UI should have leader/follower and Primordial SCM information Sub-task Resolved Sadanand Shenoy  
          208.
          Fix Suggested leader in Client Sub-task Resolved Bharat Viswanadham  
          209.
          Wait forever to obtain CA list which is needed during OM/DN startup Sub-task Resolved Bharat Viswanadham
          210.
          SCM HA: Continuous PipelineNotFoundException seen in SCM log Sub-task Resolved Lokesh Jain  
          211.
          Fix fall back of config in SCM HA Cluster Sub-task Resolved Bharat Viswanadham  
          212.
          StorageContainerLocationProtocol api should throw SCMException Sub-task Open Unassigned  
          213.
          Handle unsecure cluster convert to secure cluster for SCM Sub-task Resolved Bharat Viswanadham  
          214.
          Add reinitialize() for SequenceIdGenerator. Sub-task Resolved Glen Geng  
          215.
          [SCM-HA] SCM start failed with PipelineNotFoundException Sub-task Resolved Shashikant Banerjee  

              People

              • Assignee: Li Cheng (licheng)
              • Reporter: Sammi Chen (Sammi)
              • Votes: 0
              • Watchers: 27

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 40m