  Apache Ozone / HDDS-2823

SCM HA Support

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: SCM HA

    Description

      OM HA is now close to feature-complete. It is time to support SCM HA so that the system has no single point of failure (SPoF).

       

      Design doc: https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing

        Sub-Tasks

          1.
          Standalone SCM RatisServer Sub-task Resolved Li Cheng
          Time Spent: 20m (Original Estimate: Not Specified)
          2.
          SCM StateMachine Sub-task Resolved Li Cheng
          Time Spent: 20m (Original Estimate: Not Specified)
          3.
          Introduce generic SCMRatisRequest and SCMRatisResponse Sub-task Resolved Nandakumar  
          4.
          SCM Invoke Handler for Ratis calls Sub-task Resolved Nandakumar  
          5.
          Refactor configuration in SCMRatisServer to Java-based configuration Sub-task Resolved Li Cheng  
          6.
          Handle AllocateContainer operation for HA Sub-task Resolved Nandakumar  
          7.
          New PipelineManager interface to persist to RatisServer Sub-task Resolved Li Cheng  
          8.
          Switch to PipelineStateManagerV2 and put PipelineFactory in PipelineManager Sub-task Resolved Li Cheng  
          9.
          Introduce SCMStateMachineHandler marker interface Sub-task Resolved Nandakumar  
          10.
          Add unit tests for new PipelineManager interface Sub-task Resolved Li Cheng  
          11.
          Add unit test for SCMRatisResponse Sub-task Resolved Li Cheng  
          12.
          Add unit test for SCMRatisRequest Sub-task Resolved Li Cheng  
          13.
          Handle inner classes in SCMRatisRequest and SCMRatisResponse Sub-task Resolved Nandakumar  
          14.
          decouple finalize and destroy pipeline Sub-task Resolved Li Cheng  
          15.
          Implement container related operations in ContainerManagerImpl Sub-task Resolved Nandakumar  
          16.
          Switch current pipeline interface to the new Replication based interface to write to Ratis Sub-task Resolved Glen Geng  
          17.
          Add isLeader check for SCM state updates Sub-task Resolved Li Cheng  
          18.
          remove the 1st edition of RatisServer of SCM HA which is copied from OM HA Sub-task Resolved Glen Geng  
          19.
          update RATIS version from 1.0.0 to 1.1.0-85281b2-SNAPSHOT Sub-task Resolved Glen Geng  
          20.
          RATIS ONE Pipeline is closed but not removed when a datanode goes stale Sub-task Resolved Glen Geng  
          21.
          Pipeline is not removed when a datanode goes stale Sub-task Resolved Glen Geng  
          22.
          Add failover proxy to SCM block protocol Sub-task Resolved Li Cheng  
          23.
          enable SCM Raft Group based on config ozone.scm.names Sub-task Resolved Glen Geng  
          24.
          CLI command to show current SCM leader and follower status Sub-task Resolved Rui Wang  
          25.
          Switch to ContainerManagerV2 Sub-task Resolved Li Cheng  
          26.
          SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine Sub-task Resolved Glen Geng  
          27.
          Handle PipelineAction and OpenPipeline from DN to SCM Sub-task Resolved Unassigned
          28.
          Make sure AllocateBlock can only be executed on leader SCM Sub-task Resolved Unassigned  
          29.
          Handle NodeReport from DN to SCMs Sub-task Resolved Unassigned  
          30.
          Handle events fired from PipelineManager to close container Sub-task Resolved Unassigned  
          31.
          Handle ContainerReport and IncrementalContainerReport Sub-task Resolved Unassigned  
          32.
          Replication can only be executed on leader Sub-task Resolved Unassigned  
          33.
          Use new ContainerManager in SCM Sub-task Resolved Nandakumar  
          34.
          Add failover proxy for SCM container client Sub-task Resolved Li Cheng  
          35.
          DN can distinguish SCMCommand from stale leader SCM Sub-task Resolved Glen Geng  
          36.
          Fix CI and test failures after force push on 2020/10/26 Sub-task Resolved Nandakumar  
          37.
          Fix TestMiniOzoneHACluster.testGetOMLeader() Sub-task Resolved Rui Wang  
          38.
          Add ReadWriteLock into PipelineStateManagerV2Impl to protect contentions between RaftServer and PipelineManager Sub-task Resolved Glen Geng  
          39.
          Need to throw an exception to trigger the FailoverProxyProvider of the SCM client Sub-task Resolved Glen Geng
          40.
          Remove checkLeader in PipelineManager. Sub-task Resolved Glen Geng  
          41.
          Add tests for replication annotation Sub-task Resolved Rui Wang  
          42.
          SCM ServiceManager Sub-task Resolved Glen Geng  
          43.
          Use getRoleInfoProto() in isLeader check Sub-task Resolved Glen Geng  
          44.
          Handle stale leader issue Sub-task Resolved Unassigned  
          45.
          Add Snapshot into new SCMRatisServer and SCMStateMachine Sub-task Resolved Rui Wang  
          46.
          SCM needs to replay RaftLog for recovery Sub-task Resolved Rui Wang  
          47.
          BackgroundPipelineCreator can only serve leader Sub-task Resolved Unassigned  
          48.
          Implement Ratis Snapshots on SCM Sub-task Resolved Rui Wang  
          49.
          DeleteBlock via Ratis in SCM HA Sub-task Resolved runzhiwang  
          50.
          Load Snapshot info upon SCM Ratis starts Sub-task Resolved Rui Wang  
          51.
          Allow Enabling Purge SCM Ratis log Sub-task Resolved Rui Wang  
          52.
          Stop BackgroundPipelineCreator when PipelineManager is closed Sub-task Resolved Rui Wang  
          53.
          SCMStateMachine::applyTransaction() should not invoke TransactionContext.getClientRequest() Sub-task Resolved Glen Geng  
          54.
          Fix SCMHAManager#getPeerIdFromRoleInfo Sub-task Resolved Glen Geng  
          55.
          Update pipeline db when pipeline state is changed Sub-task Resolved Shashikant Banerjee  
          56.
          Avoid rewriting pipeline information during PipelineStateManagerV2Impl initialization Sub-task Resolved Rui Wang  
          57.
          SCMContext Phase 1 - Raft Related Info Sub-task Resolved Glen Geng  
          58.
          SCMContext Sub-task Resolved Glen Geng  
          59.
          Handle potential data loss during ReplicationManager.handleOverReplicatedContainer() Sub-task Resolved Glen Geng  
          60.
          Refactor SCMHAManager and SCMRatisServer with RaftServer.Division Sub-task Resolved Glen Geng  
          61.
          Use OM style Configuration to initialize SCM HA Sub-task Resolved Rui Wang  
          62.
          PipelineStateManagerV2Impl#removePipeline will remove pipeline from db in case of failure Sub-task Resolved Jie Yao  
          63.
          acceptance test for SCM HA Sub-task Resolved Bharat Viswanadham  
          64.
          Use suggestedLeader for SCM failover proxy performing failover Sub-task Resolved Unassigned  
          65.
          Bootstrap SCM HA Security Sub-task Resolved Bharat Viswanadham  
          66.
          Use single server raft cluster in MiniOzoneCluster. Sub-task Resolved Glen Geng
          67.
          Fix set configs in SCMHAConfiguration Sub-task Resolved Rui Wang
          68.
          min/max election timeout of SCMRatisServer is not set properly. Sub-task Resolved Glen Geng  
          69.
          Solve deadlock triggered by PipelineActionHandler. Sub-task Resolved Glen Geng  
          70.
          Add term into SetNodeOperationalStateCommand. Sub-task Resolved Glen Geng  
          71.
          Fix SCMHAManagerImpl#isLeader after RATIS-1227 Sub-task Resolved Unassigned  
          72.
          Implement DB buffer in MockHAManager Sub-task Resolved Rui Wang  
          73.
          Change default SCM snapshot frequency to a lower value Sub-task Resolved Rui Wang  
          74.
          Ratis Snapshot should be loaded from the config Sub-task Resolved Rui Wang
          75.
          Implement Distributed Sequence ID Generator Sub-task Closed Glen Geng  
          76.
          replace scmID with clusterID for container and volume at Datanode side Sub-task Resolved Glen Geng  
          77.
          Fix Recon after HDDS-4133 Sub-task Resolved Nandakumar  
          78.
          Should disallow log purge before installSnapshot is implemented Sub-task Resolved Rui Wang  
          79.
          Backport updates from PipelineManager(V1) Sub-task Resolved Unassigned  
          80.
          Handle pipeline reports Sub-task Resolved Unassigned  
          81.
          Handle ContainerAction and CloseContainer Sub-task Resolved Unassigned  
          82.
          Provide docker-compose for SCM HA Sub-task Resolved Unassigned  
          83.
          SafeMode exit rule for all SCMs Sub-task Resolved Swaminathan Balachandran  
          84.
          Use applyTransactionSerial instead of applyTransaction Sub-task Resolved Rui Wang  
          85.
          Merge OMTransactionInfo with SCMTransactionInfo Sub-task Resolved Shashikant Banerjee  
          86.
          Support encode and decode ArrayList and Long Sub-task Resolved runzhiwang  
          87.
          Replace UniqueID by the Distributed Sequence ID Generator Sub-task Resolved Rui Wang  
          88.
          Bootstrap new SCM node Sub-task Resolved Shashikant Banerjee  
          89.
          Admin command should take effect on all SCM instance Sub-task Resolved Glen Geng  
          90.
          Add STOP state to SCMService. Sub-task Resolved Unassigned  
          91.
          activatePipeline/deactivatePipeline in PipelineManagerV2Impl should acquire lock before calling StateManager#updatePipelineState. Sub-task Resolved Xu Shao Hong  
          92.
          Add functionality to transfer Rocks db checkpoint from leader to follower Sub-task Resolved Shashikant Banerjee  
          93.
          Implement increment count optimization in DeletedBlockLog V2 Sub-task Resolved Rui Wang  
          94.
          Add transactionId into deletingTxIDs when remove it from DB Sub-task Resolved runzhiwang  
          95.
          Merge SCMRatisSnapshotInfo and OMRatisSnapshotInfo into a single class Sub-task Resolved Shashikant Banerjee  
          96.
          Disable Prevote in Ratis in SCM HA by default Sub-task Resolved Rui Wang  
          97.
          Fix findbugs issues after HDDS-2195 Sub-task Resolved Glen Geng  
          98.
          Fix TestContainerEndpoint after merging master to HDDS-2823. Sub-task Resolved Glen Geng  
          99.
          Add install checkpoint in SCMStateMachine Sub-task Resolved Shashikant Banerjee  
          100.
          Fix misc acceptance test: List pipelines on unknown host Sub-task Resolved Glen Geng  
          101.
          Fix TestReconContainerManager after merge master to HDDS-2823 Sub-task Resolved Glen Geng  
          102.
          Integrate DeleteBlockLog with PartialTableCache Sub-task Resolved Unassigned  
          103.
          Add multiple SCM nodes to MiniOzoneCluster Sub-task Resolved Shashikant Banerjee  
          104.
          [SCM HA Security] Implement generate SCM certificate Sub-task Resolved Bharat Viswanadham  
          105.
          Use SCM service ID in SCMBlockClient and SCM Client Sub-task Resolved Bharat Viswanadham  
          106.
          Implement scm --bootstrap command Sub-task Resolved Shashikant Banerjee  
          107.
          Make SCM Generic config support HA Style Sub-task Resolved Bharat Viswanadham  
          108.
          Move Ratis group creation to scm --init phase Sub-task Resolved Shashikant Banerjee  
          109.
          Rename MiniOzoneHACluster to MiniOzoneOMHACluster Sub-task Resolved Mukul Kumar Singh  
          110.
          Use SCM service ID in finding SCM Datanode address. Sub-task Resolved Bharat Viswanadham  
          111.
          Make changes required for SCM admin commands to work with SCM HA Sub-task Resolved Bharat Viswanadham  
          112.
          Reopen replication/wait.robot added by HDDS-4834 Sub-task Resolved Glen Geng  
          113.
          Provide docker-compose for SCM HA Sub-task Resolved Attila Doroszlai  
          114.
          Datanode with scmID format should work with clusterID directory format Sub-task Resolved Mukul Kumar Singh  
          115.
          [SCM HA Security] Implement listCertificates based on role Sub-task Resolved Bharat Viswanadham  
          116.
          [SCM HA Security] Add failover proxy to SCM Security Server Protocol Sub-task Resolved Bharat Viswanadham  
          117.
          Make SCM ratis server spin up time during initialization configurable Sub-task Resolved Jie Yao  
          118.
          Fix removing local SCM when submitting request to other SCM. Sub-task Resolved Bharat Viswanadham  
          119.
          Fix and enable TestReconTasks Sub-task Resolved Mukul Kumar Singh  
          120.
          Fix and enable TestEndpoints.java Sub-task Resolved Mukul Kumar Singh  
          121.
          SCM Ratis enable/disable switch Sub-task Resolved Shashikant Banerjee  
          122.
          Use PipelineManagerV2Impl in Recon and enable ignored Recon test cases. Sub-task Resolved Glen Geng  
          123.
          Need a tool to upgrade current non-HA SCM node to single node HA cluster Sub-task Resolved Shashikant Banerjee  
          124.
          [SCM HA Security] Create SCM Cert Client and change DefaultCA to allow self signed and intermediary Sub-task Resolved Bharat Viswanadham  
          125.
          [SCM HA Security] Ozone services should be disabled in SCM HA enabled and security enabled cluster Sub-task Resolved Bharat Viswanadham  
          126.
          Add SCM HA to Chaos tests Sub-task Resolved Mukul Kumar Singh  
          127.
          Support inline upgrade from containerId, delTxnId, localId to SequenceIdGenerator. Sub-task Resolved Glen Geng  
          128.
          [SCM HA Security] Integrate CertClient Sub-task Resolved Bharat Viswanadham  
          129.
          refactor code in SCMStateMachine. Sub-task Resolved Glen Geng  
          130.
          NullPointerException during SCM init Sub-task Resolved Bharat Viswanadham  
          131.
          [SCM HA Security] When Ratis is enabled, SCM secure cluster is not working Sub-task Resolved Bharat Viswanadham  
          132.
          Provide example k8s files to run full HA Ozone Sub-task Resolved Marton Elek  
          133.
          Return with exit code 0 in case of optional scm bootstrap/init Sub-task Resolved Marton Elek  
          134.
          [SCM HA Security] Implement listCAs and getRootCA API Sub-task Resolved Bharat Viswanadham  
          135.
          [SCM HA Security] Make CertStore DB updates for StoreValidateCertificate go via Ratis Sub-task Resolved Bharat Viswanadham  
          136.
          [SCM HA Security] Handle leader changes during bootstrap Sub-task Resolved Bharat Viswanadham  
          137.
          Fix flaky test TestSCMInstallSnapshotWithHA#testInstallCorruptedCheckpointFailure Sub-task Resolved Shashikant Banerjee  
          138.
          Adapt admincli tests for SCM HA Sub-task Resolved Attila Doroszlai  
          139.
          Back-port HDDS-4911 (List container by container state) to ContainerManagerV2 Sub-task Resolved Jie Yao  
          140.
          Solve IntelliJ warnings on DBTransactionBuffer. Sub-task Resolved Xu Shao Hong
          141.
          Remove SequenceIdGenerator#StateManagerImpl Sub-task Resolved Jie Yao  
          142.
          [SCM HA Security] Make storeValidCertificate method idempotent Sub-task Resolved Bharat Viswanadham  
          143.
          [SCM HA Security] Make changes required for ratis enabled with new model of RootCA/subCA Sub-task Resolved Bharat Viswanadham  
          144.
          [Doc] Add SCM HA Setup Doc Sub-task Resolved Marton Elek  
          145.
          localId is not consistent across SCMs when setting up a multi-node SCM HA cluster. Sub-task Resolved Glen Geng
          146.
          SCM get roles command should provide Ratis Leader/Follower information. Sub-task Resolved George Huang  
          147.
          SCM may not be able to know full port list of Datanode after Datanode is started. Sub-task Resolved Glen Geng  
          148.
          Merge SCM HA configs to ScmConfigKeys Sub-task Resolved Aswin Shakil  
          149.
          [SCM HA Security] Handle leader changes between SCMInfo and getSCMSigned Cert in OM Sub-task Resolved Bharat Viswanadham  
          150.
          [SCM HA Security] Fix duration of sub-ca certs Sub-task Resolved Bharat Viswanadham  
          151.
          [SCM HA Security] Make InterSCM grpc channel secure Sub-task Resolved Bharat Viswanadham  
          152.
          [SCM HA Security] Remove code of not starting ozone services when Security is enabled on SCM HA cluster Sub-task Resolved Bharat Viswanadham  
          153.
          [SCM HA Security] NPE during secure SCM initialization with HA code updated to an already existing cluster Sub-task Resolved Bharat Viswanadham  
          154.
          Ensure failover to suggested leader if any for NotLeaderException Sub-task Resolved Shashikant Banerjee  
          155.
          [SCM HA Security] Enable s3 test suite for ozone-secure-ha Sub-task Resolved Bharat Viswanadham  
          156.
          make Decommission work under SCM HA. Sub-task Resolved Glen Geng  
          157.
          Fix Install Snapshot Mechanism in SCMStateMachine Sub-task Resolved Shashikant Banerjee  
          158.
          Divide snapshot related work into notifyInstallSnapshotFromLeader and reinitialize for SCMStateMachine. Sub-task Resolved Glen Geng  
          159.
          If primordial SCM id is set, a non-HA cluster can not be initialized. Sub-task Resolved Mukul Kumar Singh  
          160.
          Use scm#checkLeader before processing client requests Sub-task Resolved Bharat Viswanadham  
          161.
          Fix scm roles command if one of the hosts is unresolvable Sub-task Resolved Bharat Viswanadham
          162.
          For AccessControlException do not perform failover Sub-task Resolved Bharat Viswanadham  
          163.
          ozone freon randomkeys failed after leader SCM node is down Sub-task Resolved Bharat Viswanadham  
          164.
          Change default grpc and ratis ports for scm ha Sub-task Resolved Sadanand Shenoy  
          165.
          Make admin check work for SCM HA cluster Sub-task Resolved Bharat Viswanadham  
          166.
          SCM subsequent init failed when previous scm init failed Sub-task Resolved Bharat Viswanadham  
          167.
          SCM UI should have leader/follower and Primordial SCM information Sub-task Resolved Sadanand Shenoy  
          168.
          Fix Suggested leader in Client Sub-task Resolved Bharat Viswanadham  
          169.
          Wait forever to obtain CA list which is needed during OM/DN startup Sub-task Resolved Bharat Viswanadham
          170.
          SCM HA: Continuous PipelineNotFoundException seen in SCM log Sub-task Resolved Lokesh Jain  
          171.
          Fix fall back of config in SCM HA Cluster Sub-task Resolved Bharat Viswanadham  
          172.
          Handle unsecure cluster convert to secure cluster for SCM Sub-task Resolved Bharat Viswanadham  
          173.
          Add reinitialize() for SequenceIdGenerator. Sub-task Resolved Glen Geng  
          174.
          [SCM-HA] SCM start failed with PipelineNotFoundException Sub-task Resolved Shashikant Banerjee  
          175.
          [SCM-HA] SCM start failed with PipelineNotFoundException Sub-task Resolved Shashikant Banerjee  
          176.
          Use OM style config to construct RaftGroup and initialize Raft Servers Sub-task Resolved Rui Wang  
          177.
          [SCM HA Security] generate certserialID in distributed sequence Sub-task Resolved Ritesh Shukla  
          178.
          remove scm from SCM HA group Sub-task Resolved Unassigned  
          179.
          Add ratis metric for scm Sub-task Resolved Xu Shao Hong  
          180.
          For any IOException from @Replicated method we should throw it Sub-task Resolved Jie Yao
          181.
          use `Fileutils.move` instead of `Files.move` when installing snapshot Sub-task Resolved Jie Yao  
          182.
          terminate om if statemachine is shut down by ratis Sub-task Resolved Jie Yao  
          183.
          [Doc] Update OM HA Setup Doc Sub-task Resolved Navin Kumar  
          184.
          Temporarily ignore failing Recon tests Sub-task Resolved Nandakumar  
          185.
          Backport updates from ContainerManager(V1) Sub-task Resolved Unassigned  

            People

              Assignee: Li Cheng (licheng)
              Reporter: Sammi Chen (Sammi)
              Votes: 0
              Watchers: 28

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m