Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.4.0
    • SCM HA

    Description

      OM HA is close to feature complete now. It's time to support SCM HA, to make sure there is no SPoF in the system.

       

      Design doc: https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing

      Attachments

        Issue Links

          1.
          Standalone SCM RatisServer Sub-task Resolved Li Cheng

          100%

          Original Estimate - Not Specified
          Time Spent - 20m
          2.
          SCM StateMachine Sub-task Resolved Li Cheng

          100%

          Original Estimate - Not Specified
          Time Spent - 20m
          3.
          Introduce generic SCMRatisRequest and SCMRatisResponse Sub-task Resolved Nandakumar  
          4.
          SCM Invoke Handler for Ratis calls Sub-task Resolved Nandakumar  
          5.
          Refactor configuration in SCMRatisServer to Java-based configuration Sub-task Resolved Li Cheng  
          6.
          Handle AllocateContainer operation for HA Sub-task Resolved Nandakumar  
          7.
          New PipelineManager interface to persist to RatisServer Sub-task Resolved Li Cheng  
          8.
          Switch to PipelineStateManagerV2 and put PipelineFactory in PipelineManager Sub-task Resolved Li Cheng  
          9.
          Introduce SCMStateMachineHandler marker interface Sub-task Resolved Nandakumar  
          10.
          Add unit tests for new PipelineManager interface Sub-task Resolved Li Cheng  
          11.
          Add unit test for SCMRatisResponse Sub-task Resolved Li Cheng  
          12.
          Add unit test for SCMRatisRequest Sub-task Resolved Li Cheng  
          13.
          Handle inner classes in SCMRatisRequest and SCMRatisResponse Sub-task Resolved Nandakumar  
          14.
          decouple finalize and destroy pipeline Sub-task Resolved Li Cheng  
          15.
          Implement container related operations in ContainerManagerImpl Sub-task Resolved Nandakumar  
          16.
          Switch current pipeline interface to the new Replication based interface to write to Ratis Sub-task Resolved Glen Geng  
          17.
          Add isLeader check for SCM state updates Sub-task Resolved Li Cheng  
          18.
          remove the 1st edition of RatisServer of SCM HA which is copied from OM HA Sub-task Resolved Glen Geng  
          19.
          update RATIS version from 1.0.0 to 1.1.0-85281b2-SNAPSHOT Sub-task Resolved Glen Geng  
          20.
          RATIS ONE Pipeline is closed but not removed when a datanode goes stale Sub-task Resolved Glen Geng  
          21.
          Pipeline is not removed when a datanode goes stale Sub-task Resolved Glen Geng  
          22.
          Add failover proxy to SCM block protocol Sub-task Resolved Li Cheng  
          23.
          enable SCM Raft Group based on config ozone.scm.names Sub-task Resolved Glen Geng  
          24.
          CLI command to show current SCM leader and follower status Sub-task Resolved Rui Wang  
          25.
          Switch to ContainerManagerV2 Sub-task Resolved Li Cheng  
          26.
          SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine Sub-task Resolved Glen Geng  
          27.
          Handle PipelineAction and OpenPipeline from DN to SCM Sub-task Resolved Unassigned
          28.
          Make sure AllocateBlock can only be executed on leader SCM Sub-task Resolved Unassigned  
          29.
          Handle NodeReport from DN to SCMs Sub-task Resolved Unassigned  
          30.
          Handle events fired from PipelineManager to close container Sub-task Resolved Unassigned  
          31.
          Handle ContainerReport and IncrementalContainerReport Sub-task Resolved Unassigned  
          32.
          Replication can only be executed on leader Sub-task Resolved Unassigned  
          33.
          Use new ContainerManager in SCM Sub-task Resolved Nandakumar  
          34.
          Add failover proxy for SCM container client Sub-task Resolved Li Cheng  
          35.
          DN can distinguish SCMCommand from stale leader SCM Sub-task Resolved Glen Geng  
          36.
          Fix CI and test failures after force push on 2020/10/26 Sub-task Resolved Nandakumar  
          37.
          Fix TestMiniOzoneHACluster.testGetOMLeader() Sub-task Resolved Rui Wang  
          38.
          Add ReadWriteLock into PipelineStateManagerV2Impl to protect contentions between RaftServer and PipelineManager Sub-task Resolved Glen Geng  
          39.
          Need throw exception to trigger FailoverProxyProvider of SCM client to work Sub-task Resolved Glen Geng  
          40.
          Remove checkLeader in PipelineManager. Sub-task Resolved Glen Geng  
          41.
          Add tests for replication annotation Sub-task Resolved Rui Wang  
          42.
          SCM ServiceManager Sub-task Resolved Glen Geng  
          43.
          Use getRoleInfoProto() in isLeader check Sub-task Resolved Glen Geng  
          44.
          Handle stale leader issue Sub-task Resolved Unassigned  
          45.
          Add Snapshot into new SCMRatisServer and SCMStateMachine Sub-task Resolved Rui Wang  
          46.
          SCM needs to replay RaftLog for recovery Sub-task Resolved Rui Wang  
          47.
          BackgroundPipelineCreator can only serve leader Sub-task Resolved Unassigned  
          48.
          Implement Ratis Snapshots on SCM Sub-task Resolved Rui Wang  
          49.
          DeleteBlock via Ratis in SCM HA Sub-task Resolved runzhiwang  
          50.
          Load Snapshot info upon SCM Ratis starts Sub-task Resolved Rui Wang  
          51.
          Allow Enabling Purge SCM Ratis log Sub-task Resolved Rui Wang  
          52.
          Stop BackgroundPipelineCreator when PipelineManager is closed Sub-task Resolved Rui Wang  
          53.
          SCMStateMachine::applyTransaction() should not invoke TransactionContext.getClientRequest() Sub-task Resolved Glen Geng  
          54.
          Fix SCMHAManager#getPeerIdFromRoleInfo Sub-task Resolved Glen Geng  
          55.
          Update pipeline db when pipeline state is changed Sub-task Resolved Shashikant Banerjee  
          56.
          Avoid rewriting pipeline information during PipelineStateManagerV2Impl initialization Sub-task Resolved Rui Wang  
          57.
          SCMContext Phase 1 - Raft Related Info Sub-task Resolved Glen Geng  
          58.
          SCMContext Sub-task Resolved Glen Geng  
          59.
          Handle potential data loss during ReplicationManager.handleOverReplicatedContainer() Sub-task Resolved Glen Geng  
          60.
          Refactor SCMHAManager and SCMRatisServer with RaftServer.Division Sub-task Resolved Glen Geng  
          61.
          Use OM style Configuration to initialize SCM HA Sub-task Resolved Rui Wang  
          62.
          PipelineStateManagerV2Impl#removePipeline will remove pipeline from db in case of failure Sub-task Resolved Jie Yao  
          63.
          acceptance test for SCM HA Sub-task Resolved Bharat Viswanadham  
          64.
          Use suggestedLeader for SCM failover proxy performing failover Sub-task Resolved Unassigned  
          65.
          Bootstrap SCM HA Security Sub-task Resolved Bharat Viswanadham  
          66.
          Use single server raft cluster in MiniOzoneCluster. Sub-task Resolved Glen Geng
          67.
          Fix set configs in SCMHAConfiguration Sub-task Resolved Rui Wang
          68.
          min/max election timeout of SCMRatisServer is not set properly. Sub-task Resolved Glen Geng  
          69.
          Solve deadlock triggered by PipelineActionHandler. Sub-task Resolved Glen Geng  
          70.
          Add term into SetNodeOperationalStateCommand. Sub-task Resolved Glen Geng  
          71.
          Fix SCMHAManagerImpl#isLeader after RATIS-1227 Sub-task Resolved Unassigned  
          72.
          Implement DB buffer in MockHAManager Sub-task Resolved Rui Wang  
          73.
          Change default SCM snapshot frequency to a lower value Sub-task Resolved Rui Wang  
          74.
          Ratis Snapshot should be loaded from the config Sub-task Resolved Rui Wang
          75.
          Implement Distributed Sequence ID Generator Sub-task Closed Glen Geng  
          76.
          replace scmID with clusterID for container and volume at Datanode side Sub-task Resolved Glen Geng  
          77.
          Fix Recon after HDDS-4133 Sub-task Resolved Nandakumar  
          78.
          Should disallow log purge before installSnapshot is implemented Sub-task Resolved Rui Wang  
          79.
          Backport updates from PipelineManager(V1) Sub-task Resolved Unassigned  
          80.
          Handle pipeline reports Sub-task Resolved Unassigned  
          81.
          Handle ContainerAction and CloseContainer Sub-task Resolved Unassigned  
          82.
          Provide docker-compose for SCM HA Sub-task Resolved Unassigned  
          83.
          SafeMode exit rule for all SCMs Sub-task Resolved Swaminathan Balachandran  
          84.
          Use applyTransactionSerial instead of applyTransaction Sub-task Resolved Rui Wang  
          85.
          Merge OMTransactionInfo with SCMTransactionInfo Sub-task Resolved Shashikant Banerjee  
          86.
          Support encode and decode ArrayList and Long Sub-task Resolved runzhiwang  
          87.
          Replace UniqueID by the Distributed Sequence ID Generator Sub-task Resolved Rui Wang  
          88.
          Bootstrap new SCM node Sub-task Resolved Shashikant Banerjee  
          89.
          Admin command should take effect on all SCM instance Sub-task Resolved Glen Geng  
          90.
          Add STOP state to SCMService. Sub-task Resolved Unassigned  
          91.
          activatePipeline/deactivatePipeline in PipelineManagerV2Impl should acquire lock before calling StateManager#updatePipelineState. Sub-task Resolved Xu Shao Hong  
          92.
          Add functionality to transfer Rocks db checkpoint from leader to follower Sub-task Resolved Shashikant Banerjee  
          93.
          Implement increment count optimization in DeletedBlockLog V2 Sub-task Resolved Rui Wang  
          94.
          Add transactionId into deletingTxIDs when remove it from DB Sub-task Resolved runzhiwang  
          95.
          Merge SCMRatisSnapshotInfo and OMRatisSnapshotInfo into a single class Sub-task Resolved Shashikant Banerjee  
          96.
          Disable Prevote in Ratis in SCM HA by default Sub-task Resolved Rui Wang  
          97.
          Fix findbugs issues after HDDS-2195 Sub-task Resolved Glen Geng  
          98.
          Fix TestContainerEndpoint after merging master to HDDS-2823. Sub-task Resolved Glen Geng  
          99.
          Add install checkpoint in SCMStateMachine Sub-task Resolved Shashikant Banerjee  
          100.
          Fix misc acceptance test: List pipelines on unknown host Sub-task Resolved Glen Geng  
          101.
          Fix TestReconContainerManager after merge master to HDDS-2823 Sub-task Resolved Glen Geng  
          102.
          Integrate DeleteBlockLog with PartialTableCache Sub-task Resolved Unassigned  
          103.
          Add multiple SCM nodes to MiniOzoneCluster Sub-task Resolved Shashikant Banerjee  
          104.
          [SCM HA Security] Implement generate SCM certificate Sub-task Resolved Bharat Viswanadham  
          105.
          Use SCM service ID in SCMBlockClient and SCM Client Sub-task Resolved Bharat Viswanadham  
          106.
          Implement scm --bootstrap command Sub-task Resolved Shashikant Banerjee  
          107.
          Make SCM Generic config support HA Style Sub-task Resolved Bharat Viswanadham  
          108.
          Move Ratis group creation to scm --init phase Sub-task Resolved Shashikant Banerjee  
          109.
          Rename MiniOzoneHACluster to MiniOzoneOMHACluster Sub-task Resolved Mukul Kumar Singh  
          110.
          Use SCM service ID in finding SCM Datanode address. Sub-task Resolved Bharat Viswanadham  
          111.
          Make changes required for SCM admin commands to work with SCM HA Sub-task Resolved Bharat Viswanadham  
          112.
          Reopen replication/wait.robot added by HDDS-4834 Sub-task Resolved Glen Geng  
          113.
          Provide docker-compose for SCM HA Sub-task Resolved Attila Doroszlai  
          114.
          Datanode with scmID format should work with clusterID directory format Sub-task Resolved Mukul Kumar Singh  
          115.
          [SCM HA Security] Implement listCertificates based on role Sub-task Resolved Bharat Viswanadham  
          116.
          [SCM HA Security] Add failover proxy to SCM Security Server Protocol Sub-task Resolved Bharat Viswanadham  
          117.
          Make SCM ratis server spin up time during initialization configurable Sub-task Resolved Jie Yao  
          118.
          Fix removing local SCM when submitting request to other SCM. Sub-task Resolved Bharat Viswanadham  
          119.
          Fix and enable TestReconTasks Sub-task Resolved Mukul Kumar Singh  
          120.
          Fix and enable TestEndpoints.java Sub-task Resolved Mukul Kumar Singh  
          121.
          SCM Ratis enable/disable switch Sub-task Resolved Shashikant Banerjee  
          122.
          Use PipelineManagerV2Impl in Recon and enable ignored Recon test cases. Sub-task Resolved Glen Geng  
          123.
          Need a tool to upgrade current non-HA SCM node to single node HA cluster Sub-task Resolved Shashikant Banerjee  
          124.
          [SCM HA Security] Create SCM Cert Client and change DefaultCA to allow self signed and intermediary Sub-task Resolved Bharat Viswanadham  
          125.
          [SCM HA Security] Ozone services should be disabled in SCM HA enabled and security enabled cluster Sub-task Resolved Bharat Viswanadham  
          126.
          Add SCM HA to Chaos tests Sub-task Resolved Mukul Kumar Singh  
          127.
          Support inline upgrade from containerId, delTxnId, localId to SequenceIdGenerator. Sub-task Resolved Glen Geng  
          128.
          [SCM HA Security] Integrate CertClient Sub-task Resolved Bharat Viswanadham  
          129.
          refactor code in SCMStateMachine. Sub-task Resolved Glen Geng  
          130.
          NullPointerException during SCM init Sub-task Resolved Bharat Viswanadham  
          131.
          [SCM HA Security] When Ratis is enabled, SCM secure cluster is not working Sub-task Resolved Bharat Viswanadham  
          132.
          Provide example k8s files to run full HA Ozone Sub-task Resolved Marton Elek  
          133.
          Return with exit code 0 in case of optional scm bootstrap/init Sub-task Resolved Marton Elek  
          134.
          [SCM HA Security] Implement listCAs and getRootCA API Sub-task Resolved Bharat Viswanadham  
          135.
          [SCM HA Security] Make CertStore DB updates for StoreValidateCertificate go via Ratis Sub-task Resolved Bharat Viswanadham  
          136.
          [SCM HA Security] Handle leader changes during bootstrap Sub-task Resolved Bharat Viswanadham  
          137.
          Fix flaky test TestSCMInstallSnapshotWithHA#testInstallCorruptedCheckpointFailure Sub-task Resolved Shashikant Banerjee  
          138.
          Adapt admincli tests for SCM HA Sub-task Resolved Attila Doroszlai  
          139.
          Back-port HDDS-4911 (List container by container state) to ContainerManagerV2 Sub-task Resolved Jie Yao  
          140.
          Solve IntelliJ warnings on DBTransactionBuffer. Sub-task Resolved Xu Shao Hong
          141.
          Remove SequenceIdGenerator#StateManagerImpl Sub-task Resolved Jie Yao  
          142.
          [SCM HA Security] Make storeValidCertificate method idempotent Sub-task Resolved Bharat Viswanadham  
          143.
          [SCM HA Security] Make changes required for ratis enabled with new model of RootCA/subCA Sub-task Resolved Bharat Viswanadham  
          144.
          [Doc] Add SCM HA Setup Doc Sub-task Resolved Marton Elek  
          145.
          localId is not consistent across SCMs when setup a multi node SCM HA cluster. Sub-task Resolved Glen Geng  
          146.
          SCM get roles command should provide Ratis Leader/Follower information. Sub-task Resolved George Huang  
          147.
          SCM may not be able to know full port list of Datanode after Datanode is started. Sub-task Resolved Glen Geng  
          148.
          Merge SCM HA configs to ScmConfigKeys Sub-task Resolved Aswin Shakil  
          149.
          [SCM HA Security] Handle leader changes between SCMInfo and getSCMSigned Cert in OM Sub-task Resolved Bharat Viswanadham  
          150.
          [SCM HA Security] Fix duration of sub-ca certs Sub-task Resolved Bharat Viswanadham  
          151.
          [SCM HA Security] Make InterSCM grpc channel secure Sub-task Resolved Bharat Viswanadham  
          152.
          [SCM HA Security] Remove code of not starting ozone services when Security is enabled on SCM HA cluster Sub-task Resolved Bharat Viswanadham  
          153.
          [SCM HA Security] NPE during secure SCM initialization with HA code updated to an already existing cluster Sub-task Resolved Bharat Viswanadham  
          154.
          Ensure failover to suggested leader if any for NotLeaderException Sub-task Resolved Shashikant Banerjee  
          155.
          [SCM HA Security] Enable s3 test suite for ozone-secure-ha Sub-task Resolved Bharat Viswanadham  
          156.
          make Decommission work under SCM HA. Sub-task Resolved Glen Geng  
          157.
          Fix Install Snapshot Mechanism in SCMStateMachine Sub-task Resolved Shashikant Banerjee  
          158.
          Divide snapshot related work into notifyInstallSnapshotFromLeader and reinitialize for SCMStateMachine. Sub-task Resolved Glen Geng  
          159.
          If primordial SCM id is set, a non-HA cluster can not be initialized. Sub-task Resolved Mukul Kumar Singh  
          160.
          Use scm#checkLeader before processing client requests Sub-task Resolved Bharat Viswanadham  
          161.
          Fix scm roles command if one of the host is unresolvable Sub-task Resolved Bharat Viswanadham  
          162.
          For AccessControlException do not perform failover Sub-task Resolved Bharat Viswanadham  
          163.
          ozone freon randomkeys failed after leader SCM node is down Sub-task Resolved Bharat Viswanadham  
          164.
          Change default grpc and ratis ports for scm ha Sub-task Resolved Sadanand Shenoy  
          165.
          Make admin check work for SCM HA cluster Sub-task Resolved Bharat Viswanadham  
          166.
          SCM subsequent init failed when previous scm init failed Sub-task Resolved Bharat Viswanadham  
          167.
          SCM UI should have leader/follower and Primordial SCM information Sub-task Resolved Sadanand Shenoy  
          168.
          Fix Suggested leader in Client Sub-task Resolved Bharat Viswanadham  
          169.
          Wait for ever to obtain CA list which is needed during OM/DN startup Sub-task Resolved Bharat Viswanadham  
          170.
          SCM HA: Continuous PipelineNotFoundException seen in SCM log Sub-task Resolved Lokesh Jain  
          171.
          Fix fall back of config in SCM HA Cluster Sub-task Resolved Bharat Viswanadham  
          172.
          Handle unsecure cluster convert to secure cluster for SCM Sub-task Resolved Bharat Viswanadham  
          173.
          Add reinitialize() for SequenceIdGenerator. Sub-task Resolved Glen Geng  
          174.
          [SCM-HA] SCM start failed with PipelineNotFoundException Sub-task Resolved Shashikant Banerjee  
          175.
          [SCM-HA] SCM start failed with PipelineNotFoundException Sub-task Resolved Shashikant Banerjee  
          176.
          Use OM style config to construct RaftGroup and initialize Raft Servers Sub-task Resolved Rui Wang  
          177.
          [SCM HA Security] generate certserialID in distributed sequence Sub-task Resolved Ritesh Shukla  
          178.
          remove scm from SCM HA group Sub-task Resolved Unassigned  
          179.
          Add ratis metric for scm Sub-task Resolved Xu Shao Hong  
          180.
          For any IOexception from @Replicated method we should throw it Sub-task Resolved Jie Yao  
          181.
          use `Fileutils.move` instead of `Files.move` when installing snapshot Sub-task Resolved Jie Yao  
          182.
          terminate om if statemachine is shut down by ratis Sub-task Resolved Jie Yao  
          183.
          [Doc] Update OM HA Setup Doc Sub-task Resolved Navin Kumar  
          184.
          Temporarily ignore failing Recon tests Sub-task Resolved Nandakumar  
          185.
          Backport updates from ContainerManager(V1) Sub-task Resolved Unassigned  

          Activity

            cxorm Yi-Sheng Lien added a comment -

            Hi Sammi, thanks for creating this Jira.
            Could you share some documents about this idea?

            Sammi Sammi Chen added a comment - - edited

            The design document is in draft.

            timmylicheng Li Cheng added a comment - - edited

            The design doc at https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing is in draft and is nearly ready for a wide review.
            ppogde Prashant Pogde added a comment -

            I am managing the 1.1.0 release and we currently have more than 600 issues targeted for 1.1.0. I am moving the target field to 1.2.0.

            If you are actively working on this Jira and believe it should be targeted to the 1.1.0 release, please change the target field back to 1.1.0 before Feb 05, 2021.

            jacksonyao Jie Yao added a comment -

            I think there is more important work to do on SCM HA, for example adding and deleting an SCM. Are we going to work on these later? shashikant bharat

            msingh Mukul Kumar Singh added a comment -

            Yes, we plan to work on it. cc: nanda
            jacksonyao Jie Yao added a comment -

            msingh nanda Are we going to start the follow-up work on SCM HA soon? If there are any sub-tasks, I would be happy to take some.

            erose Ethan Rose added a comment -

            I am managing the 1.2.0 release and we currently have more than 600 issues targeted for 1.2.0. I am moving the target field to 1.3.0.

            If you are actively working on this Jira and believe it should be targeted for the 1.2.0 release, please reach out to me via Apache email or Slack.

            micahzhao mingchao zhao added a comment -

            Ozone 1.3.0 has been released and we currently have more than 600 open issues targeted for 1.3.0. I am moving the target field to 1.4.0.

            If anything needs to be discussed about the Target Version, please reach out to me via Apache email or Slack.

            nanda Nandakumar added a comment -

            This feature is merged to the master branch.
            The pending tasks will be tracked under HDDS-7823.

            People

              licheng Li Cheng
              Sammi Sammi Chen
              Votes: 0
              Watchers: 28

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m