Apache Ozone / HDDS-2823

SCM HA Support


    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SCM HA
    • Target Version/s:

      Description

      OM HA is now close to feature complete. It is time to support SCM HA so that the system has no single point of failure (SPoF).

       

      Design doc: https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing

        Sub-Tasks

        1. Standalone SCM RatisServer (Sub-task, Resolved, Li Cheng; 100% complete, time spent 20m)
        2. SCM StateMachine (Sub-task, Resolved, Li Cheng; 100% complete, time spent 20m)
        3. Introduce generic SCMRatisRequest and SCMRatisResponse (Sub-task, Resolved, Nanda kumar)
        4. SCM Invoke Handler for Ratis calls (Sub-task, Resolved, Nanda kumar)
        5. Refactor configuration in SCMRatisServer to Java-based configuration (Sub-task, Resolved, Li Cheng)
        6. Handle AllocateContainer operation for HA (Sub-task, Resolved, Nanda kumar)
        7. New PipelineManager interface to persist to RatisServer (Sub-task, Resolved, Li Cheng)
        8. Switch to PipelineStateManagerV2 and put PipelineFactory in PipelineManager (Sub-task, Resolved, Li Cheng)
        9. Introduce SCMStateMachineHandler marker interface (Sub-task, Resolved, Nanda kumar)
        10. Add unit tests for new PipelineManager interface (Sub-task, Resolved, Li Cheng)
        11. Add unit test for SCMRatisResponse (Sub-task, Resolved, Li Cheng)
        12. Add unit test for SCMRatisRequest (Sub-task, Resolved, Li Cheng)
        13. Handle inner classes in SCMRatisRequest and SCMRatisResponse (Sub-task, Resolved, Nanda kumar)
        14. decouple finalize and destroy pipeline (Sub-task, Resolved, Li Cheng)
        15. Implement container related operations in ContainerManagerImpl (Sub-task, Resolved, Nanda kumar)
        16. Switch current pipeline interface to the new Replication based interface to write to Ratis (Sub-task, Resolved, Glen Geng)
        17. Add isLeader check for SCM state updates (Sub-task, Resolved, Li Cheng)
        18. remove the 1st edition of RatisServer of SCM HA which is copied from OM HA (Sub-task, Resolved, Glen Geng)
        19. update RATIS version from 1.0.0 to 1.1.0-85281b2-SNAPSHOT (Sub-task, Resolved, Glen Geng)
        20. RATIS ONE Pipeline is closed but not removed when a datanode goes stale (Sub-task, Resolved, Glen Geng)
        21. Pipeline is not removed when a datanode goes stale (Sub-task, Resolved, Glen Geng)
        22. Add failover proxy to SCM block protocol (Sub-task, Resolved, Li Cheng)
        23. enable SCM Raft Group based on config ozone.scm.names (Sub-task, Resolved, Glen Geng)
        24. CLI command to show current SCM leader and follower status (Sub-task, Resolved, Rui Wang)
        25. Switch to ContainerManagerV2 (Sub-task, Resolved, Li Cheng)
        26. SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine (Sub-task, Resolved, Glen Geng)
        27. Handle PipelineAction and OpenPipeline from DN to SCM (Sub-task, Resolved, Unassigned)
        28. Make sure AllocateBlock can only be executed on leader SCM (Sub-task, Resolved, Unassigned)
        29. Handle NodeReport from DN to SCMs (Sub-task, Resolved, Unassigned)
        30. Handle events fired from PipelineManager to close container (Sub-task, Resolved, Unassigned)
        31. Handle ContainerReport and IncrementalContainerReport (Sub-task, Resolved, Unassigned)
        32. Replication can only be executed on leader (Sub-task, Resolved, Unassigned)
        33. Use new ContainerManager in SCM (Sub-task, Resolved, Nanda kumar)
        34. Add failover proxy for SCM container client (Sub-task, Resolved, Li Cheng)
        35. DN can distinguish SCMCommand from stale leader SCM (Sub-task, Resolved, Glen Geng)
        36. Fix CI and test failures after force push on 2020/10/26 (Sub-task, Resolved, Nanda kumar)
        37. Fix TestMiniOzoneHACluster.testGetOMLeader() (Sub-task, Resolved, Rui Wang)
        38. Add ReadWriteLock into PipelineStateManagerV2Impl to protect contentions between RaftServer and PipelineManager (Sub-task, Resolved, Glen Geng)
        39. Need throw exception to trigger FailoverProxyProvider of SCM client to work (Sub-task, Resolved, Glen Geng)
        40. Remove checkLeader in PipelineManager. (Sub-task, Resolved, Glen Geng)
        41. Add tests for replication annotation (Sub-task, Resolved, Rui Wang)
        42. SCM ServiceManager (Sub-task, Resolved, Glen Geng)
        43. Use getRoleInfoProto() in isLeader check (Sub-task, Resolved, Glen Geng)
        44. Handle stale leader issue (Sub-task, Resolved, Unassigned)
        45. Add Snapshot into new SCMRatisServer and SCMStateMachine (Sub-task, Resolved, Rui Wang)
        46. SCM needs to replay RaftLog for recovery (Sub-task, Resolved, Rui Wang)
        47. BackgroundPipelineCreator can only serve leader (Sub-task, Resolved, Unassigned)
        48. Implement Ratis Snapshots on SCM (Sub-task, Resolved, Rui Wang)
        49. DeleteBlock via Ratis in SCM HA (Sub-task, Resolved, runzhiwang)
        50. Load Snapshot info upon SCM Ratis starts (Sub-task, Resolved, Rui Wang)
        51. Allow Enabling Purge SCM Ratis log (Sub-task, Resolved, Rui Wang)
        52. Stop BackgroundPipelineCreator when PipelineManager is closed (Sub-task, Resolved, Rui Wang)
        53. SCMStateMachine::applyTransaction() should not invoke TransactionContext.getClientRequest() (Sub-task, Resolved, Glen Geng)
        54. Fix SCMHAManager#getPeerIdFromRoleInfo (Sub-task, Resolved, Glen Geng)
        55. Update pipeline db when pipeline state is changed (Sub-task, Resolved, Shashikant Banerjee)
        56. Avoid rewriting pipeline information during PipelineStateManagerV2Impl initialization (Sub-task, Resolved, Rui Wang)
        57. SCMContext Phase 1 - Raft Related Info (Sub-task, Resolved, Glen Geng)
        58. SCMContext (Sub-task, Resolved, Glen Geng)
        59. Handle potential data loss during ReplicationManager.handleOverReplicatedContainer() (Sub-task, Resolved, Glen Geng)
        60. Refactor SCMHAManager and SCMRatisServer with RaftServer.Division (Sub-task, Resolved, Glen Geng)
        61. Use OM style Configuration to initialize SCM HA (Sub-task, Resolved, Rui Wang)
        62. Support SCM HA in MiniOzoneHACluster (Sub-task, Open, Rui Wang)
        63. PipelineStateManagerV2Impl#removePipeline will remove pipeline from db in case of failure (Sub-task, Open, Unassigned)
        64. Backport updates from ContainerManager(V1) (Sub-task, Open, Unassigned)
        65. Backport updates from PipelineManager(V1) (Sub-task, Open, Unassigned)
        66. Use suggestedLeader for SCM failover proxy performing failover (Sub-task, Open, Unassigned)
        67. Add unit test for SCMHAInvocationHandler (Sub-task, Open, Nanda kumar)
        68. Handle pipeline reports (Sub-task, Open, Unassigned)
        69. Handle ContainerAction and CloseContainer (Sub-task, Open, Unassigned)
        70. Arrange Util classes for SCM HA (Sub-task, Open, Nanda kumar)
        71. SCM CLI command towards certain IP (Sub-task, Open, Unassigned)
        72. Update javadoc in SCMHA related classes (Sub-task, Open, Nanda kumar)
        73. Revisit SCM client retry and failover when SCM leader changes (Sub-task, Open, Shashikant Banerjee)
        74. Design for SCM HA configuration (Sub-task, Open, Unassigned)
        75. Provide docker-compose for SCM HA (Sub-task, Open, Unassigned)
        76. Refactor out Ratis logic chain (Sub-task, Open, Unassigned)
        77. SafeMode exit rule for all SCMs (Sub-task, Open, Unassigned)
        78. Decommission can be only executed on leader (Sub-task, Open, Rui Wang)
        79. Bootstrap SCM HA Security (Sub-task, Resolved, Bharat Viswanadham)
        80. CLI for SCMs info (Sub-task, Open, Unassigned)
        81. Design for Error/Exception handling in state update for container/pipeline V2 (Sub-task, Open, Glen Geng)
        82. acceptance test for SCM HA (Sub-task, Resolved, Bharat Viswanadham)
        83. Add unit test for container operation in ContainerManagerImpl (Sub-task, Open, Nanda kumar)
        84. replace scmID with clusterID for container and volume at Datanode side (Sub-task, Open, Glen Geng)
        85. In ContainerStateManagerV2, modification of RocksDB should be consistent with that of memory state. (Sub-task, Open, Glen Geng)
        86. Fix Recon after HDDS-4133 (Sub-task, Patch Available, Nanda kumar)
        87. TestSCMStateMachine (Sub-task, Open, Unassigned)
        88. SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException (Sub-task, Open, Rui Wang)
        89. Testing Infrastructure Random Failures (Sub-task, Open, Unassigned)
        90. SCM HA needs handle the generation of clusterID and scmUuid in a decent way. (Sub-task, Open, Unassigned)
        91. Add unit test to prove that datanode can handle term in SCMCommand properly (Sub-task, Open, Unassigned)
        92. FailoverProxyProvider of SCM client should support leaderHint. (Sub-task, Open, Rui Wang)
        93. Handle inflight delete/add actions in ReplicationManager properly. (Sub-task, Open, YI-CHEN WANG)
        94. Use single server raft cluster in MiniOzoneCluster. (Sub-task, Resolved, Glen Geng)
        95. Fix set configs in SCMHAConfigration (Sub-task, Resolved, Rui Wang)
        96. min/max election timeout of SCMRatisServer is not set properly. (Sub-task, Resolved, Glen Geng)
        97. Solve deadlock triggered by PipelineActionHandler. (Sub-task, Resolved, Glen Geng)
        98. Add term into SetNodeOperationalStateCommand. (Sub-task, Resolved, Glen Geng)
        99. Fix SCMHAManagerImpl#isLeader after RATIS-1227 (Sub-task, Resolved, Unassigned)
        100. Implement DB buffer in MockHAManager (Sub-task, Resolved, Rui Wang)
        101. Handle backward compatible when upgrading from non HA to HA (Sub-task, Open, Rui Wang)
        102. Change default SCM snapshot frequency to a lower value (Sub-task, Resolved, Rui Wang)
        103. Ratis Snapshot should be loaded from the config (Sub-task, Resolved, Rui Wang)
        104. Implement Distributed Sequence ID Generator (Sub-task, Closed, Glen Geng)
        105. Should disallow log purge before installSnapshot is implemented (Sub-task, Resolved, Rui Wang)
        106. Implement InstallSnapshot for SCM HA (Sub-task, Open, Shashikant Banerjee)
        107. Merge OMTransactionInfo with SCMTransactionInfo (Sub-task, Resolved, Shashikant Banerjee)
        108. Disallow committing to DB by getCurrentBatchOperation() (Sub-task, Open, Unassigned)
        109. Use applyTransactionSerial instead of applyTransaction (Sub-task, Resolved, Rui Wang)
        110. Support encode and decode ArrayList and Long (Sub-task, Resolved, runzhiwang)
        111. Replace UniqueID by the Distributed Sequence ID Generator (Sub-task, Resolved, Rui Wang)
        112. Bootstrap new SCM node (Sub-task, Resolved, Shashikant Banerjee)
        113. Add ratis snapshot retention policy for SCM HA (Sub-task, Open, Shashikant Banerjee)
        114. Better handle the case that setting a trx that is earlier than latest trx in SCMDBTransactionBuffer (Sub-task, Open, Rui Wang)
        115. Admin command should take effect on all SCM instance (Sub-task, Resolved, Glen Geng)
        116. Add STOP state to SCMService. (Sub-task, Resolved, Unassigned)
        117. activatePipeline/deactivatePipeline in PipelineManagerV2Impl should acquire lock before calling StateManager#updatePipelineState. (Sub-task, Resolved, Xu Shao Hong)
        118. Implement increment count optimization in DeletedBlockLog V2 (Sub-task, Resolved, Rui Wang)
        119. Add functionality to transfer Rocks db checkpoint from leader to follower (Sub-task, Resolved, Shashikant Banerjee)
        120. Add transactionId into deletingTxIDs when remove it from DB (Sub-task, Resolved, runzhiwang)
        121. Temporarily ignore failing Recon tests (Sub-task, Open, Nanda kumar)
        122. Merge SCMRatisSnapshotInfo and OMRatisSnapshotInfo into a single class (Sub-task, Resolved, Shashikant Banerjee)
        123. Disable Prevote in Ratis in SCM HA by default (Sub-task, Resolved, Rui Wang)
        124. Fix findbugs issues after HDDS-2195 (Sub-task, Resolved, Glen Geng)
        125. Fix TestContainerEndpoint after merging master to HDDS-2823. (Sub-task, Resolved, Glen Geng)
        126. Fix TestReconContainerManager after merge master to HDDS-2823 (Sub-task, Resolved, Glen Geng)
        127. Fix misc acceptance test: List pipelines on unknown host (Sub-task, Resolved, Glen Geng)
        128. Add install checkpoint in SCMStateMachine (Sub-task, Resolved, Shashikant Banerjee)
        129. Integrate DeleteBlockLog with PartialTableCache (Sub-task, Resolved, Unassigned)
        130. Use OM style config to construct RaftGroup and initialize Raft Servers (Sub-task, Open, Rui Wang)
        131. Add multiple SCM nodes to MiniOzoneCluster (Sub-task, Resolved, Shashikant Banerjee)
        132. [SCM HA Security] Implement generate SCM certificate (Sub-task, Resolved, Bharat Viswanadham)
        133. Use SCM service ID in SCMBlockClient and SCM Client (Sub-task, Resolved, Bharat Viswanadham)
        134. Implement scm --bootstrap command (Sub-task, Resolved, Shashikant Banerjee)
        135. Make SCM Generic config support HA Style (Sub-task, Resolved, Bharat Viswanadham)
        136. Move Ratis group creation to scm --init phase (Sub-task, Resolved, Shashikant Banerjee)
        137. Rename MiniOzoneHACluster to MiniOzoneOMHACluster (Sub-task, Resolved, Mukul Kumar Singh)
        138. Use SCM service ID in finding SCM Datanode address. (Sub-task, Resolved, Bharat Viswanadham)
        139. Make changes required for SCM admin commands to work with SCM HA (Sub-task, Resolved, Bharat Viswanadham)
        140. Merge SCM HA Configuration (Sub-task, Open, Bharat Viswanadham)
        141. Reopen replication/wait.robot added by HDDS-4834 (Sub-task, Resolved, Glen Geng)
        142. Handle NotLeaderException with Event Queue Handlers (Sub-task, Open, Unassigned)
        143. Provide docker-compose for SCM HA (Sub-task, Resolved, Attila Doroszlai)
        144. Datanode with scmID format should work with clusterID directory format (Sub-task, Resolved, Mukul Kumar Singh)
        145. [SCM HA Security] Implement listCertificates based on role (Sub-task, Resolved, Bharat Viswanadham)
        146. [SCM HA Security] Add failover proxy to SCM Security Server Protocol (Sub-task, Resolved, Bharat Viswanadham)
        147. Make SCM ratis server spin up time during initialization configurable (Sub-task, Resolved, Jackson Yao)
        148. Retry policy for SCM requests over ratis (Sub-task, Open, Shashikant Banerjee)
        149. Fix removing local SCM when submitting request to other SCM. (Sub-task, Resolved, Bharat Viswanadham)
        150. Fix and enable TestReconTasks (Sub-task, Resolved, Mukul Kumar Singh)
        151. Fix and enable TestEndpoints.java (Sub-task, Resolved, Mukul Kumar Singh)
        152. SCM Ratis enable/disable switch (Sub-task, Resolved, Shashikant Banerjee)
        153. Use PipelineManagerV2Impl in Recon and enable ignored Recon test cases. (Sub-task, Resolved, Glen Geng)
        154. Add integration test for SequenceIdGen (Sub-task, Open, Unassigned)
        155. Need a tool to upgrade current non-HA SCM node to single node HA cluster (Sub-task, Resolved, Shashikant Banerjee)
        156. [SCM HA Security] Create SCM Cert Client and change DefaultCA to allow self signed and intermediary (Sub-task, Resolved, Bharat Viswanadham)
        157. [SCM HA Security] Ozone services should be disabled in SCM HA enabled and security enabled cluster (Sub-task, Resolved, Bharat Viswanadham)
        158. Add SCM HA to Chaos tests (Sub-task, Resolved, Mukul Kumar Singh)
        159. Add SCM to Ratis Log Parser (Sub-task, Open, Mukul Kumar Singh)
        160. Support inline upgrade from containerId, delTxnId, localId to SequenceIdGenerator. (Sub-task, Resolved, Glen Geng)
        161. [SCM HA Security] Integrate CertClient (Sub-task, Resolved, Bharat Viswanadham)
        162. refactor code in SCMStateMachine. (Sub-task, Resolved, Glen Geng)
        163. NullPointerException during SCM init (Sub-task, Resolved, Bharat Viswanadham)
        164. [SCM HA Security] When Ratis is enabled, SCM secure cluster is not working (Sub-task, Resolved, Bharat Viswanadham)
        165. Provide example k8s files to run full HA Ozone (Sub-task, Resolved, Marton Elek)
        166. Return with exit code 0 in case of optional scm bootstrap/init (Sub-task, Resolved, Marton Elek)
        167. [SCM HA Security] Implement listCAs and getRootCA API (Sub-task, Resolved, Bharat Viswanadham)
        168. [SCM HA Security] Make CertStore DB updates for StoreValidateCertificate go via Ratis (Sub-task, Resolved, Bharat Viswanadham)
        169. [SCM HA Security] Handle leader changes during bootstrap (Sub-task, Resolved, Bharat Viswanadham)
        170. Fix flaky test TestSCMInstallSnapshotWithHA#testInstallCorruptedCheckpointFailure (Sub-task, Resolved, Shashikant Banerjee)
        171. Adapt admincli tests for SCM HA (Sub-task, Open, Attila Doroszlai)
        172. Back-port HDDS-4911 (List container by container state) to ContainerManagerV2 (Sub-task, Resolved, Jackson Yao)
        173. Solve IntelliJ warnings on DBTransactionBuffer. (Sub-task, Resolved, Xu Shao Hong)
        174. Remove SequenceIdGenerator#StateManagerImpl (Sub-task, Resolved, Jackson Yao)
        175. [SCM HA Security] Make storeValidCertificate method idempotent (Sub-task, Resolved, Bharat Viswanadham)
        176. [SCM HA Security] generate certserialID in distributed sequence (Sub-task, Open, Unassigned)
        177. [SCM HA Security] Make changes required for ratis enabled with new model of RootCA/subCA (Sub-task, Resolved, Bharat Viswanadham)
        178. [Doc] Add SCM HA Setup Doc (Sub-task, Resolved, Marton Elek)
        179. localId is not consistent across SCMs when setup a multi node SCM HA cluster. (Sub-task, Resolved, Glen Geng)
        180. During bootstrap, always download checkpoint from leader SCM. (Sub-task, Open, Unassigned)
        181. SCM get roles command should provide Ratis Leader/Follower information. (Sub-task, Resolved, George Huang)
        182. [SCM HA Security] Make upgraded cluster to ratis enabled single node cluster (Sub-task, Open, Bharat Viswanadham)
        183. SCM may not be able to know full port list of Datanode after Datanode is started. (Sub-task, Resolved, Glen Geng)
        184. Merge SCM HA configs to ScmConfigKeys (Sub-task, Open, Unassigned)
        185. [SCM HA Security] Handle leader changes between SCMInfo and getSCMSigned Cert in OM (Sub-task, Resolved, Bharat Viswanadham)
        186. [SCM HA Security] Fix duration of sub-ca certs (Sub-task, Resolved, Bharat Viswanadham)
        187. [SCM HA Security] Make InterSCM grpc channel secure (Sub-task, Resolved, Bharat Viswanadham)
        188. [SCM HA Security] Remove code of not starting ozone services when Security is enabled on SCM HA cluster (Sub-task, Resolved, Bharat Viswanadham)
        189. [SCM HA Security] NPE during secure SCM initialization with HA code updated to an already existing cluster (Sub-task, Resolved, Bharat Viswanadham)
        190. Ensure failover to suggested leader if any for NotLeaderException (Sub-task, Resolved, Shashikant Banerjee)
        191. [SCM HA Security] Enable s3 test suite for ozone-secure-ha (Sub-task, Resolved, Bharat Viswanadham)
        192. make Decommission work under SCM HA. (Sub-task, Resolved, Glen Geng)
        193. Use MiniOzoneHAClusterImpl in TestDecommissionAndMaintenance. (Sub-task, Open, Glen Geng)
        194. [SCM HA Security] Handle bootstrap of SCM when primary SCM is down (Sub-task, Open, Unassigned)
        195. Fix Install Snapshot Mechanism in SCMStateMachine (Sub-task, Resolved, Shashikant Banerjee)
        196. Prioritising SCM's in SCM HA setup (Sub-task, Open, Shashikant Banerjee)
        197. Add more tests for SCM Failover scenarios (Sub-task, Open, Shashikant Banerjee)
        198. Divide snapshot related work into notifyInstallSnapshotFromLeader and reinitialize for SCMStateMachine. (Sub-task, Open, Glen Geng)
        199. If primordial SCM id is set, a non-HA cluster can not be initialized. (Sub-task, Patch Available, Mukul Kumar Singh)

            People

            • Assignee: Li Cheng (licheng)
            • Reporter: Sammi Chen (Sammi)

                Time Tracking

                • Estimated: Not Specified
                • Remaining: 0h
                • Logged: 40m