  1. Apache Ozone
  2. HDDS-505

OzoneManager HA

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: OM HA, Ozone Manager

    Description

      OzoneManager can be a single point of failure in an Ozone cluster. We propose an HA implementation for OM using Ratis (Raft protocol).

      The design document for the proposed implementation is attached.

      Attachments

        1. Handling Write Requests with OM HA.pdf (366 kB, Bharat Viswanadham)
        2. OM HA Cache Design.pdf (146 kB, Bharat Viswanadham)
        3. OzoneManager HA.pdf (296 kB, Hanisha Koneru)

        Issue Links

          1. Start a Standalone Ratis Server on OM (Sub-task, Resolved, Hanisha Koneru)
          2. Encapsulate all client to OM requests into one request message (Sub-task, Resolved, Hanisha Koneru)
          3. Submit client request to OM Ratis server (Sub-task, Resolved, Hanisha Koneru)
          4. Implement OzoneManager State Machine (Sub-task, Resolved, Hanisha Koneru)
          5. Add support for configuring multiple OMs (Sub-task, Resolved, Hanisha Koneru)
          6. Generate RaftGroupId from OMServiceID (Sub-task, Resolved, Aravindan Vijayan)
          7. Implement RetryProxy and FailoverProxy for OM client (Sub-task, Resolved, Hanisha Koneru, Time Spent: 40m)
          8. Remove RaftClient from OM (Sub-task, Resolved, Hanisha Koneru)
          9. Setup Failover Proxy Provider for OM client (Sub-task, Resolved, Hanisha Koneru)
          10. Serve read requests directly from RocksDB (Sub-task, Resolved, Hanisha Koneru, Time Spent: 1h 10m)
          11. Add Tracing back to OzoneManagerProtocol (Sub-task, Resolved, Hanisha Koneru)
          12. Provide docker-compose for OM HA (Sub-task, Resolved, Hanisha Koneru, Time Spent: 1h 10m)
          13. In OM HA AllocateBlock call where connecting to SCM from OM should not happen on Ratis (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 5h)
          14. In OM HA OpenKey call Should happen only leader OM (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 4h 50m)
          15. In OM HA InitiateMultipartUpload call Should happen only leader OM (Sub-task, Resolved, Bharat Viswanadham)
          16. Implement Ratis Snapshots on OM (Sub-task, Resolved, Hanisha Koneru, Time Spent: 5h)
          17. Convert all OM Volume related operations to HA model (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 4h 50m)
          18. Convert all OM Bucket related operations to HA model (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2h 50m)
          19. Download RocksDB checkpoint from OM Leader to Follower (Sub-task, Resolved, Hanisha Koneru, Time Spent: 8h)
          20. Convert all OM Key related operations to HA model (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 40m)
          21. OzoneManager Cache (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 12h)
          22. Implement DoubleBuffer in OzoneManager (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 10h 10m)
          23. Implement Bucket Write Requests to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 11h 40m)
          24. Add userName and IPAddress as part of OMRequest. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 5.5h)
          25. Implement AuditLogging for OM HA Bucket write requests (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2h 40m)
          26. Implement updating lastAppliedIndex after buffer flush to OM DB. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2h 20m)
          27. Add metrics and AuditLogging for newly added OM HA methods (Sub-task, Resolved, Bharat Viswanadham)
          28. Implement Volume Write Requests to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 18.5h)
          29. Create OMDoubleBuffer metrics (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 4.5h)
          30. Implement Key Write Requests to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 13h 10m)
          31. Implement File CreateDirectory Request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 3h 40m)
          32. Use ExecutorService in OzoneManagerStateMachine (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 10m)
          33. Implement File CreateFile Request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 6h)
          34. Fix class hierarchy for KeyRequest and FileRequest classes. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 40m)
          35. Cleanup 2phase old HA code for Key requests. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1.5h)
          36. Make OM KeyDeletingService compatible with HA model (Sub-task, Resolved, Hanisha Koneru, Time Spent: 3.5h)
          37. Cleanup Volume Request 2 phase old code (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 40m)
          38. Add Eviction policy for table cache (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 4.5h)
          39. Implement S3 Create Bucket request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 7h 20m)
          40. Fix numKeys metrics in OM HA (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 40m)
          41. Implement S3 Delete Bucket request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 3h 20m)
          42. Implement S3 Initiate MPU request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2.5h)
          43. On installSnapshot notification from OM leader, download checkpoint and reload OM state (Sub-task, Resolved, Hanisha Koneru, Time Spent: 13h 10m)
          44. Fix OMVolumeSetQuota|OwnerRequest#validateAndUpdateCache return response. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h)
          45. Fix TestOzoneManagerHA and TestOzoneManagerSnapShotProvider (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 20m)
          46. Implement S3 Commit MPU request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2h 10m)
          47. Implement S3 Abort MPU request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 10m)
          48. OzoneManagerDoubleBuffer#stop should wait for daemon thread to die (Sub-task, Resolved, Siyao Meng, Time Spent: 1h 20m)
          49. Merge code for HA and Non-HA OM requests for bucket (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 3h)
          50. Fix TableCacheImpl cleanup logic (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 0.5h)
          51. Make changes required for Non-HA to use new HA code in OM. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 4h 40m)
          52. Implement S3 Complete MPU request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 50m)
          53. Fix failures in TestS3MultipartUploadAbortResponse (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 0.5h)
          54. Convert all MPU related operations to HA model (Sub-task, Resolved, Bharat Viswanadham)
          55. Support volume acl operations for OM HA. (Sub-task, Resolved, Xiaoyu Yao, Time Spent: 10h 20m)
          56. Support Bucket ACL operations for OM HA. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 9h 40m)
          57. On OM reload/restart OmMetrics#numKeys should be updated (Sub-task, Resolved, Siyao Meng, Time Spent: 5h 20m)
          58. Support Key ACL operations for OM HA. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 3h 20m)
          59. In OM HA getDelegation call Should happen only leader OM (Sub-task, Resolved, Bharat Viswanadham)
          60. Support Prefix ACL operations for OM HA. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h)
          61. Implement OM GetDelegationToken request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1.5h)
          62. Implement OM CancelDelegationToken request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 50m)
          63. Implement OM RenewDelegationToken request to use Cache and DoubleBuffer (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2h 40m)
          64. Implement GetS3Secret to use double buffer and cache. (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 2h 40m)
          65. Load Snapshot info when OM Ratis server starts (Sub-task, Resolved, Hanisha Koneru, Time Spent: 3h 40m)
          66. Implement default acls for bucket/volume/key for OM HA code (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 3h 40m)
          67. Handle Set DtService of token for OM HA (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h 40m)
          68. Make ozone fs shell command work with OM HA service ids (Sub-task, Resolved, Siyao Meng, Time Spent: 7h)
          69. Add nullable annotation for OMResponse classes (Sub-task, Resolved, Yi-Sheng Lien, Time Spent: 40m)
          70. Make ozone sh command work with OM HA service ids (Sub-task, Resolved, Siyao Meng, Time Spent: 1h 50m)
          71. Fix loadup cache for cache cleanup policy NEVER (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 50m)
          72. Make OM Generic related configuration support HA style config (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 7.5h)
          73. Handle Set DtService of token in S3Gateway for OM HA (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 5h 40m)
          74. Fix listBucket API (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 3h 20m)
          75. Fix listkeys API (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 7h 10m)
          76. Update OzoneServiceProvider in s3 gateway to handle OM ha (Sub-task, Resolved, Bharat Viswanadham)
          77. Add Volume check in KeyManager and File Operations (Sub-task, Resolved, Yi-Sheng Lien, Time Spent: 5.5h)
          78. Fix listParts API (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          79. Fix listVolumes API (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 0.5h)
          80. Run S3 test suite on OM HA cluster (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 1h)
          81. Command line tool for OM Admin (Sub-task, Resolved, Hanisha Koneru, Time Spent: 3.5h)
          82. Acceptance tests for OM HA (Sub-task, Resolved, Hanisha Koneru, Time Spent: 40m)
          83. Send hostName also part of OMRequest (Sub-task, Resolved, Yi-Sheng Lien, Time Spent: 20m)
          84. Fix logic of RetryPolicy in OzoneClientSideTranslatorPB (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          85. Add tests for incorrect OM HA config when node ID or RPC address is not configured (Sub-task, Resolved, Siyao Meng, Time Spent: 2h 10m)
          86. Fix listStatus API (Sub-task, Resolved, Siyao Meng, Time Spent: 20m)
          87. Refactor OMFailoverProxyProvider#loadOMClientConfigs (Sub-task, Resolved, Siyao Meng)
          88. Add support for Registered id as service identifier for CSR. (Sub-task, Resolved, Abhishek Purohit, Time Spent: 20m)
          89. Add ServiceName support for getting Signed Cert. (Sub-task, Resolved, Abhishek Purohit, Time Spent: 20m)
          90. Add ozone.om.internal.service.id to OM HA configuration (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          91. Fix listMultipartupload API (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          92. Remove RatisClient in OM HA (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          93. Merge OzoneManagerRequestHandler and OzoneManagerHARequestHandlerImpl (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          94. Merge OzoneClientFactory#getRpcClient functions (Sub-task, Resolved, Siyao Meng, Time Spent: 20m)
          95. Ozone S3 CLI commands not working on HA cluster (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          96. Fix ApplyTransaction error handling in OzoneManagerStateMachine (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 40m)
          97. Generate renewTime on OMLeader for GetDelegationToken (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          98. Make expiry of Delegation tokens to OM HA model. (Sub-task, Resolved, Hanisha Koneru)
          99. HA failover attempt log level should be set to DEBUG (Sub-task, Resolved, Hanisha Koneru)
          100. Allocating Blocks is not happening for Multipart upload part key in createMultipartKey call (Sub-task, Resolved, Unassigned)
          101. Describe how ozoneManagerDoubleBuffer works in ascii art in code (Sub-task, Resolved, Unassigned)
          102. Compare transactionID and updateID of Volume operations to avoid replaying transactions (Sub-task, Resolved, Hanisha Koneru)
          103. Handle replay of KeyCreate requests (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          104. Handle replay of KeyDelete and KeyRename Requests (Sub-task, Resolved, Hanisha Koneru, Time Spent: 10m)
          105. Handle Replay of AllocateBlock request (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          106. Add ObjectID and updateID to BucketInfo to avoid replaying transactions (Sub-task, Resolved, Hanisha Koneru, Time Spent: 10m)
          107. Handle replay of KeyPurge Request (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          108. Consolidate ObjectID and UpdateID from Info objects into one class (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          109. Handle replay of KeyCommitRequest and DirectoryCreateRequest (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          110. Handle replay of S3 requests (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          111. Handle replay of OM Volume ACL requests (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          112. Handle replay of OM Key ACL requests (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          113. Handle replay of OM Prefix ACL requests (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          114. ACL checks should be done after acquiring lock (Sub-task, Resolved, Unassigned)
          115. Delete replayed entry from OpenKeyTable during commit (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          116. Ozone S3 CLI path command not working on HA cluster (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          117. OM RpcClient fail with java.lang.IllegalArgumentException (Sub-task, Resolved, Bharat Viswanadham, Time Spent: 20m)
          118. Add unit tests for OMGetDelegationToken Request and Response (Sub-task, Resolved, Yi-Sheng Lien, Time Spent: 10m)
          119. OM Client failover to next OM on NotLeaderException (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)
          120. Add wait time between client retries to OM (Sub-task, Resolved, Hanisha Koneru, Time Spent: 20m)

            People

              Assignee: Hanisha Koneru
              Reporter: Hanisha Koneru
              Votes: 1
              Watchers: 23

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 286h 40m