Apache Ozone / HDDS-6462

Phase II : Erasure Coding Offline Recovery & Read/Write Improvements


Details

    • Type: Epic
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: None
    • Labels: None

    Description

      This is an umbrella Jira for EC offline recovery work.

      Phase I (HDDS-3816) delivered the erasure coding write and read functionality. That work is being stabilized in a parallel effort.

      This Jira starts the pending recovery work needed to finish the end-to-end EC MVP.

      Requirements in brief: 

      1. The SCM identifies the lost containers and schedules them for reconstruction.
      2. DNs start reconstructing the containers upon request from the SCM.
      3. We can decide whether to create a new RM (Replication Manager) at the SCM for EC work or reuse the existing one. Currently there is interest in starting a new RM so that it begins clean, as the existing one is already complex enough.
      4. DNs figure out the blocks themselves by interacting with multiple EC block containers, as a single EC container may not have the full set of blocks. Either the first container or the parity containers should have the full block set.
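
      As a rough illustration of requirements 1 and 4 above, the Java sketch below computes which replica indexes of an EC container are missing, checks whether enough replicas survive to rebuild them, and unions the block lists reported by the surviving replicas. All names in it (EcRecoverySketch, missingIndexes, isRecoverable, fullBlockSet) are hypothetical and are not Ozone APIs. It also assumes 1-based replica indexes and the usual erasure coding property that any `data` of the `data + parity` replicas are enough for reconstruction.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch only; none of these names exist in Ozone.
final class EcRecoverySketch {

    // Replica indexes in [1, data + parity] that have no surviving replica.
    // Assumes 1-based replica indexes.
    static SortedSet<Integer> missingIndexes(Set<Integer> present, int data, int parity) {
        SortedSet<Integer> missing = new TreeSet<>();
        for (int i = 1; i <= data + parity; i++) {
            if (!present.contains(i)) {
                missing.add(i);
            }
        }
        return missing;
    }

    // With `parity` parity replicas, at most `parity` indexes may be lost.
    static boolean isRecoverable(Set<Integer> present, int data, int parity) {
        return missingIndexes(present, data, parity).size() <= parity;
    }

    // A single replica may not list every block, so take the union of the
    // block IDs reported by all surviving replicas.
    static SortedSet<Long> fullBlockSet(Map<Integer, List<Long>> blocksByReplicaIndex) {
        SortedSet<Long> all = new TreeSet<>();
        for (List<Long> blocks : blocksByReplicaIndex.values()) {
            all.addAll(blocks);
        }
        return all;
    }

    public static void main(String[] args) {
        // EC 3+2 container where replica indexes 3 and 4 were lost.
        Set<Integer> present = Set.of(1, 2, 5);
        // Block 3 is small, so data replica 2 never received a chunk for it.
        Map<Integer, List<Long>> blocks = Map.of(
            1, List.of(1L, 2L, 3L),
            2, List.of(1L, 2L),
            5, List.of(1L, 2L, 3L));

        System.out.println("missing=" + missingIndexes(present, 3, 2));
        System.out.println("recoverable=" + isRecoverable(present, 3, 2));
        System.out.println("blocks=" + fullBlockSet(blocks));
    }
}
```

      In this example the second data replica alone would under-report the block set, which is why the sketch consults every surviving replica rather than a single one.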

      I am splitting the offline recovery part of the design out of HDDS-3816 and will post it here soon.

      Stay tuned for the updated doc.

      We will also create a new branch for this work in some time.

      Sub-Tasks

          1. EC: Fix Datanode block file INCONSISTENCY during heavy load. (Sub-task, Resolved, Mark Gui)
          2. EC: EC pipeline records are not removed after close. (Sub-task, Resolved, Mark Gui)
          3. EC: Make ECBlockReconstructedStripeInputStream to be used by DNs as well (Sub-task, Resolved, Attila Doroszlai)
          4. EC: [Refactor-2] Check and write parity cells inside handleDataWrite (Sub-task, Resolved, Kaijie Chen)
          5. EC: Scm CheckAndRecoverECContainer command (Sub-task, Resolved, cchenaxchen)
          6. EC: Implement the EC Reconstruction Command with necessary information (Sub-task, Resolved, Uma Maheswara Rao G)
          7. Add a new replication manager and change the existing one to legacy (Sub-task, Resolved, Jie Yao)
          8. EC: Add BlockGroupLen info as part of PutBlock in EC Writes for helping in recovery. (Sub-task, Resolved, Uma Maheswara Rao G)
          9. EC: Refactor ECKeyOutputStream for better code reuse (Sub-task, Resolved, Kaijie Chen)
          10. EC: Add rpc for EC recovery in replication service (Sub-task, Resolved, Kaijie Chen)
          11. EC: ReplicationManager - create version of ContainerReplicaCounts applicable to EC (Sub-task, Resolved, Stephen O'Donnell)
          12. EC: Add EC pipeline minimum to MiniOzoneCluster (Sub-task, Resolved, Kaijie Chen)
          13. EC: Add the DN side Reconstruction Handler class. (Sub-task, Resolved, Uma Maheswara Rao G)
          14. EC: DN ability to create container in temp location and write blocks to it. (Sub-task, Resolved, Mark Gui)
          15. EC: Support ListBlock from CoordinatorDN (Sub-task, Resolved, Kaijie Chen)
          16. EC: ReplicationManager - create ContainerReplicaPendingOps class and integrate with ContainerManager (Sub-task, Resolved, Stephen O'Donnell)
          17. EC: ReplicationManager - make ContainerReplicaPendingOps into a SCM service (Sub-task, Resolved, Jie Yao)
          18. EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (Sub-task, Resolved, Uma Maheswara Rao G)
          19. SCMContainerPlacementRackScatter should use original required node num to validate placement policy (Sub-task, Resolved, Jie Yao)
          20. EC: Fix datanode exclusion check in client (Sub-task, Resolved, Kaijie Chen)
          21. EC: DN ability to create RECOVERING containers for EC reconstruction. (Sub-task, Resolved, Uma Maheswara Rao G)
          22. EC: Add listBlock command MockDatanodeStorage for mocking in reconstruction work. (Sub-task, Resolved, Uma Maheswara Rao G)
          23. EC: Extend BlockReconstructedInputStreams to recover parity block buffers as well if missing (Sub-task, Resolved, Attila Doroszlai)
          24. Need proper error message when "RATIS" replication-type is passed with EC codec (Sub-task, Resolved, Kaijie Chen)
          25. EC: getFileCheckSum should return null EC files until ECFileChecksum implemented. (Sub-task, Resolved, Uma Maheswara Rao G)
          26. EC: SCMContainerPlacementRackScatter#chooseDatanodes may choose less nodes than required in unknown cases. (Sub-task, Resolved, Attila Doroszlai)
          27. EC: Validate the server default configuration on Ozone manager startup. (Sub-task, Resolved, Aswin Shakil)
          28. EC: Provide correct example for EC in ozone.server.default.replication (Sub-task, Resolved, Swaminathan Balachandran)
          29. EC: Fix potential wrong replica read with over-replicated container. (Sub-task, Resolved, Mark Gui)
          30. EC: Standalone containers should not move to quasi closed (Sub-task, Resolved, Stephen O'Donnell)
          31. EC: ReplicationManager - create class to detect container health issues (Sub-task, Resolved, Stephen O'Donnell)
          32. EC: Implement the EC Reconstruction coordinator (Sub-task, Resolved, Uma Maheswara Rao G)
          33. EC: Analyze and add the block token support for ECReconstructionCoordinator (Sub-task, Resolved, Attila Doroszlai)
          34. EC: Handle reconstructECContainersCommand in heartbeat (Sub-task, Resolved, Attila Doroszlai)
          35. EC: ReplicationManager - collect under and over replicated containers (Sub-task, Resolved, Stephen O'Donnell)
          36. EC: EC Reconstruction Command count queues should be included in DN heartbeat (Sub-task, Resolved, Uma Maheswara Rao G)
          37. EC: ReplicationManager - Add class to handle under-replication and form a command for a datanode (Sub-task, Resolved, Uma Maheswara Rao G)
          38. EC: ReplicationManager - create class to form a replicate command for under replicated containers (Sub-task, Resolved, Uma Maheswara Rao G)
          39. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission (Sub-task, Resolved, Attila Doroszlai)
          40. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount (Sub-task, Resolved, Stephen O'Donnell)
          41. EC: [Code Quality] Add more tests to get closure to 100% code coverage (Sub-task, Resolved, Unassigned)
          42. EC: ReplicationManager - priortise under replicated containers (Sub-task, Resolved, Stephen O'Donnell)
          43. EC: Implement the Over-replication Handler (Sub-task, Resolved, Jie Yao)
          44. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. (Sub-task, Resolved, Uma Maheswara Rao G)
          45. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. (Sub-task, Resolved, Uma Maheswara Rao G)
          46. EC: Cleanup RECOVERING container on DN restarts. (Sub-task, Resolved, Jie Yao)
          47. EC: put key command with EC replication can use ReplicationConfig validator (Sub-task, Resolved, Swaminathan Balachandran)
          48. EC: Add Test for RECOVERING container cleanup when failure. (Sub-task, Resolved, Uma Maheswara Rao G)
          49. EC: Skip the EC container for balancer (Sub-task, Resolved, Siddhant Sangwan)
          50. EC: CreateBucketHandler should use ReplicationConfig Validator (Sub-task, Resolved, Attila Doroszlai)
          51. EC: Implement RECOVERING Container Scrubber. (Sub-task, Resolved, Jie Yao)
          52. EC: ReplicationManager - Logic to process the over replicated queues and assign work to DNs (Sub-task, Resolved, Stephen O'Donnell)
          53. EC: ReplicationManager - skip processing open containers (Sub-task, Resolved, Stephen O'Donnell)
          54. EC: Implement the Over replication Processor (Sub-task, Resolved, Uma Maheswara Rao G)
          55. EC: Add ec write channel (Sub-task, Resolved, cchenaxchen)
          56. EC : support EC stripe shuffle (Sub-task, Resolved, Unassigned)
          57. EC: Improve write performance by pipelining encode and flush (Sub-task, Resolved, Kaijie Chen)
          58. EC: Ensure replica index is maintained when replicating a container (Sub-task, Resolved, Stephen O'Donnell)
          59. EC: ReplicationManager - Over replication handler should set repIndex on delete cmds (Sub-task, Resolved, Stephen O'Donnell)
          60. NPE in ec.reconstruction.TokenHelper (Sub-task, Resolved, Attila Doroszlai)
          61. Exception in Replication Monitor Thread: java.lang.IllegalArgumentException (Sub-task, Closed, George Huang)
          62. EC - ReplicationManager - handle maintenance only indexes in the under replication handler (Sub-task, Resolved, Unassigned)
          63. EC: Prematurely re-throwed the exception in reconstruction cleanup loop. (Sub-task, Resolved, Uma Maheswara Rao G)
          64. EC: Add debug logging with exception info when stripe write failed (Sub-task, Resolved, Swaminathan Balachandran)
          65. Support balancing EC container. (Sub-task, Resolved, Jie Yao)
          66. [Ozone EC] remove warnings and errors from console during online reconstruction of data. (Sub-task, Resolved, Swaminathan Balachandran)
          67. EC: ReplicationManager - Track nodes already used when handing under replication (Sub-task, Resolved, Stephen O'Donnell)
          68. EC: Fix block deletion not allowed due to missing pipelineID (Sub-task, Resolved, Kaijie Chen)
          69. EC: ReplicationManager - UnderRep handler should handle duplicate indexes (Sub-task, Resolved, Stephen O'Donnell)
          70. EC: Fix offset Condition in ECKeyOutputStream (Sub-task, Resolved, Swaminathan Balachandran)
          71. EC: DN reported Open EC container may not get closed if SCM container was already closed state? (Sub-task, Resolved, Aswin Shakil)
          72. EC: ReplicationManager - extend EC Container health check for mis-replication (Sub-task, Resolved, Unassigned)
          73. EC: Add tests for erasure coding with MiniOzoneChaosCluster (Sub-task, Resolved, Nilotpal Nandi)
          74. EC: Handle maintenance replicas in ECUnderReplicationHandler (Sub-task, Resolved, Siddhant Sangwan)
          75. EC: decommission compatible offline recovery (Sub-task, Resolved, cchenaxchen)
          76. EC: Define the value of Maintenance Redundancy for EC containers (Sub-task, Resolved, Stephen O'Donnell)
          77. EC: Schedule UnderReplicatedProcessor and OverReplicatedProcessor threads in RM instead of StorageContainerManager (Sub-task, Resolved, Siddhant Sangwan)
          78. EC: ReplicationManager - create handlers to perform various container checks (Sub-task, Resolved, Stephen O'Donnell)
          79. EC: ReplicationManager - Encapsulate the under and over rep queues into a queue object (Sub-task, Resolved, Stephen O'Donnell)
          80. Add Ratis tests for HealthCheck handlers of Replication Manager (Sub-task, Resolved, Siddhant Sangwan)
          81. EC: Add a Handler for CLOSING containers in Replication Manager (Sub-task, Resolved, Siddhant Sangwan)
          82. EC: Handle the placement policy check in ECUnderReplicationHandler (Sub-task, Resolved, Swaminathan Balachandran)
          83. Add a handler for Quasi Closed containers to RM (Sub-task, Resolved, Siddhant Sangwan)
          84. EC: Change the placement policy interface to allow existing nodes to be specified. (Sub-task, Resolved, Swaminathan Balachandran)
          85. EC: Fix tests for HealthCheck handlers of RM that use Replica Indexes for Ratis Containers (Sub-task, Resolved, Siddhant Sangwan)
          86. Erasure coding and encryption are not flagged on FileStatus (Sub-task, Resolved, Swaminathan Balachandran)
          87. EC: EC Decode can fail when byteBuffer from elastic pool is larger than chunksize (Sub-task, Resolved, Stephen O'Donnell)
          88. Fixing exception handling in case of non positive replica index (Sub-task, Resolved, Swaminathan Balachandran)
          89. EC: ECBlockReconstructedStripeInputStream should set initialized false on re-init (Sub-task, Resolved, Stephen O'Donnell)
          90. EC: ReplicationManager - move the empty container handling into RM from Legacy (Sub-task, Resolved, Siddhant Sangwan)
          91. EC: ReplicationManager - Implement ratis container health checker (Sub-task, Resolved, Stephen O'Donnell)
          92. EC: Close pipelines with unregistered nodes (Sub-task, Resolved, Stephen O'Donnell)
          93. EC: Add EC metrics (Sub-task, Resolved, Aswin Shakil)
          94. EC: Block allocation should not be stripped across the EC group (Sub-task, Resolved, Kaijie Chen)
          95. EC: ReplicationManager - implement deleting container handler (Sub-task, Resolved, Jie Yao)
          96. EC: ReplicationManager - LegacyReplicationManager should use the ContainerReplicaPendingOps service (Sub-task, Resolved, Unassigned)
          97. EC: ReplicationManager - refactor Legacy RM to a container health detector (Sub-task, Resolved, Unassigned)
          98. EC: Add a tool to schedule the EC Offline Reconstruction at any node. (Sub-task, Resolved, Unassigned)
          99. EC: delete empty closed EC container (Sub-task, Resolved, Jie Yao)
          100. EC: ReplicationManager - Add relevant metrics to the various ReplicationManager classes (Sub-task, Resolved, Aswin Shakil)
          101. EC: ReplicationManager - handle UNHEALTHY replicas (Sub-task, Resolved, Siddhant Sangwan)
          102. EC: Add EC block checksum computer (Sub-task, Resolved, Aswin Shakil)
          103. EC: Notify ReplicationManager when a heartbeat updates datanode command counts (Sub-task, Resolved, Stephen O'Donnell)
          104. EC: ReplicationManager: Move Mis-Replicated into a separate unhealthy state (Sub-task, Resolved, Stephen O'Donnell)
          105. Cannot set bucket args when the volume has quota set (Sub-task, Resolved, Stephen O'Donnell)
          106. du command does not return correct disk consumed with replica for both ratis and EC (Sub-task, Resolved, Dave Teng)
          107. EC: Offline Recovery with simultaneous Over Replication & Under Replication (Sub-task, Resolved, Stephen O'Donnell)
          108. EC: Fix Reconstruction Issue with StaleRecoveringContainerScrubbingService (Sub-task, Resolved, Swaminathan Balachandran)
          109. EC: ReplicationManager - refactor logic to send datanode commands into a central place (Sub-task, Resolved, Stephen O'Donnell)
          110. EC: Retry failed writes before rewrite to a new block group (Sub-task, Resolved, Aswin Shakil)
          111. EC: ReplicationManager - remove calls to ECHealthCheck from under and over replication processing (Sub-task, Resolved, Stephen O'Donnell)
          112. Eliminate duplicated config in LegacyReplicationManager (Sub-task, Resolved, Attila Doroszlai)
          113. Add a handler for under replicated Ratis containers in RM (Sub-task, Resolved, Siddhant Sangwan)
          114. EC: Fix the NSSummaryEndpoint#getDiskUsage should be fixed for EC keys (Sub-task, Resolved, Dave Teng)
          115. Extend Placement Policy Interface to select mis-replicated replicas to copy (Sub-task, Resolved, Swaminathan Balachandran)
          116. Add a handler for over replicated Ratis containers to RM (Sub-task, Resolved, Siddhant Sangwan)
          117. EC: Misreplication Handler changes for Placement Policy interface changes (Sub-task, Resolved, Swaminathan Balachandran)
          118. Add subscription mechanism to ContainerReplicaPendingOps (Sub-task, Resolved, Siddhant Sangwan)
          119. ECUnderReplicationHandler does not consider pending adds when finding targets (Sub-task, Resolved, Siddhant Sangwan)
          120. EC: Refactor Unhealthy Replicated Processor (Sub-task, Resolved, Swaminathan Balachandran)
          121. EC: Add debug logging to the Replication Manager check handlers (Sub-task, Resolved, Stephen O'Donnell)
          122. EC: Bug fix for calculating Misreplication Count (Sub-task, Resolved, Swaminathan Balachandran)
          123. EC: ReplicationManager - merge mis-rep queue into under replicated queue (Sub-task, Resolved, Stephen O'Donnell)
          124. MisReplicationHandler does not consider QUASI_CLOSED replicas as sources (Sub-task, Resolved, Stephen O'Donnell)
          125. EC: "Missing" EC containers with some remaining replicas may block decommissioning (Sub-task, Resolved, Stephen O'Donnell)
          126. EC produces some unknown 1MB blocks without the control of deleting service (Sub-task, Resolved, Unassigned)
          127. EC: SCM unregistered event handler for DatanodeCommandCountUpdated (Sub-task, Resolved, Attila Doroszlai)
          128. EC: Enable balancer for EC containers. (Sub-task, Resolved, Hemant Kumar)
          129. EC: Increase the information in the RM sending command log message (Sub-task, Resolved, Stephen O'Donnell)
          130. EC metrics related to replication commands don't add up (Sub-task, Resolved, Stephen O'Donnell)
          131. EC: ECContainerReplicaCount should handle pending delete of unhealthy replicas (Sub-task, Resolved, Stephen O'Donnell)
          132. EC: Handle the placement policy satisfaction in HealthChecks handling (Sub-task, Resolved, Swaminathan Balachandran)
          133. EC: Enhance datanode reconstruction log message (Sub-task, Resolved, Attila Doroszlai)
          134. Placement Policy Interface changes to handle Overreplication (Sub-task, Resolved, Swaminathan Balachandran)
          135. EC: ReplicationManager - Use placementPolicy.replicasToRemoveToFixOverreplication in EC Over replication handler (Sub-task, Resolved, Stephen O'Donnell)
          136. Refactor DefaultReplicationConfig (Sub-task, Resolved, Attila Doroszlai)
          137. EC: GetChecksum for EC files can fail intermittently with IndexOutOfBounds exception (Sub-task, Resolved, Stephen O'Donnell)
          138. EC: Refactor ReplicationSupervisor to allow Replication and Reconstruction tasks (Sub-task, Resolved, Stephen O'Donnell)
          139. EC: Remove ECReconstructionSupervisor and send reconstruction commands to ReplicationSupervisor (Sub-task, Resolved, Stephen O'Donnell)
          140. EC: Add normal and low priority to replication supervisor and commands (Sub-task, Resolved, Stephen O'Donnell)
          141. EC: ECBlockInputStream should try spare replicas on error (Sub-task, Resolved, Stephen O'Donnell)
          142. EC: Change ContainerReplicaPendingOps to store deadline rather than scheduled time (Sub-task, Resolved, Stephen O'Donnell)
          143. [EC] Reconstruction is failing with IndexOutOfBoundsException (Sub-task, Resolved, Uma Maheswara Rao G)
          144. EC: ECPipelineProvider.createForRead should filter out dead replicas and sort replicas (Sub-task, Resolved, Attila Doroszlai)
          145. EC: ECBlockReconstructedStripeInputStream should check for spare replicas before failing an index (Sub-task, Resolved, Attila Doroszlai)
          146. EC: Add validation for EC chunk size (Sub-task, Resolved, Attila Doroszlai)
          147. EC: SCM should throttle the reconstruction/replication tasks. (Sub-task, Resolved, Stephen O'Donnell)
          148. EC: Validate replication config at server-side (Sub-task, Resolved, Attila Doroszlai)
          149. Test in ec/ozonefs.robot is not executed in CI (Sub-task, Resolved, Attila Doroszlai)
          150. EC: Verify unrecoverable EC containers which are empty transition to deleting (Sub-task, Resolved, Stephen O'Donnell)
          151. EC: Offline reconstruction needs better logging (Sub-task, Resolved, Attila Doroszlai)
          152. EC: Reconstruction could fail with orphan blocks. (Sub-task, Resolved, Stephen O'Donnell)
          153. TestContainerCommandsEC should close ECReconstructionCoordinator (Sub-task, Resolved, Attila Doroszlai)
          154. Allow more EC pipelines based on number of volumes (Sub-task, Resolved, Attila Doroszlai)
          155. Optimize getting open pipelines from pipelineManager (Sub-task, Resolved, Stephen O'Donnell)
          156. EC: WritableEcContainerProvider should dynamically adjust the open container groups (Sub-task, Resolved, Stephen O'Donnell)
          157. EC: Avoid O(n) array.remove(element) when filtering pipelines in WritableECContainerProvider (Sub-task, Resolved, Stephen O'Donnell)
          158. Defer non-critical partial EC reconstruction (Sub-task, Resolved, Attila Doroszlai)
          159. EC: Avoid unbounded pipeline creation if all existing pipelines don't meet criteria (Sub-task, Resolved, Stephen O'Donnell)
          160. Use empty BufferPool for EC reconstruction (Sub-task, Resolved, Attila Doroszlai)
          161. Split EC acceptance tests (Sub-task, Resolved, Attila Doroszlai)
          162. Create acceptance test for offline recovery (Sub-task, Resolved, Attila Doroszlai)
          163. UnsupportedOperationException when there are more replication tasks than limit (Sub-task, Resolved, Attila Doroszlai)
          164. java.lang.IllegalArgumentException: ECContainerReconstructionThread (Sub-task, Resolved, Stephen O'Donnell)
          165. ECReconstructionCoordinator is not closed (Sub-task, Resolved, Attila Doroszlai)

            People

              Assignee: Unassigned
              Reporter: Uma Maheswara Rao G
              Votes: 0
              Watchers: 5
