  1. Apache Ozone
  2. HDDS-6462

Phase II : Erasure Coding Offline Recovery & Read/Write Improvements


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      This is an umbrella Jira for EC offline recovery work.

      In Phase I, we finished the erasure coding write and read functionality under HDDS-3816. That work is being stabilized in a parallel effort.

      This Jira therefore tracks the pending recovery work needed to finish the end-to-end EC MVP.

      Requirements in brief: 

      1. The SCM identifies lost containers and schedules them for reconstruction.
      2. DNs start reconstructing the containers upon request from the SCM.
      3. We can decide whether to create a new RM at the SCM for EC work or reuse the existing one. Currently there is interest in starting a new RM so that it starts clean, as the existing one is already complex enough.
      4. DNs figure out the blocks themselves by interacting with multiple EC block containers, as a single EC container may not have the full set of blocks. Either the first container or the parity containers should have the full block set.
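
      As a rough illustration of requirements 1 and 4, the sketch below shows how lost replica indexes could be derived from the surviving ones, and how a coordinating DN could merge the block lists reported by several containers. It is not Ozone code: ECConfig, missing_indexes, is_recoverable and merge_block_lists are hypothetical names, replica indexes are assumed to be 1-based, and the recoverability check assumes a Reed-Solomon style code where any `data` distinct surviving indexes are enough to rebuild the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ECConfig:
    """Hypothetical EC replication config, e.g. 3 data + 2 parity."""
    data: int
    parity: int

    @property
    def required(self) -> int:
        # Total number of replica indexes in one EC container group.
        return self.data + self.parity


def missing_indexes(config, present):
    """Return the replica indexes (assumed 1-based) with no surviving replica."""
    return sorted(set(range(1, config.required + 1)) - set(present))


def is_recoverable(config, present):
    """Reconstruction needs at least `data` distinct surviving indexes."""
    valid = {i for i in present if 1 <= i <= config.required}
    return len(valid) >= config.data


def merge_block_lists(block_lists):
    """Union the block IDs reported by each surviving replica.

    block_lists maps replica index -> iterable of block IDs. A single
    replica may not hold every block, so the union is taken instead of
    trusting one replica.
    """
    merged = set()
    for blocks in block_lists.values():
        merged.update(blocks)
    return sorted(merged)


config = ECConfig(data=3, parity=2)
present = [1, 3, 4]
print(missing_indexes(config, present))  # [2, 5]
print(is_recoverable(config, present))   # True
print(merge_block_lists({1: [101, 102, 103], 3: [101], 4: [101, 102, 103]}))
# [101, 102, 103]
```

      In this example index 3 holds only block 101, which is why the merged list is built from all surviving replicas rather than from one data replica.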

      I am splitting the offline recovery part of the design out of HDDS-3816 and will post it here soon.

      Stay tuned for the updated doc.

      We will also create a new branch for this work in some time.

        Issue Links

          #. Summary | Type | Status | Assignee
          1. EC: Fix Datanode block file INCONSISTENCY during heavy load. | Sub-task | Resolved | Mark Gui
          2. EC: EC pipeline records are not removed after close. | Sub-task | Resolved | Mark Gui
          3. EC: Make ECBlockReconstructedStripeInputStream to be used by DNs as well | Sub-task | Resolved | Attila Doroszlai
          4. EC: [Refactor-2] Check and write parity cells inside handleDataWrite | Sub-task | Resolved | Kaijie Chen
          5. EC: Scm CheckAndRecoverECContainer command | Sub-task | Resolved | cchenaxchen
          6. EC: Implement the EC Reconstruction Command with necessary information | Sub-task | Resolved | Uma Maheswara Rao G
          7. Add a new replication manager and change the existing one to legacy | Sub-task | Resolved | Jie Yao
          8. EC: Add BlockGroupLen info as part of PutBlock in EC Writes for helping in recovery. | Sub-task | Resolved | Uma Maheswara Rao G
          9. EC: Refactor ECKeyOutputStream for better code reuse | Sub-task | Resolved | Kaijie Chen
          10. EC: Add rpc for EC recovery in replication service | Sub-task | Resolved | Kaijie Chen
          11. EC: ReplicationManager - create version of ContainerReplicaCounts applicable to EC | Sub-task | Resolved | Stephen O'Donnell
          12. EC: Add EC pipeline minimum to MiniOzoneCluster | Sub-task | Resolved | Kaijie Chen
          13. EC: Add the DN side Reconstruction Handler class. | Sub-task | Resolved | Uma Maheswara Rao G
          14. EC: DN ability to create container in temp location and write blocks to it. | Sub-task | Resolved | Mark Gui
          15. EC: Support ListBlock from CoordinatorDN | Sub-task | Resolved | Kaijie Chen
          16. EC: ReplicationManager - create ContainerReplicaPendingOps class and integrate with ContainerManager | Sub-task | Resolved | Stephen O'Donnell
          17. EC: ReplicationManager - make ContainerReplicaPendingOps into a SCM service | Sub-task | Resolved | Jie Yao
          18. EC: PipelineStateMap#addPipeline should not have precondition checks post db updates | Sub-task | Resolved | Uma Maheswara Rao G
          19. SCMContainerPlacementRackScatter should use original required node num to validate placement policy | Sub-task | Resolved | Jie Yao
          20. EC: Fix datanode exclusion check in client | Sub-task | Resolved | Kaijie Chen
          21. EC: DN ability to create RECOVERING containers for EC reconstruction. | Sub-task | Resolved | Uma Maheswara Rao G
          22. EC: Add listBlock command MockDatanodeStorage for mocking in reconstruction work. | Sub-task | Resolved | Uma Maheswara Rao G
          23. EC: Extend BlockReconstructedInputStreams to recover parity block buffers as well if missing | Sub-task | Resolved | Attila Doroszlai
          24. Need proper error message when "RATIS" replication-type is passed with EC codec | Sub-task | Resolved | Kaijie Chen
          25. EC: getFileCheckSum should return null EC files until ECFileChecksum implemented. | Sub-task | Resolved | Uma Maheswara Rao G
          26. EC: SCMContainerPlacementRackScatter#chooseDatanodes may choose less nodes than required in unknown cases. | Sub-task | Resolved | Attila Doroszlai
          27. EC: Validate the server default configuration on Ozone manager startup. | Sub-task | Resolved | Aswin Shakil Balasubramanian
          28. EC: Provide correct example for EC in ozone.server.default.replication | Sub-task | Resolved | Swaminathan Balachandran
          29. EC: Fix potential wrong replica read with over-replicated container. | Sub-task | Resolved | Mark Gui
          30. EC: Standalone containers should not move to quasi closed | Sub-task | Resolved | Stephen O'Donnell
          31. EC: ReplicationManager - create class to detect container health issues | Sub-task | Resolved | Stephen O'Donnell
          32. EC: Implement the EC Reconstruction coordinator | Sub-task | Resolved | Uma Maheswara Rao G
          33. EC: Analyze and add the block token support for ECReconstructionCoordinator | Sub-task | Resolved | Attila Doroszlai
          34. EC: Handle reconstructECContainersCommand in heartbeat | Sub-task | Resolved | Attila Doroszlai
          35. EC: ReplicationManager - collect under and over replicated containers | Sub-task | Resolved | Stephen O'Donnell
          36. EC: EC Reconstruction Command count queues should be included in DN heartbeat | Sub-task | Resolved | Uma Maheswara Rao G
          37. EC: ReplicationManager - Add class to handle under-replication and form a command for a datanode | Sub-task | Resolved | Uma Maheswara Rao G
          38. EC: ReplicationManager - create class to form a replicate command for under replicated containers | Sub-task | Resolved | Uma Maheswara Rao G
          39. EC: Ensure DatanodeAdminMonitor can handle EC containers during decommission | Sub-task | Resolved | Attila Doroszlai
          40. EC: Remove references to ContainerReplicaPendingOps in TestECContainerReplicaCount | Sub-task | Resolved | Stephen O'Donnell
          41. EC: [Code Quality] Add more tests to get closure to 100% code coverage | Sub-task | Resolved | Unassigned
          42. EC: ReplicationManager - priortise under replicated containers | Sub-task | Resolved | Stephen O'Donnell
          43. EC: Implement the Over-replication Handler | Sub-task | Resolved | Jie Yao
          44. EC: Analyze and add putBlock even on non writing node in the case of partial single stripe. | Sub-task | Resolved | Uma Maheswara Rao G
          45. EC: Attempt to cleanup the RECOVERING container when reconstruction failed at coordinator. | Sub-task | Resolved | Uma Maheswara Rao G
          46. EC: Cleanup RECOVERING container on DN restarts. | Sub-task | Resolved | Jie Yao
          47. EC: put key command with EC replication can use ReplicationConfig validator | Sub-task | Resolved | Swaminathan Balachandran
          48. EC: Add Test for RECOVERING container cleanup when failure. | Sub-task | Resolved | Uma Maheswara Rao G
          49. EC: Skip the EC container for balancer | Sub-task | Resolved | Siddhant Sangwan
          50. EC: CreateBucketHandler should use ReplicationConfig Validator | Sub-task | Resolved | Attila Doroszlai
          51. EC: Implement RECOVERING Container Scrubber. | Sub-task | Resolved | Jie Yao
          52. EC: ReplicationManager - Logic to process the over replicated queues and assign work to DNs | Sub-task | Resolved | Stephen O'Donnell
          53. EC: ReplicationManager - skip processing open containers | Sub-task | Resolved | Stephen O'Donnell
          54. EC: Implement the Over replication Processor | Sub-task | Resolved | Uma Maheswara Rao G
          55. EC: Add ec write channel | Sub-task | Resolved | cchenaxchen
          56. EC : support EC stripe shuffle | Sub-task | Resolved | Unassigned
          57. EC: improve write performance of using double buffer | Sub-task | Resolved | cchenaxchen
          58. EC: Ensure replica index is maintained when replicating a container | Sub-task | Resolved | Stephen O'Donnell
          59. EC: ReplicationManager - Over replication handler should set repIndex on delete cmds | Sub-task | Resolved | Stephen O'Donnell
          60. NPE in ec.reconstruction.TokenHelper | Sub-task | Resolved | Attila Doroszlai
          61. Exception in Replication Monitor Thread: java.lang.IllegalArgumentException | Sub-task | Resolved | Unassigned
          62. EC - ReplicationManager - handle maintenance only indexes in the under replication handler | Sub-task | Resolved | Unassigned
          63. EC: Prematurely re-throwed the exception in reconstruction cleanup loop. | Sub-task | Resolved | Uma Maheswara Rao G
          64. EC: Add debug logging with exception info when stripe write failed | Sub-task | Resolved | Swaminathan Balachandran
          65. Support balancing EC container. | Sub-task | Resolved | Jie Yao
          66. [Ozone EC] remove warnings and errors from console during online reconstruction of data. | Sub-task | Resolved | Swaminathan Balachandran
          67. EC: ReplicationManager - Track nodes already used when handing under replication | Sub-task | Resolved | Stephen O'Donnell
          68. EC: Fix block deletion not allowed due to missing pipelineID | Sub-task | Resolved | Kaijie Chen
          69. EC: ReplicationManager - UnderRep handler should handle duplicate indexes | Sub-task | Resolved | Stephen O'Donnell
          70. EC: Fix offset Condition in ECKeyOutputStream | Sub-task | Resolved | Swaminathan Balachandran
          71. EC: DN reported Open EC container may not get closed if SCM container was already closed state? | Sub-task | Resolved | Aswin Shakil Balasubramanian
          72. EC: Add tests for erasure coding with MiniOzoneChaosCluster | Sub-task | In Progress | Nilotpal Nandi
          73. EC: Handle maintenance replicas in ECUnderReplicationHandler | Sub-task | In Progress | Siddhant Sangwan
          74. EC: Add EC metrics | Sub-task | Open | Aswin Shakil Balasubramanian
          75. EC: Investigate use of container token from SCMCommand | Sub-task | Open | Attila Doroszlai
          76. EC: Handle the placement policy satisfaction in HealthChecks handling | Sub-task | Open | Swaminathan Balachandran
          77. EC: Handle the placement policy check in ECUnderReplicationHandler | Sub-task | Open | Swaminathan Balachandran
          78. EC: ReplicationManager - move the empty container handling into RM from Legacy | Sub-task | Open | Attila Doroszlai
          79. EC: Update the EC documentation about FS clients. | Sub-task | Open | Aswin Shakil Balasubramanian
          80. EC: Integration test to verify the Offline Container Recovery functionality. | Sub-task | Open | Aswin Shakil Balasubramanian
          81. EC: ReplicationManager - split moveSchudler from legacy RM | Sub-task | Open | Jie Yao
          82. EC: WritableEcContainerProvider should dynamically adjust the open container groups | Sub-task | Open | Unassigned
          83. EC: Adopt the native EC part of libhdfs in a libozone and use that one. | Sub-task | Open | Marton Elek
          84. EC: Enhance EC replication config parsing from string and enable validation of it | Sub-task | Open | István Fajth
          85. EC: Add EC forward compat tests | Sub-task | Open | István Fajth
          86. EC: ReplicationManager - LegacyReplicationManager should use the ContainerReplicaPendingOps service | Sub-task | Open | Unassigned
          87. EC: ReplicationManager - refactor Legacy RM to a container health detector | Sub-task | Open | Unassigned
          88. EC: ReplicationManager - extend EC Container health check for mis-replication | Sub-task | Open | Unassigned
          89. EC: ReplicationManager - Add relevant metrics to the various ReplicationManager classes | Sub-task | Open | Unassigned
          90. EC: decommission compatible offline recovery | Sub-task | Open | cchenaxchen
          91. EC: Add EC block checksum computer | Sub-task | Open | Aswin Shakil Balasubramanian
          92. EC: Define the value of Maintenance Redundancy for EC containers | Sub-task | Open | Stephen O'Donnell
          93. EC: Add a tool to schedule the EC Offline Reconstruction at any node. | Sub-task | Open | Unassigned
          94. EC: Recon UI to expose the options for setting ECRepConfig on bucket. | Sub-task | Open | Unassigned
          95. EC: ReplicationManager - Implement ratis container health checker | Sub-task | Patch Available | Jie Yao
          96. EC: delete empty closed EC container | Sub-task | In Progress | Jie Yao
          97. EC: Offline Recovery with simultaneous Over Replication & Under Replication | Sub-task | In Progress | Swaminathan Balachandran
          98. EC: ReplicationManager - handle UNHEALTY replicas | Sub-task | Open | Siddhant Sangwan
          99. Add validation for chunk size option when using ozone CLI to create new bucket | Sub-task | Open | Unassigned

            People

              Assignee: Uma Maheswara Rao G
              Reporter: Uma Maheswara Rao G
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: