Apache Ozone / HDDS-10239

Storage Container Reconciliation

Details

    Description

      Ideally, a healthy Ozone cluster would contain only open and closed containers. In practice, container replicas commonly end up in a mix of states, including quasi-closed and unhealthy, that the current system cannot resolve to cleanly closed replicas. These states are often caused by bugs or by overly broad failure handling on the write path. While we should fix those causes, they expose a deeper problem: Ozone cannot reconcile mismatched container states on its own, regardless of how they arose. This has led to significant complexity in the replication manager, which must handle cases where only quasi-closed and unhealthy replicas are available, especially during decommissioning.

      Even when all replicas are closed, the system assumes they are identical and has no way to verify this. Checksums are computed for individual chunks within each container, but if two container replicas end up with chunks that differ in length or content despite both being marked closed and passing their local checksum checks, the system cannot detect or resolve the discrepancy.

      This Jira proposes a container reconciliation protocol to solve these problems. After implementing the proposal:
      1. It should be possible for a cluster to progress to a state where it has only properly replicated closed and open containers.
      2. We can verify the equality and integrity of all closed containers.

      The design doc is linked here as a markdown pull request for inline comments.
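
      The sub-tasks on this issue refer to a per-container merkle tree that datanodes build and compare. As a rough illustration of how such a comparison could detect replicas that disagree even though each one passes its local chunk checksums, here is a minimal sketch. It is not the design from the linked document: the class name, the method names, the in-memory model (block ID mapped to an ordered list of chunk checksums), and the use of CRC32 as the tree hash are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.CRC32;

// Hypothetical sketch: a two-level hash tree over one container replica.
// Leaves are chunk checksums, inner nodes are block hashes, the root is
// the container hash. None of these names come from the Ozone code base.
public class ContainerMerkleSketch {

    // Hash a sequence of longs with CRC32 (a stand-in hash for this sketch).
    static long hashLongs(List<Long> values) {
        CRC32 crc = new CRC32();
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        for (long v : values) {
            buf.clear();
            buf.putLong(v);
            crc.update(buf.array(), 0, Long.BYTES);
        }
        return crc.getValue();
    }

    // Block hash: covers the ordered chunk checksums of one block.
    static long blockHash(List<Long> chunkChecksums) {
        return hashLongs(chunkChecksums);
    }

    // Container hash: covers every (blockId, blockHash) pair in blockId order.
    static long containerHash(Map<Long, List<Long>> blocks) {
        List<Long> nodes = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> e : new TreeMap<>(blocks).entrySet()) {
            nodes.add(e.getKey());
            nodes.add(blockHash(e.getValue()));
        }
        return hashLongs(nodes);
    }

    // Returns the IDs of blocks that are missing on one side or whose
    // block hashes differ. Equal container hashes short-circuit to "no diff".
    static List<Long> divergentBlocks(Map<Long, List<Long>> local,
                                      Map<Long, List<Long>> peer) {
        List<Long> result = new ArrayList<>();
        if (containerHash(local) == containerHash(peer)) {
            return result;
        }
        TreeSet<Long> ids = new TreeSet<>(local.keySet());
        ids.addAll(peer.keySet());
        for (long id : ids) {
            List<Long> l = local.get(id);
            List<Long> p = peer.get(id);
            if (l == null || p == null || blockHash(l) != blockHash(p)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

      In this model, two replicas are considered equal when their container hashes match, and a mismatch narrows the repair work to the blocks returned by divergentBlocks. A real implementation would need a stronger hash and a way to exchange the tree between datanodes, both of which are outside this sketch.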

        Issue Links

          1. Add GitHub actions labeler for the reconciliation feature branch [Sub-task, Resolved, Ethan Rose]
          2. Datanode reports Merkle Tree container summary to SCM during heartbeats [Sub-task, Resolved, Unassigned]
          3. SCM and Datanode communication for reconciliation [Sub-task, Resolved, Ethan Rose]
          4. Container Scanner should still scan unhealthy containers [Sub-task, Resolved, Aswin Shakil]
          5. Implement a basic Merkle Tree Manager [Sub-task, Resolved, Ethan Rose]
          6. Implement framework for capturing Merkle Tree Metrics [Sub-task, Resolved, Aswin Shakil]
          7. Block deletion should update container merkle tree [Sub-task, Resolved, Ethan Rose]
          8. Add a Datanode API to supply a merkle tree for a given container [Sub-task, Resolved, Aswin Shakil]
          9. Reconcile commands should be handled by datanode ReplicationSupervisor [Sub-task, Resolved, Ethan Rose]
          10. Datanodes should generate initial container merkle tree during container close [Sub-task, Resolved, Aswin Shakil]
          11. Handle corrupted merkle tree files [Sub-task, Resolved, Ethan Rose]
          12. Container scanner should keep scanning after non-fatal errors [Sub-task, Resolved, Ethan Rose]
          13. Allow datanodes to do chunk level modifications to closed containers [Sub-task, Resolved, Aswin Shakil]
          14. Implement container comparison and repair logic within datanodes [Sub-task, Patch Available, Aswin Shakil]
          15. Add new tests for container scanner detecting multiple errors in one container [Sub-task, Patch Available, Ethan Rose]
          16. Make container scanner generate merkle trees during the scan [Sub-task, Open, Ethan Rose]
          17. Coordinate container reconciliation with container deletion and replication [Sub-task, Open, Unassigned]
          18. Allow reconciliation and scanner to move replicas out of the UNHEALTHY state [Sub-task, Open, Aswin Shakil]
          19. Handle backwards compatibility for containers created before reconciliation [Sub-task, Open, Unassigned]
          20. Consider allowing reconciliation when not all replicas have reached closed state [Sub-task, Open, Unassigned]
          21. SCMExceptions resulting from admin CLI commands are treated as retriable [Sub-task, Open, Unassigned]
          22. Restrict reconciliation requests by datanode status [Sub-task, Open, Unassigned]
          23. Extend container repair capabilities to the block level [Sub-task, Open, Unassigned]
          24. Combine datanode clients for reconciliation and EC reconstruction [Sub-task, Open, Unassigned]
          25. Basic SCM co-ordination [Sub-task, Open, Unassigned]
          26. Optimize checksum calculations in container merkle tree [Sub-task, Open, Ritesh Shukla]
          27. Add metrics specific to reconciliation tasks [Sub-task, Open, Unassigned]
          28. Use zero-copy for readMerkleTree API [Sub-task, Open, Unassigned]

            People

              erose Ethan Rose
              erose Ethan Rose
              Votes: 0
              Watchers: 5
