  1. Apache Ozone
  2. HDDS-9323

Apply expiry of excluded datanodes to writing Ratis keys

Details

    Description

      Better datanode exclude list handling for long-lived clients

      Currently, a long-lived client can add most or all of the nodes in a small cluster to its exclude list, after which further writes using that client instance will fail. There are two ways this can be improved:

      1. A timeout after which nodes are removed from the exclude list so that they can be retried. For EC, this exists and is configured to 10 minutes by default. Ratis does not currently have this, but it should be added. (this task)
      2. Allow the write to fall back to nodes in the exclude list if that is all that is available. This could be implemented on the server side, or as a retry from the client based on the server's initial response. (extracted to HDDS-9551)
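
      The timeout in item 1 can be illustrated with a minimal sketch. The class `ExpiringExcludeList`, its method names, and the injected clock are hypothetical and written only for this illustration; they are not the Ozone client's actual exclude list API. The sketch assumes each excluded datanode is identified by a string ID and that an entry is dropped once it has been excluded for at least the configured expiry.

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of an exclude list whose entries expire.
 * Not the Ozone client's real class; names are illustrative only.
 */
final class ExpiringExcludeList {
  // Datanode ID -> time (in millis) at which the node was excluded.
  private final Map<String, Long> excludedAtMillis = new ConcurrentHashMap<>();
  private final long expiryMillis;
  private final LongSupplier clockMillis;

  ExpiringExcludeList(Duration expiry, LongSupplier clockMillis) {
    if (expiry.isZero() || expiry.isNegative()) {
      throw new IllegalArgumentException("expiry must be positive");
    }
    this.expiryMillis = expiry.toMillis();
    this.clockMillis = clockMillis;
  }

  /** Records a datanode as excluded, starting (or restarting) its timer. */
  void exclude(String datanodeId) {
    excludedAtMillis.put(datanodeId, clockMillis.getAsLong());
  }

  /** Drops expired entries, then returns a snapshot of what is still excluded. */
  Set<String> currentlyExcluded() {
    long now = clockMillis.getAsLong();
    excludedAtMillis.values().removeIf(since -> now - since >= expiryMillis);
    return new HashSet<>(excludedAtMillis.keySet());
  }
}
```

      A long-lived client would construct this with something like `new ExpiringExcludeList(Duration.ofMinutes(10), System::currentTimeMillis)` and send only `currentlyExcluded()` with each allocation request, so a node that failed once becomes eligible again after the expiry instead of staying excluded for the lifetime of the client.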

      These issues are especially relevant for the S3 gateway, which uses a persistent Ozone client to connect to the cluster for as long as the gateway is up.

            People

              dteng Dave Teng
              erose Ethan Rose