Cassandra / CASSANDRA-15566

Repair coordinator can hang under some cases

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: Consistency/Repair
    • Labels:
      None
    • Change Category:
      Operability
    • Complexity:
      Normal
    • Platform:
      All
    • Impacts:
      None

      Description

      Repair coordination makes a few assumptions about message delivery, and it hangs forever when they do not hold. The first is that a fire-and-forget message will not be rejected (in practice, a participant can have an issue and reject the message). The second is that a very delayed message will eventually be seen (in practice, messages can be dropped under load, or when the failure detector thinks a node is down while it is only in a GC pause).

      Given this and the desire for better repair observability (see CASSANDRA-15399), coordination should be changed to a request/response pattern (with retries) plus polling (for validation status and MerkleTree sending). This would allow the coordinator to detect changes in state (for example, a participant was known to be working on validation but no longer knows about the validation task) and to recover from ephemeral issues.
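
      A minimal sketch of the two proposed patterns, to make the idea concrete. Everything here is hypothetical: the class RepairCoordinationSketch, the ValidationStatus enum, and both method names are illustrative assumptions and are not Cassandra's repair or messaging API. The first method retries a request until it is acknowledged or the attempt budget runs out; the second polls a participant's validation status and returns as soon as the status is no longer RUNNING, so an UNKNOWN answer (the participant lost track of the task) surfaces instead of hanging.

```java
import java.util.Optional;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch only: none of these names come from Cassandra itself.
final class RepairCoordinationSketch {

    // Assumed states a participant could report for a validation task.
    enum ValidationStatus { RUNNING, COMPLETED, FAILED, UNKNOWN }

    /**
     * Request/response with retries. The supplier models one send attempt and
     * returns Optional.empty() when no acknowledgement arrived in time.
     */
    static <T> T requestWithRetry(Supplier<Optional<T>> request, int maxAttempts, long backoffMillis)
            throws InterruptedException, TimeoutException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> response = request.get();
            if (response.isPresent()) {
                return response.get();
            }
            if (attempt < maxAttempts) {
                // Linear backoff between attempts.
                Thread.sleep(backoffMillis * attempt);
            }
        }
        // Bounded failure instead of waiting forever for a message that was dropped.
        throw new TimeoutException("no response after " + maxAttempts + " attempts");
    }

    /**
     * Polls the participant until the validation leaves the RUNNING state.
     * UNKNOWN is returned to the caller like any other non-running status,
     * so the coordinator can decide to restart or fail the repair.
     */
    static ValidationStatus pollValidation(Supplier<ValidationStatus> statusCheck, int maxPolls, long intervalMillis)
            throws InterruptedException, TimeoutException {
        for (int poll = 0; poll < maxPolls; poll++) {
            ValidationStatus status = statusCheck.get();
            if (status != ValidationStatus.RUNNING) {
                return status;
            }
            Thread.sleep(intervalMillis);
        }
        throw new TimeoutException("validation still running after " + maxPolls + " polls");
    }
}
```

      In this sketch both loops are bounded, so a dropped or rejected message turns into a TimeoutException or an explicit status rather than an indefinite wait. A real implementation would more likely be asynchronous; the blocking form is used here only to keep the example short.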

              People

              • Assignee:
                dcapwell David Capwell
                Reporter:
                dcapwell David Capwell
                Authors:
                David Capwell
              • Votes:
                0
                Watchers:
                3
