Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-13910

Remove read_repair_chance/dclocal_read_repair_chance

Agile BoardAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Low
    • Resolution: Fixed
    • 3.0.17, 3.11.3, 4.0-alpha1, 4.0
    • None
    • None

    Description

      First, let me clarify so this is not misunderstood that I'm not at all suggesting to remove the read-repair mechanism of detecting and repairing inconsistencies between read responses: that mechanism is imo fine and useful. But the read_repair_chance and dclocal_read_repair_chance have never been about enabling that mechanism, they are about querying all replicas (even when this is not required by the consistency level) for the sole purpose of maybe read-repairing some of the replica that wouldn't have been queried otherwise. Which btw, bring me to reason 1 for considering their removal: their naming/behavior is super confusing. Over the years, I've seen countless users (and not only newbies) misunderstanding what those options do, and as a consequence misunderstand when read-repair itself was happening.

      But my 2nd reason for suggesting this is that I suspect read_repair_chance/dclocal_read_repair_chance are, especially nowadays, more harmful than anything else when enabled. When those option kick in, what you trade-off is additional resources consumption (all nodes have to execute the read) for a fairly remote chance of having some inconsistencies repaired on some replica a bit faster than they would otherwise be. To justify that last part, let's recall that:

      1. most inconsistencies are actually fixed by hints in practice; and in the case where a node stay dead for a long time so that hints ends up timing-out, you really should repair the node when it comes back (if not simply re-bootstrapping it). Read-repair probably don't fix that much stuff in the first place.
      2. again, read-repair do happen without those options kicking in. If you do reads at QUORUM, inconsistencies will eventually get read-repaired all the same. Just a tiny bit less quickly.
      3. I suspect almost everyone use a low "chance" for those options at best (because the extra resources consumption is real), so at the end of the day, it's up to chance how much faster this fixes inconsistencies.

      Overall, I'm having a hard time imagining real cases where that trade-off really make sense. Don't get me wrong, those options had their places a long time ago when hints weren't working all that well, but I think they bring more confusion than benefits now.

      And I think it's sane to reconsider stuffs every once in a while, and to clean up anything that may not make all that much sense anymore, which I think is the case here.

      Tl;dr, I feel the benefits brought by those options are very slim at best and well overshadowed by the confusion they bring, and not worth maintaining the code that supports them (which, to be fair, isn't huge, but getting rid of ReadCallback.AsyncRepairRunner wouldn't hurt for instance).

      Lastly, if the consensus here ends up being that they can have their use in weird case and that we fill supporting those cases is worth confusing everyone else and maintaining that code, I would still suggest disabling them totally by default.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            aleksey Aleksey Yeschenko Assign to me
            slebresne Sylvain Lebresne
            Aleksey Yeschenko
            Blake Eggleston
            Votes:
            1 Vote for this issue
            Watchers:
            27 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment