Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I'd love to use Apache Kafka, but for my application data loss is not acceptable. Even at the expense of availability (i.e. I need C not A in CAP).

      I think there are two things that I need to change to get a quorum model:

      1) Make sure I set request.required.acks to 2 (for a 3 node cluster) or 3 (for a 5 node cluster) on every request, so that I can only write if a quorum is active.

      2) Prevent the behaviour where a non-ISR can become the leader if all ISRs die. I think this is as easy as tweaking core/src/main/scala/kafka/controller/PartitionLeaderSelector.scala, essentially to throw an exception around line 64 in the "data loss" case.
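      The behaviour proposed in (2) can be sketched as follows. This is a minimal Python illustration of the idea, not Kafka's actual Scala code in PartitionLeaderSelector.scala; all names here are hypothetical:

      ```python
      # Hypothetical sketch of "refuse unsafe election": if no in-sync
      # replica is alive, halt the partition instead of electing a
      # non-ISR replica (the "data loss" case).

      class NoLiveIsrError(Exception):
          """No in-sync replica is alive; refuse unsafe leader election."""

      def select_leader(live_replicas, isr):
          """Pick a new leader only from live in-sync replicas.

          live_replicas: broker ids currently alive
          isr: broker ids in the in-sync replica set
          """
          live_isr = [b for b in live_replicas if b in isr]
          if not live_isr:
              # Electing any live non-ISR replica here could silently
              # discard messages the old ISR had committed.
              raise NoLiveIsrError("no live ISR; halting partition "
                                   "rather than risking data loss")
          return live_isr[0]
      ```

      With this rule the partition becomes unavailable when the ISR is gone (choosing C over A), which is exactly the trade-off described above.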

      I haven't yet implemented / tested this. I'd love to get some input from the Kafka-experts on whether my plan is:
      (a) correct - will this work?
      (b) complete - have I missed any cases?
      (c) recommended - is this a terrible idea

      Thanks for any pointers!

        Activity

        Justin SB created issue -
        Jay Kreps added a comment -

        I think there may be some confusion. Does this section of the documentation help:
        http://kafka.apache.org/documentation.html#replication

        Jay Kreps added a comment -

        Or actually, perhaps I am confused. The data loss case is (2) and it would definitely be good to implement a case where we wait on an in-sync replica rather than allowing an unsafe election. I don't think (1) actually helps though. Basically in a majority vote scenario (which I think you are calling quorum, though I think both approaches are quorums) this is equivalent to throwing an exception if you can only get 3/5 to commit. In a sense this does increase the chance of durability since more servers wrote it, but it is also a bit odd to intentionally fail your own successful writes.

        Justin SB added a comment -

        You're definitely right, Jay, that I'm conflating a few ideas. I may also be more deeply confused.

        #2 is definitely the really unacceptable scenario in my book. For my use case, I can't allow a non-ISR to become the leader, because that is certain to involve data loss (by definition, I think).

        You're right that #1 is just ensuring that we can still tolerate failures, when we are imposing #2. Without it, we'd likely get into a scenario where e.g. only one node was alive, and if we allowed it to make progress then we wouldn't be able to recover from failure of that node.

        I think you're right, that I'm really trying to get majority vote semantics. I don't see why I'd be intentionally failing successful writes though. Does the leader count in "request.required.acks"? If I can write to 3/5, I do want to treat that as a success. I also want 2/5 to be considered a failure. I think I get that by setting request.required.acks=3, though maybe I need to set request.required.acks=2 if the leader is not counted as an ack. And maybe I'm just reading the ack-counting code wrong generally...
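        The arithmetic in question can be sketched quickly. Assuming the leader does count toward request.required.acks, then on a 5-replica partition acks=3 means leader plus two followers, i.e. a strict majority (the function name below is illustrative, not a Kafka API):

        ```python
        # Sketch of the quorum arithmetic: an acknowledged write is
        # "majority-committed" when the acked copies (leader included)
        # form a strict majority of the replication factor.

        def is_majority_ack(acks_received, replication_factor):
            """True when acked copies form a strict majority."""
            return acks_received >= replication_factor // 2 + 1
        ```

        Under this reading, 3/5 acks is a success and 2/5 is a failure, which matches the semantics being asked for.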

        It's also occurred to me that I would probably need to add rollback, as the current Kafka model wouldn't ever rollback a write on the leader because of a lack of sufficient acks (it would just remove the replicas instead)?

        Neha Narkhede added a comment -

        For 2, we have KAFKA-1028 filed and I had some initial work going on it. Let me clean it up and upload a patch.

        >> Does the leader count in "request.required.acks"?

        Yes.

        Jay Kreps added a comment -

        Hey Justin, yeah the two things I wanted to clarify:

        1. Consider two setups: Kafka with replication factor 2 and 1 failed node, zookeeper with replication factor 3 and 1 failed node. Isn't taking writes on either of these equally dangerous in the sense that one more failure will leave you in an unrecoverable situation? You refer to "majority vote semantics" but I think the semantics (disregarding unsafe leader election) are the same, no? Can you clarify what you are looking for?

        2. Yeah, it's worth clarifying that what we report back is just whether the write is "guaranteed". Not guaranteed is not the same as "did not occur". So, for example, if the client issues a write and then dies or becomes partitioned from the cluster, the write may succeed even though the ack cannot be sent back to the producer. So in the case of acks=3 with only 2 available servers we would tell the client "sorry, we couldn't get 3x replication during the time we waited", but that doesn't mean we guarantee the write didn't happen; it just means we got fewer than 3.
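        The distinction in point 2 can be sketched as a small classifier over produce outcomes. This is illustrative only (the names are not Kafka API); the key point is that a failed ack means "replication target not met in time", not "the write did not occur":

        ```python
        # Sketch of ack semantics from the producer's point of view:
        # only the success case is a durability guarantee; every other
        # outcome leaves the write's fate unknown, not undone.

        def ack_outcome(acks_received, required_acks):
            """Classify a produce attempt by its ack count."""
            if acks_received >= required_acks:
                return "guaranteed"          # replicated to >= required_acks copies
            if acks_received > 0:
                return "unknown-durability"  # written somewhere, below target
            return "unknown"                 # ack lost, or never applied
        ```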

        Neha Narkhede made changes -
        Assignee: Neha Narkhede [ nehanarkhede ]
        Jay Kreps added a comment -

        I think both of these are done on other tickets.

        Jay Kreps made changes -
        Status: Open → Resolved
        Resolution: Fixed

          People

          • Assignee: Neha Narkhede
          • Reporter: Justin SB
          • Votes: 1
          • Watchers: 4
