  Kafka / KAFKA-14402

Transactions Server Side Defense

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: 4.0.0
    • Component/s: None
    • Labels: None

    Description

      We have seen hanging transactions in Kafka. When a transaction hangs, the last stable offset (LSO) stops advancing, the log cannot be cleaned (if the topic is compacted), and read_committed consumers get stuck.

      This can happen when a message is delayed by networking issues or a network partition, the transaction aborts, and the delayed message then arrives. The delayed message can also violate exactly-once semantics (EOS) if it arrives after the next addPartitionsToTxn request: a message from a previous (aborted) transaction may effectively become part of the next transaction.

      Hanging transactions can also occur when a buggy client writes to a partition before adding that partition to the transaction. In both cases, we want the server to be able to reject these incorrect records before they are written, so that they neither cause hanging transactions nor violate EOS by landing in the wrong transaction.
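
      The late-arrival case can be illustrated with a toy model of a single partition. This is only a sketch: `PartitionLog`, its methods, and the producer ids 7 and 8 are hypothetical names chosen for illustration, not Kafka's actual log implementation. It assumes the LSO is the first offset of the earliest open transaction, or the log end offset when no transaction is open.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy model of one partition (hypothetical, not Kafka's real log)."""
    records: list = field(default_factory=list)          # (offset, producer_id, kind)
    open_txn_start: dict = field(default_factory=dict)   # producer_id -> first offset of its open txn

    def append_data(self, producer_id):
        """Append a transactional record; it opens a transaction if none is open."""
        offset = len(self.records)
        self.records.append((offset, producer_id, "data"))
        self.open_txn_start.setdefault(producer_id, offset)
        return offset

    def append_marker(self, producer_id, kind):
        """Append a commit/abort marker, which closes the producer's open transaction."""
        offset = len(self.records)
        self.records.append((offset, producer_id, kind))
        self.open_txn_start.pop(producer_id, None)
        return offset

    def last_stable_offset(self):
        """First offset of the earliest open transaction, else the log end offset."""
        if self.open_txn_start:
            return min(self.open_txn_start.values())
        return len(self.records)


log = PartitionLog()
log.append_data(7)                    # producer 7 writes inside its transaction
log.append_marker(7, "abort")         # the transaction aborts
lso_after_abort = log.last_stable_offset()

log.append_data(7)                    # the delayed record finally arrives
lso_after_late_write = log.last_stable_offset()

log.append_data(8)                    # another producer completes a whole transaction
log.append_marker(8, "commit")
lso_after_other_txn = log.last_stable_offset()
log_end = len(log.records)
```

      In this model no marker will ever be written for the late record, because the coordinator already considers that transaction finished. The LSO therefore stays pinned at the late record's offset even though the log keeps growing.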

      The best way to avoid this issue is to:

      1. Uniquely identify transactions by bumping the producer epoch after every commit/abort marker. That way, each transaction can be identified by (producer id, epoch).
      2. Remove the addPartitionsToTxn call and implicitly add partitions to the transaction on the first produce request during a transaction.

      This avoids the late-arrival case because each transaction is uniquely identified and fenced, and it avoids the buggy-client case because the client no longer needs to explicitly add partitions to begin the transaction.
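
      A minimal sketch of how approaches 1 and 2 interact is shown below. It collapses the transaction coordinator and the partition leaders into one hypothetical `ToyCoordinator` class for brevity; the class, its methods, and `FencedError` are illustrative assumptions, not Kafka APIs or error codes.

```python
class FencedError(Exception):
    """Hypothetical error raised when a request carries a stale producer epoch."""


class ToyCoordinator:
    """Toy stand-in for the server side; real Kafka splits this across brokers."""

    def __init__(self):
        self.epoch = {}        # producer_id -> current epoch
        self.partitions = {}   # (producer_id, epoch) -> partitions in that transaction

    def init_producer(self, pid):
        self.epoch[pid] = 0
        return 0

    def produce(self, pid, epoch, partition):
        # Approach 1: (producer id, epoch) identifies exactly one transaction,
        # so a record carrying any other epoch belongs to a different one.
        if epoch != self.epoch[pid]:
            raise FencedError(f"epoch {epoch} is not current epoch {self.epoch[pid]}")
        # Approach 2: the first produce to a partition adds it implicitly.
        self.partitions.setdefault((pid, epoch), set()).add(partition)

    def end_txn(self, pid, epoch):
        if epoch != self.epoch[pid]:
            raise FencedError(f"epoch {epoch} is not current epoch {self.epoch[pid]}")
        self.partitions.pop((pid, epoch), None)
        self.epoch[pid] = epoch + 1   # bump after every commit/abort
        return self.epoch[pid]

    def partitions_in_txn(self, pid):
        return set(self.partitions.get((pid, self.epoch[pid]), set()))


coord = ToyCoordinator()
epoch = coord.init_producer(7)
coord.produce(7, epoch, "topic-0")            # implicitly adds topic-0
added_in_first_txn = coord.partitions_in_txn(7)
new_epoch = coord.end_txn(7, epoch)           # transaction ends, epoch is bumped

try:
    coord.produce(7, epoch, "topic-0")        # delayed record from the old transaction
    late_write_fenced = False
except FencedError:
    late_write_fenced = True

coord.produce(7, new_epoch, "topic-1")        # next transaction proceeds normally
added_in_second_txn = coord.partitions_in_txn(7)
```

      Because the delayed record still carries the old epoch, it is rejected rather than being absorbed into the next transaction.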

      Of course, 1 and 2 require client-side changes, so for older clients, those approaches won’t apply.

      3. To cover older clients, we will ensure a transaction is ongoing before we write transactional records to a partition. We can do this by querying the transaction coordinator and caching the result.
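
      The verification step for older clients might look roughly like the following sketch. The names `ToyPartitionLeader`, `NotInTransactionError`, and the `is_ongoing` callback (standing in for the request to the transaction coordinator) are assumptions made for illustration; the actual design is described in KIP-890.

```python
class NotInTransactionError(Exception):
    """Hypothetical error for a write that has no ongoing transaction behind it."""


class ToyPartitionLeader:
    """Toy partition leader that verifies transactions before appending."""

    def __init__(self, partition, is_ongoing):
        self.partition = partition
        self.is_ongoing = is_ongoing   # callable(pid, epoch, partition) -> bool
        self.verified = set()          # cached (pid, epoch) pairs already verified
        self.log = []
        self.coordinator_queries = 0

    def append(self, pid, epoch, value):
        key = (pid, epoch)
        if key not in self.verified:
            # Cache miss: ask the coordinator whether this partition is part
            # of an ongoing transaction for this producer.
            self.coordinator_queries += 1
            if not self.is_ongoing(pid, epoch, self.partition):
                raise NotInTransactionError(f"no ongoing transaction for {key}")
            self.verified.add(key)
        self.log.append((pid, epoch, value))

    def write_marker(self, pid, epoch, kind):
        # The marker ends the transaction, so the cached verification is dropped.
        self.verified.discard((pid, epoch))
        self.log.append((pid, epoch, kind))


ongoing = {(7, 0, "topic-0")}   # what the toy coordinator currently believes
leader = ToyPartitionLeader("topic-0", lambda pid, epoch, p: (pid, epoch, p) in ongoing)

leader.append(7, 0, "a")        # first write triggers one coordinator query
leader.append(7, 0, "b")        # second write is served from the cache
queries_during_txn = leader.coordinator_queries

ongoing.discard((7, 0, "topic-0"))   # coordinator aborts the transaction
leader.write_marker(7, 0, "abort")

try:
    leader.append(7, 0, "late")      # delayed record arrives after the abort
    late_write_rejected = False
except NotInTransactionError:
    late_write_rejected = True
log_length = len(leader.log)
```

      Caching keeps the extra coordinator round trip to roughly one per transaction per partition in this model, and clearing the cache when the marker is written ensures a late record is verified again instead of being trusted.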

       

      See KIP-890 for more information: https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense

        Issue Links

          1. Update AddPartitionsToTxn protocol to batch and handle verifyOnly requests (Sub-task, Resolved, Justine Olshan)
          2. Improve transactions experience for older clients by ensuring ongoing transaction (Sub-task, Resolved, Justine Olshan)
          3. Fix code that assumes transactional ID implies all records are transactional (Sub-task, Resolved, Justine Olshan)
          4. Include check transaction is still ongoing right before append (Sub-task, Resolved, Justine Olshan)
          5. AddPartitionsToTxnManager metrics (Sub-task, Resolved, Justine Olshan)
          6. Address timeouts and out of order sequences (Sub-task, Resolved, Justine Olshan)
          7. Verify transactional offset commits (KIP-890 part 1) (Sub-task, Resolved, Justine Olshan)
          8. Make verification a dynamic configuration (Sub-task, Resolved, Justine Olshan)
          9. Implement epoch bump after every transaction (Sub-task, Resolved, Justine Olshan)
          10. Remove AddPartitionsToTxn call for newer clients as optimization (Sub-task, Resolved, Calvin Liu)
          11. Refactor inter broker send thread to handle all interbroker requests on one thread (Sub-task, In Progress, Justine Olshan)
          12. Move AddPartitionsToTxnManager files to java (Sub-task, Open, Justine Olshan)
          13. Revisit Action Queue (Sub-task, Open, Justine Olshan)
          14. Try complete actions after callback (Sub-task, Resolved, Justine Olshan)
          15. Convert coordinator retriable errors to a known producer response error. (Sub-task, Resolved, Justine Olshan)
          16. Replace verification guard object with an specific type (Sub-task, Resolved, Justine Olshan)
          17. Address Transactions Errors (Sub-task, Resolved, Justine Olshan)
          18. Consider making transactional apis more compatible with topic IDs (Sub-task, Open, Justine Olshan)
          19. Do not advertise v4 AddPartitionsToTxn to clients (Sub-task, Open, Unassigned)
          20. Always schedule wrapped callbacks (Sub-task, Open, Unassigned)
          21. Ensure atomicity of in memory update and write when transactionally committing offsets (Sub-task, Resolved, Justine Olshan)
          22. Refactor ReplicaManager code for transaction verification (Sub-task, Resolved, Justine Olshan)
          23. Add the new ABORTABLE_ERROR (Sub-task, Resolved, Sanskar Jhajharia)
          24. Add Transactions V2 system tests and mark as production ready (Sub-task, In Progress, Justine Olshan)
          25. Seperate Epoch Bump Scenarios and Error Handling in TV2 (Sub-task, Resolved, Ritika Reddy)
          26. Convert INVALID_PRODUCER_ID_MAPPING from abortable error to fatal error (Sub-task, Resolved, Ritika Reddy)
          27. Reject the produce request with lower producer epoch early. (Sub-task, Open, Unassigned)
          28. Reject non-zero sequences when there is no producer ID state on the partition for transactions v2 idempotent producers (Sub-task, Open, Justine Olshan)
          29. Ensure v2 partitions are not added to last transaction during upgrade (Sub-task, Resolved, Justine Olshan)

            People

              Assignee: Justine Olshan (jolshan)
              Reporter: Justine Olshan (jolshan)
              Votes: 1
              Watchers: 8
