  Kafka / KAFKA-13174

Log Compaction Blocked Forever by Unstable Offset/Unclosed Transaction


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.1
    • Fix Version/s: None
    • Component/s: core, log cleaner
    • Labels: None

    Description

      Our production cluster experienced a single __consumer_offsets partition that was growing without ever being compacted. A closer inspection of the cleaner logs showed that the log was forever uncleanable at an offset from July 28, which had been written ~7 days previously:

      [2021-08-02 19:08:39,650] DEBUG Finding range of cleanable offsets for log=__consumer_offsets-9. Last clean offset=Some(75795702964) now=1627956519650 => firstDirtyOffset=75795702964 firstUncleanableOffset=75868740168 activeSegment.baseOffset=76691318694 (kafka.log.LogCleanerManager$)
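
      As we understand it, the cleaner never cleans past the log's first unstable offset, i.e. the start of the oldest open transaction, which is why firstUncleanableOffset is pinned here. A rough sketch of that bound with the numbers above (our own illustration, not Kafka source; the object and function names are ours):

      // Sketch (ours, not Kafka source): the cleaner's upper bound on the dirty range is
      // the minimum of several candidates, one of which is the log's first unstable offset.
      object CleanableBoundSketch {
        def firstUncleanableDirtyOffset(firstUnstableOffset: Option[Long],
                                        activeSegmentBaseOffset: Long): Long =
          Seq(firstUnstableOffset, Some(activeSegmentBaseOffset)).flatten.min

        def main(args: Array[String]): Unit =
          // With our numbers: the open transaction at 75868740168 pins the bound there,
          // even though the active segment starts much later at 76691318694.
          println(firstUncleanableDirtyOffset(Some(75868740168L), 76691318694L)) // 75868740168
      }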
      

      Using the log dumper tool, we were able to examine the records/batches around this offset and determined that the proximate cause was an "open" transaction that was never committed or aborted. We saw this:

      • a consumer group offset commit for group foo-group for topic-partition foo-topic-46 from pid 1686325 (log offset 75868731509)
      • a transactional COMMIT marker from pid 1686325 (log offset 75868731579)
      • another consumer group offset commit for group foo-group for topic-partition foo-topic-46 from pid 1686325 (log offset 75868740168, our first uncleanable offset)

      Here's the raw log dumper output:

      baseOffset: 75868731509 lastOffset: 75868731509 count: 1 baseSequence: 0 lastSequence: 0 producerId: 1686325 producerEpoch: 249 partitionLeaderEpoch: 320 isTransactional: true isControl: false position: 98725764 CreateTime: 1627495733656 size: 232 magic: 2 compresscodec: NONE crc: 485368943 isvalid: true
      | offset: 75868731509 CreateTime: 1627495733656 keysize: 126 valuesize: 36 sequence: 0 headerKeys: [] key: offset_commit::group=foo-group,partition=foo-topic-46 payload: offset=59016695,metadata=AQAAAXruS8Fg
      ...
      
      baseOffset: 75868731579 lastOffset: 75868731579 count: 1 baseSequence: -1 lastSequence: -1 producerId: 1686325 producerEpoch: 249 partitionLeaderEpoch: 320 isTransactional: true isControl: true position: 98732634 CreateTime: 1627495733700 size: 78 magic: 2 compresscodec: NONE crc: 785483064 isvalid: true
      | offset: 75868731579 CreateTime: 1627495733700 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 143
      ...
      
      baseOffset: 75868740168 lastOffset: 75868740168 count: 1 baseSequence: 0 lastSequence: 0 producerId: 1686325 producerEpoch: 249 partitionLeaderEpoch: 320 isTransactional: true isControl: false position: 99599843 CreateTime: 1627495737629 size: 232 magic: 2 compresscodec: NONE crc: 1222829008 isvalid: true
      | offset: 75868740168 CreateTime: 1627495737629 keysize: 126 valuesize: 36 sequence: 0 headerKeys: [] key: offset_commit::group=foo-group,partition=foo-topic-46 payload: offset=59016695,metadata=AQAAAXruS8Fg
      ...
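
      (For reference, the dump above is the output of Kafka's log dump tool, kafka-dump-log.sh, which wraps kafka.tools.DumpLogSegments. A sketch of an equivalent programmatic invocation follows; the segment path below is hypothetical, not our actual file name.)

      // Sketch: invoke DumpLogSegments (the class behind kafka-dump-log.sh) against a
      // __consumer_offsets segment. The file path is hypothetical.
      import kafka.tools.DumpLogSegments

      object DumpOffsetsSegment {
        def main(args: Array[String]): Unit =
          DumpLogSegments.main(Array(
            "--files", "/var/kafka-data/__consumer_offsets-9/00000000075868000000.log", // hypothetical path
            "--print-data-log",   // print record payloads, not just batch metadata
            "--offsets-decoder"   // decode __consumer_offsets commit keys/values
          ))
      }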
      

      There was no further activity from pid 1686325. In fact, the Kafka Streams application in question stalled on that partition because of this unstable offset/open transaction, logging:

      The following partitions still have unstable offsets which are not cleared on the broker side: [foo-topic-46], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log

      We then captured the producer snapshot file from the broker data directory and wrote a quick tool to dump it as text. From its output, we found that the transactional producer in question (pid 1686325) was still considered alive with its hanging transaction at 75868740168:
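
      A quick tool of that sort might look roughly like the sketch below (not our exact code). It assumes kafka.log.ProducerStateManager.readSnapshot(file) is reachable from a class placed in the kafka.log package; if the method is private in a given version, its snapshot-parsing logic would need to be copied out of ProducerStateManager.scala instead.

      // Sketch of a producer snapshot dumper (assumptions noted above).
      package kafka.log

      import java.io.File

      object ProducerSnapshotDumper {
        def main(args: Array[String]): Unit = {
          // args(0): path to a <offset>.snapshot file from the partition's data directory
          val snapshotFile = new File(args(0))
          ProducerStateManager.readSnapshot(snapshotFile).foreach { entry =>
            // ProducerStateEntry's toString includes producerId, producerEpoch,
            // currentTxnFirstOffset, coordinatorEpoch, lastTimestamp and batch metadata
            println(entry)
          }
        }
      }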

      ArraySeq(ProducerStateEntry(producerId=1686325, producerEpoch=249, currentTxnFirstOffset=Some(75868740168), coordinatorEpoch=143, lastTimestamp=1627495737629, batchMetadata=Queue(BatchMetadata(firstSeq=0, lastSeq=0, firstOffset=75868740168, lastOffset=75868740168, timestamp=1627495737629)))
      

      This was very perplexing. As far as we can tell, the code in both Apache Kafka 2.5.1 and in trunk essentially treats an open transaction like ours as uncleanable, which in practice blocks the log from ever being compacted again. Once a pid has an open transaction (a defined currentTxnFirstOffset), ProducerStateManager will never expire the producer, even after transactional.id.expiration.ms has passed. This has severe repercussions for a topic like __consumer_offsets: long coordinator load times and ever-growing disk usage.
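
      Concretely, this is the check we mean; a paraphrase of our reading of ProducerStateManager's expiration logic, simplified into a standalone function, not the verbatim source:

      // Paraphrase (ours, simplified) of the expiration check: a producer is only expired
      // if it has NO open transaction AND has been idle longer than maxProducerIdExpirationMs
      // (backed by transactional.id.expiration.ms on the broker).
      object ProducerExpirySketch {
        def isProducerExpired(currentTimeMs: Long,
                              currentTxnFirstOffset: Option[Long],
                              lastTimestamp: Long,
                              maxProducerIdExpirationMs: Long): Boolean =
          currentTxnFirstOffset.isEmpty &&
            currentTimeMs - lastTimestamp >= maxProducerIdExpirationMs

        // For pid 1686325, currentTxnFirstOffset = Some(75868740168L), so this can never
        // return true no matter how much time passes; the producer and its unstable offset
        // stick around, and the cleaner never advances past them.
      }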

      While we're not sure what led to this hanging transaction (note: we were running partition reassignments at the time, so perhaps the transaction coordinator moved), it seems entirely possible to end up with a transaction that is started but never committed or aborted, as in our example, for any number of reasons: network issues, broker deaths, and so on.

      Are we misunderstanding something here? How could a log like ours ever get compacted? If we're not misunderstanding, this looks like a potentially severe bug, and the log cleaner should be able to handle this situation. Perhaps the implementation of isProducerExpired should be an or-condition rather than an and-condition?

      Please let us know if we can provide any more details that would help shed light on what is going on here!


            People

              Assignee: Unassigned
              Reporter: Michael Jaschob (mjaschob@twilio.com)
              Votes: 1
              Watchers: 6
