KAFKA-13727: Edge case in cleaner can result in premature removal of ABORT marker


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.8.2, 3.1.1, 3.0.2

    Description

      The log cleaner works by first building a map of the active keys beginning from the dirty offset, and then scanning forward from the beginning of the log to decide which records should be retained based on whether they are included in the map. The map of keys has a limited size; as soon as it fills up, we stop building it. The offset corresponding to the last record that was included in the map becomes the next dirty offset. Then when we are cleaning, we stop scanning forward at the dirty offset. Or, to be more precise, we continue scanning until the end of the segment which includes the dirty offset, but all records above that offset are copied as-is without checking the map of active keys.
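      As a rough illustration (a simplified model, not Kafka's actual code, which bounds the map by memory via `log.cleaner.dedupe.buffer.size` rather than by key count), the bounded key map and dirty-offset advance might look like this, assuming records are `(offset, key)` pairs with `key=None` standing in for control markers:

```python
def build_offset_map(records, dirty_offset, max_keys):
    """Map each key to its latest offset, starting at dirty_offset.

    Stops as soon as a new key no longer fits; returns the map and the
    next dirty offset (one past the last record that was included).
    """
    offset_map = {}
    next_dirty = dirty_offset
    for offset, key in records:
        if offset < dirty_offset:
            continue
        if key is None:                 # control marker: no key to track
            next_dirty = offset + 1
            continue
        if key not in offset_map and len(offset_map) >= max_keys:
            break                       # map is full; stop building
        offset_map[key] = offset
        next_dirty = offset + 1
    return offset_map, next_dirty

# The example log from the description below, with COMMIT/ABORT as key=None.
log = [(0, "a"), (1, "b"), (2, None), (3, "c"),
       (4, "d"), (5, None), (6, "b"), (7, None)]

# With room for only two keys, the map holds {a, b} and the
# dirty offset advances to 3.
print(build_offset_map(log, 0, 2))  # → ({'a': 0, 'b': 1}, 3)
```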

      Compaction is complicated by the presence of transactions. The cleaner must keep track of which transactions have data remaining so that it can tell when it is safe to remove the respective markers. It works a bit like the consumer. Before scanning a segment, the cleaner consults the aborted transaction index to figure out which transactions have been aborted. All other transactions are considered committed.
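      To make the consumer analogy concrete, here is a hedged sketch (the names are invented for illustration, not Kafka's API) of the filtering rule: a record is treated as aborted when it falls inside an aborted transaction from the same producer.

```python
from collections import namedtuple

# Simplified stand-in for an entry in the aborted transaction index.
AbortedTxn = namedtuple("AbortedTxn", ["producer_id", "first_offset", "last_offset"])

def is_aborted(offset, producer_id, aborted_txns):
    """True if the record at `offset` belongs to an aborted transaction
    from the same producer; anything else is considered committed."""
    return any(t.producer_id == producer_id
               and t.first_offset <= offset <= t.last_offset
               for t in aborted_txns)

aborted = [AbortedTxn(producer_id=42, first_offset=6, last_offset=7)]
print(is_aborted(6, 42, aborted))  # → True  (inside the aborted range)
print(is_aborted(3, 42, aborted))  # → False (committed data)
```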

      The problem we have found is that the cleaner does not take into account the range of offsets between the dirty offset and the end offset of the segment containing it when querying ahead for aborted transactions. This means that when the cleaner is scanning forward from the dirty offset, it does not have the complete set of aborted transactions. The main consequence of this is that abort markers associated with transactions which start within this range of offsets become eligible for deletion even before the corresponding data has been removed from the log.
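      Under the same invented model, the faulty query range can be sketched like this: if the cleaner asks the aborted-transaction index only for the range up to the dirty offset, rather than through the end of the segment containing it, transactions that start in between are missed.

```python
from collections import namedtuple

AbortedTxn = namedtuple("AbortedTxn", ["producer_id", "first_offset", "last_offset"])

def aborted_txns_in_range(index, start_offset, end_offset):
    """Aborted transactions overlapping [start_offset, end_offset)."""
    return [t for t in index
            if t.first_offset < end_offset and t.last_offset >= start_offset]

index = [AbortedTxn(42, 6, 7)]
dirty_offset, segment_end = 3, 8

# Buggy range: stops at the dirty offset and misses the transaction at 6-7.
missed = aborted_txns_in_range(index, 0, dirty_offset)   # → []
# Correct range: covers through the end of the dirty offset's segment.
found = aborted_txns_in_range(index, 0, segment_end)     # → [AbortedTxn(42, 6, 7)]
```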

      Here is an example. Suppose that the log contains the following entries:

      offset=0, key=a
      offset=1, key=b
      offset=2, COMMIT
      offset=3, key=c
      offset=4, key=d
      offset=5, COMMIT
      offset=6, key=b
      offset=7, ABORT

      Suppose we have an offset map which can only contain 2 keys and the dirty offset starts at 0. The first time we scan forward, we will build a map with keys a and b, which will allow us to move the dirty offset up to 3. Due to the issue documented here, we will not detect the aborted transaction starting at offset 6. But it will not be eligible for deletion on this round of cleaning because it is bound by `delete.retention.ms`. Instead, our new logic will set the deletion horizon for this batch to the current time plus the configured `delete.retention.ms`.

      offset=0, key=a
      offset=1, key=b
      offset=2, COMMIT
      offset=3, key=c
      offset=4, key=d
      offset=5, COMMIT
      offset=6, key=b
      offset=7, ABORT (deleteHorizon: N)
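      The horizon mechanics described above can be sketched as follows (a simplified model; in Kafka the horizon is stamped in the batch header, and `delete.retention.ms` defaults to 24 hours):

```python
DELETE_RETENTION_MS = 24 * 60 * 60 * 1000  # delete.retention.ms default

def clean_marker(delete_horizon, now_ms, txn_has_data):
    """What one cleaning pass does with a transaction marker.

    The first pass stamps a deletion horizon; a later pass removes the
    marker only once the horizon has passed and the transaction appears
    empty. The bug makes an undetected aborted transaction look empty
    even though its data (key=b at offset 6 above) is still in the log.
    """
    if delete_horizon is None:
        return "retain", now_ms + DELETE_RETENTION_MS   # stamp horizon N
    if not txn_has_data and now_ms > delete_horizon:
        return "remove", None
    return "retain", delete_horizon

# First cleaning at time t: the marker is retained and stamped.
action, horizon = clean_marker(None, 1_000, txn_has_data=False)
print(action)  # → retain
# A later cleaning after the horizon: the "empty" marker is removed.
print(clean_marker(horizon, horizon + 1, txn_has_data=False))  # → ('remove', None)
```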

      Suppose that the time reaches N+1 before the next cleaning. We will begin from the dirty offset of 3 and collect keys c and d before stopping at offset 6. Again, we will not detect the aborted transaction beginning at offset 6 since it is out of the range. This time when we scan, the marker at offset 7 will be deleted because the transaction will be seen as empty and now the deletion horizon has passed. So we end up with this state:

      offset=0, key=a
      offset=1, key=b
      offset=2, COMMIT
      offset=3, key=c
      offset=4, key=d
      offset=5, COMMIT
      offset=6, key=b

      Effectively it becomes a hanging transaction. The interesting thing is that we might not even detect it. As far as the leader is concerned, it had already completed that transaction, so it is not expecting any additional markers. The transaction index would have been rewritten without the aborted transaction when the log was cleaned, so any consumer fetching the data would see the transaction as committed. On the other hand, if we did a reassignment to a new replica, or if we had to rebuild the full log state during recovery, then we would suddenly detect it.

      I am not sure how likely this scenario is in practice. I think it is fair to say it is an extremely rare case: the cleaner has to fail to clean a full segment at least twice, and enough time still needs to pass for the marker's deletion horizon to be reached. Perhaps it is possible if the cardinality of keys is very high and the configured memory limit for the cleaner is low.

            People

              Assignee: Jason Gustafson (hachikuji)
              Reporter: Jason Gustafson (hachikuji)
