[KAFKA-9826] Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.1
Fix Version/s: 2.6.0, 2.4.2, 2.5.1
Component/s: log cleaner
Labels:
None

Description

Seen on a system where a given partition had a single segment, and for whatever reason (deleteRecords?), the logStartOffset was greater than the base segment of the log, there were a continuous series of

```

[2020-03-03 16:56:31,374] WARN Resetting first dirty offset of FOO-3 to log start offset 55649 since the checkpointed offset 0 is invalid. (kafka.log.LogCleanerManager$)

```

messages (partition name changed, it wasn't really FOO). This was expected to be resolved by ~~KAFKA-6266~~ but clearly wasn't.

Further investigation revealed that a few segments were continuously cleaning and generating messages in the `log-cleaner.log` of the form:

```

[2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 (kafka.log.LogCleaner)

[2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... (kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. (kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 13:39:08 EST 2019)... (kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])

0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).

Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)

Buffer utilization: 0.0%

Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)

Start size: 0.0 MB (0 messages)

End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) (kafka.log.LogCleaner)

```

What seems to have happened here (data determined for a different partition) is:

There exist a number of partitions here which get relatively low traffic, including our friend FOO-5. For whatever reason, LogStartOffset of this partition has moved beyond the baseOffset of the active segment. (Notes in other issues indicate that this is a reasonable scenario.) So there’s one segment, starting at 166266, and a log start of 166301.

grabFilthiestCompactedLog runs and reads the checkpoint file. We see that this topicpartition needs to be cleaned, and call cleanableOffsets on it which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 116301, and forceCheckpoint = true.

The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix for ~~KAFKA-6266~~). We then create a LogToClean object based on the firstDirtyOffset and firstUncleanableOffset of 166301 (past the active segment’s base offset).

The LogToClean object has cleanBytes = logSegments(-1, firstDirtyOffset).map(_.size).sum → the size of this segment. It has firstUncleanableOffset and cleanableBytes determined by calculateCleanableBytes. calculateCleanableBytes returns:
{{}}
val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
val firstUncleanableOffset = firstUncleanableSegment.baseOffset
{{val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum

(firstUncleanableOffset, cleanableBytes)}}
firstUncleanableSegment is activeSegment. firstUncleanableOffset is the base offset, 166266. cleanableBytes is looking for logSegments(166301, max(166301, 166266) → which is the active segment

So there are “cleanableBytes” > 0.

We then filter out segments with totalbytes (clean + cleanable) > 0. This segment has totalBytes > 0, and it has cleanablebytes, so great! It’s filthiest.

The cleaner picks it, calls cleanLog on it, which then does cleaner.clean, which returns nextDirtyOffset and cleaner stats. cleaner.clean callls doClean, which builds an offsetMap. The offsetMap looks at non-active segments, when building, but there aren’t any. So the endOffset of the offsetMap is lastOffset (default -1) + 1 → 0. We record the stats (including logging to log-cleaner.log). After this we call cleanerManager.doneCleaning, which writes the checkpoint file with the latest value… of 0.

And then the process starts all over.

It appears that there's at least one bug here, where `log.logSegments(from, to)` will return an empty list if from == to and both are segment-aligned, but not if they are in the middle of a segment, and possibly that LogToClean does start=firstDirtyOffset, end = max(firstDirtyOffset, firstUncleanableOffset) – it can move the firstUncleanableOffset even when the firstDirtyOffset is past firstUncleanable.

Attachments

Issue Links

links to

GitHub Pull Request #8469

GitHub Pull Request #8542

GitHub Pull Request #8543

Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment

Details

Description

Attachments

Issue Links

Activity

People

Dates