Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.10.2.0, 0.10.2.1
- Fix Version/s: None
- Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
Description
Some of our users are seeing unintuitive/unexpected behavior with log-compacted topics: they receive multiple records for the same key when consuming. This happens because throughput on those topics is low enough that the compaction trigger condition (min.cleanable.dirty.ratio = 0.5 by default) is never met, so the cleaner doesn't kick in.
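One consequence for clients: until the cleaner has run, reading a compacted topic from the beginning can legitimately yield several records for the same key, so a consumer has to keep only the latest value per key itself. A minimal sketch of that, assuming a String-keyed topic (bootstrap address, group id, and topic name are placeholders):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CompactedTopicReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "compacted-reader");        // placeholder
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            Map<String, String> latest = new HashMap<>();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("compacted-topic")); // placeholder
                boolean caughtUp = false;
                while (!caughtUp) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    caughtUp = records.isEmpty(); // crude end-of-log heuristic, fine for a sketch
                    for (ConsumerRecord<String, String> record : records) {
                        if (record.value() == null) {
                            latest.remove(record.key()); // tombstone: key was deleted
                        } else {
                            latest.put(record.key(), record.value()); // newest value wins
                        }
                    }
                }
            }
            System.out.println("distinct keys after client-side dedup: " + latest.size());
        }
    }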
This prompted us to test and tune min.cleanable.dirty.ratio in our clusters. It appears that more aggressive log compaction ratios don't have negative effects on CPU and memory utilization. If that holds up, we should consider changing the default from 0.5 to something more aggressive.
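For context, the cleaner's trigger boils down to comparing each log's dirty ratio (bytes appended since the last clean over total log bytes) against min.cleanable.dirty.ratio. The actual logic lives in the broker's Scala LogCleaner/LogCleanerManager classes; the gist is roughly this (names here are ours, purely illustrative):

    /**
     * Illustrative sketch only: a compacted log becomes a cleaning
     * candidate once the fraction of "dirty" bytes (appended since the
     * last clean) reaches min.cleanable.dirty.ratio.
     */
    class DirtyRatioCheck {
        static boolean isCleanable(long cleanBytes, long dirtyBytes,
                                   double minCleanableDirtyRatio) {
            long totalBytes = cleanBytes + dirtyBytes;
            if (totalBytes == 0) {
                return false; // empty log: nothing to clean
            }
            double dirtyRatio = (double) dirtyBytes / totalBytes;
            // With the 0.5 default, half the log must be uncleaned before
            // compaction starts; on a low-throughput topic that can take a
            // very long time, which produces the duplicates described above.
            return dirtyRatio >= minCleanableDirtyRatio;
        }
    }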
Setup:
- 8 brokers
- 5 zk nodes
- 32 partitions on a topic
- replication factor 3
- log roll 3 hours
- log segment bytes 1 GB
- log retention 24 hours
- three key distributions, each exercised separately (see the producer sketch after this list): all messages to a single key, all messages to unique keys, and all messages to a bounded key range [0, 999]
- min.cleanable.dirty.ratio per topic = 0, 0.5, and 1
- 200 MB/s sustained produce and consume traffic
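To make the three key distributions concrete, here is a rough sketch of the produce loop (topic name, bootstrap address, payload, and message count are placeholders, not our exact harness):

    import java.util.Properties;
    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyDistributionLoad {
        enum Mode { SINGLE_KEY, UNIQUE_KEYS, BOUNDED_RANGE }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            Mode mode = Mode.BOUNDED_RANGE;   // pick one scenario per run
            String topic = "compaction-test"; // placeholder topic name
            String payload = "payload";       // placeholder message body

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (long seq = 0; seq < 1_000_000; seq++) {
                    String key;
                    switch (mode) {
                        case SINGLE_KEY:
                            key = "k";                // every record shares one key
                            break;
                        case UNIQUE_KEYS:
                            key = Long.toString(seq); // key never repeats
                            break;
                        default:
                            key = Integer.toString(   // bounded key range [0, 999]
                                ThreadLocalRandom.current().nextInt(1000));
                    }
                    producer.send(new ProducerRecord<>(topic, key, payload));
                }
            }
        }
    }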
Observations:
We were able to verify the log cleaner threads were performing work by checking the broker logs and the cleaner-offset-checkpoint file for all topics. We also observed that the log cleaner's time-since-last-run-ms metric was normal, never going above the 15-second default cleaner backoff (log.cleaner.backoff.ms).
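For anyone wanting to reproduce that check, the gauge can be read over JMX. A sketch follows, assuming the broker exposes JMX on port 9999 and registers the gauge as kafka.log:type=LogCleanerManager,name=time-since-last-run-ms; verify the exact MBean name against your broker version with a JMX browser:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CleanerLagCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint: a broker started with JMX_PORT=9999.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Assumed MBean name for the gauge mentioned above.
                ObjectName gauge = new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
                Object millisSinceLastRun = conn.getAttribute(gauge, "Value");
                System.out.println("time-since-last-run-ms = " + millisSinceLastRun);
            }
        }
    }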
Under-replicated partition counts stayed steady, as did replication lag.
Here's an example test run where we try out min.cleanable.dirty.ratio = 0, then 1, then 0.5. Troughs between the peaks represent zero traffic while the topics were being reconfigured.
(200mbs-dirty-0-dirty1-dirty05.png attached)
Memory utilization is fine and, more interestingly, CPU utilization shows little difference across the ratios.
For more detail, here is a flame graph (raw SVG attached) of the run with min.cleanable.dirty.ratio = 0. The flame graphs for the conservative and default ratios are equivalent.
(flame-graph-200mbs-dirty0.png attached)
Notice that the majority of CPU is coming from:
- SSL operations (on reads/writes)
- KafkaApis::handleFetchRequest (ReplicaManager::fetchMessages)
- KafkaApis::handleOffsetFetchRequest
We also have examples from small-scale test runs that show similar behavior, just with proportionally lower CPU usage.
It seems counterintuitive that there's no apparent CPU difference between aggressive and conservative compaction ratios, so we'd like some thoughts from the community.
We're looking for feedback on whether anyone else has experienced this behavior, or, if CPU genuinely isn't affected, whether anyone has seen something related.
If this holds up, we'd be happy to discuss further and provide a patch.