Description
We noticed a fair amount of stress on our filesystem in an environment with a large number of topics but low message activity. After some investigation, we found that a short log.flush.interval combined with a large number of topics causes a lot of unnecessary disk activity, even when there are no messages to be written.
This activity occurs because FileChannel.force(true) is called on the underlying FileMessageSet for each log, even when there are no messages to be written. This call forces unproductive writes to the underlying filesystem.
The problem is most pronounced in an environment with a large number of low-activity topics for which low latency is still important. Here is the before-and-after output of `iostat -x 2` on a system with 1044 topics and a timed flush interval of 100ms. Note the reduction in %util and writes/second: the "before" output shows 40-80% util and ~260 writes/second, while the "after" output shows 10-15% util and ~65 writes/second.
Pre-patch output: https://raw.github.com/gist/54d0f4c62753a6e2de1f/7ee1982bfa8e5c088bcf9ba953f01956443bd31e/iostat-pre-kafka-patch.txt
Post-patch output: https://raw.github.com/gist/54d0f4c62753a6e2de1f/b939973c7fed642480856d9bdeb2e4cb0ada445b/iostat-post-kafka-patch.txt
The proposed patch (see attached) skips calling the underlying FileMessageSet flush operation if the log's atomic counter indicates that there are no messages to be written.
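For readers without the attachment, here is a minimal sketch of the guard described above. It is not the attached patch and not Kafka's actual Log or FileMessageSet code: the class GuardedFlushLog and its methods append, flush, and forceCount are hypothetical names chosen for illustration. Only FileChannel.force(true) and the idea of an atomic counter of unflushed messages come from the description.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a per-topic log; NOT the real Kafka class.
class GuardedFlushLog implements AutoCloseable {
    private final FileChannel channel;
    // Messages appended since the last flush (the "atomic counter").
    private final AtomicInteger unflushed = new AtomicInteger(0);
    // Counts real force() calls so the effect of the guard can be observed.
    private final AtomicInteger forceCalls = new AtomicInteger(0);

    GuardedFlushLog(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    void append(byte[] message) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(message);
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        // Increment only after the bytes have been handed to the channel.
        unflushed.incrementAndGet();
    }

    // Intended to be called by the timed flusher.
    // Returns true only if the filesystem was actually forced.
    boolean flush() throws IOException {
        int pending = unflushed.get();
        if (pending == 0) {
            return false; // nothing to write: skip the force entirely
        }
        channel.force(true);
        forceCalls.incrementAndGet();
        // Subtract rather than reset, so messages appended while the
        // force was in progress remain counted for the next flush.
        unflushed.addAndGet(-pending);
        return true;
    }

    int forceCount() {
        return forceCalls.get();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

With a guard like this, a timed flusher that fires every 100ms on an idle log returns immediately instead of forcing the file, so the cost of a flush tick scales with the number of logs that actually received messages rather than with the total number of topics.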