Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.10.2.1
-
None
Description
Environment
OS: CentOS6
Kernel version: 2.6.32-XX
Kafka version: 0.10.2.1, 0.11.1.2 (but reproduces with latest build from trunk (2.2.0-SNAPSHOT)
Phenomenon
Response time of Produce request (99th ~ 99.9th %ile) degrading to 50x ~ 100x more than usual.
Normally 99th %ile is lower than 20ms, but when this issue occurs it marks 50ms to 200ms.
At the same time we could see two more things in metrics:
1. Disk read coincidence from the volume assigned to log.dirs.
2. Raise in network threads utilization (by `kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`)
As we didn't see increase of requests in metrics, we suspected blocking in event loop ran by network thread as the cause of raising network thread utilization.
Reading through Kafka broker source code, we understand that the only disk IO performed in network thread is reading log data through calling sendfile(2) (via FileChannel#transferTo).
To probe that the calls of sendfile(2) are blocking network thread for some moments, I ran following SystemTap script to inspect duration of sendfile syscalls.
# Systemtap script to measure syscall duration
global s
global records
probe syscall.$1 {
s[tid()] = gettimeofday_us()
}
probe syscall.$1.return {
elapsed = gettimeofday_us() - s[tid()]
delete s[tid()]
records <<< elapsed
}
probe end {
print(@hist_log(records))
}
$ stap -v syscall-duration.stp sendfile # value (us) value |---------------------------------------- count 0 | 0 1 | 71 2 |@@@ 6171 16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29472 32 |@@@ 3418 2048 | 0 ... 8192 | 3
As you can see there were some cases taking more than few milliseconds, implies that it blocks network thread for that long and applying the same latency for all other request/response processing.
Hypothesis
Gathering the above observations, I made the following hypothesis.
Let's say network-thread-1 multiplexing 3 connections.
- producer-A
- follower-B (broker replica fetch)
- consumer-C
Broker receives requests from each of those clients, [Produce, FetchFollower, FetchConsumer].
They are processed well by request handler threads, and now the response queue of the network-thread contains 3 responses in following order: [FetchConsumer, Produce, FetchFollower].
network-thread-1 takes 3 responses and processes them sequentially (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/SocketServer.scala#L632).
Ideally processing of these 3 responses completes in microseconds as in it just copies ready responses into client socket's buffer with non-blocking manner.
However, Kafka uses sendfile(2) for transferring log data to client sockets. The target data might be in page cache, but old data which has written a bit far ago and never read since then, are likely not.
If the target data isn't in page cache, kernel first needs to load the target page into cache. This takes more than few milliseconds to tens of milliseconds depending on disk hardware and current load being applied to it.
Linux kernel doesn't considers the moment loading data from disk into page cache as "blocked", hence it awaits completion of target data loading rather than returning EAGAIN.
Thus, if this occurs when processing the first FetchConsumer request, the latency from reading disk applies to all following responses as well. Not just those which has been received by network thread but also those in response queue of that network thread.
Experiment
For reproducing this issue, I did experiment on our test cluster.
Kafka cluster consist of 3 broker machines with following spec:
CPU: Intel(R) Xeon(R) 2.20GHz x 20 cores (HT) * 2
Memory: 256GiB
Network: 10Gbps
Disk: HDD x 12 RAID 1+0
Essential broker configs are below:
num.network.threads=10 num.io.threads=30 num.replica.fetchers=3
And created two topics both with replicas=3, retention.ms=2days and min.insync.replicas=2:
- large-topic: partitions=60
- regular-topic: partitions=30
Ran producer which produces 1kb data into large-topic and accumulated to have around 3GiB data for each partition, 60GiB data in total for each broker and stop the producer.
To simulate the situation that old data dropped from page cache, run echo 1 > /proc/sys/vm/drop_caches in each broker machine.
Start another producer which produces 512 bytes data into topic regular-topic.
Throughput is 56k messages/sec. 99.9th %ile response time for Produce is very good:
Then start consumer which reads from head of large-topic which is already not in page cache.
Response time of Produce degrade and throughput decreases:
Disk read is occurring and network threads got busier:
Solution
For fixing this issue, we've considered several ideas, but finally decided to take following approach.
The essential problem is blocking in the middle of event loop. If it does something which is not purely computation (like blocking IO), the latency from it directly applies for other irrelevant requests, leaving CPU cores stale which can do work for other requests while awaiting completion of disk IO.
To fix this, there should be two possible ways:
A. Make call of sendfile(2) not to wait even on waiting pages loaded from disks to memory.
B. Make sure that data is always available in memory when the response object passed to network-thread.
For A. Make call of sendfile(2) not to wait even on waiting pages loaded from disks to memory, we must have to have support from kernel.
Actually FreeBSD recently added such, nonblocking style sendfile system call cooperating with community: https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/
It might be useful for case like this, but this time it wasn't an option because:
1. Linux doesn't have such support in syscall
2. Even with kernel support, we can't use it until JDK supports it. Otherwise we have to go towards introducing JNI code.
So we sought approach B and tried some ways.
The basic idea is to read the target data once in request handler thread so the page cache of target data gets populated and stays for a while until it gets transferred to client socket through sendfile(2) called from network thread.
The easiest way is to do read(2) on target data. However this is risky from the overhead point of view as in it would leads overhead of copying data from kernel to userland (which is what Kafka trying to avoid leveraging sendfile(2)). Also allocated buffer for read()ing data would cause heap pressure unless we implement it in optimal way (perhaps using off-heap).
While we tried several other ways (like calling readahead(2) syscall with checking completion of load calling mincore(2) in loop through JNI), we finally turned out that calling sendfile for target data with setting destination to /dev/null does exactly what we want.
When calling it setting the destination fd which is pointing /dev/null, linux kernel loads data from disk into page cache, but skips (not exactly the whole part) copying loaded data to the destination device.
To implement this, we just need Java code as in we can expect FileChannel#transferTo calls sendfile internally.
We patched our kafka broker to do this warmup, and it perfectly fixed this problem. The patched version is based on 0.10.2.1 but I'm now testing the same patch on latest build from trunk(905f8135078127e08633400277c5829b10971d42) and seems that it works effectively for now.
To avoid additional overhead calling one more syscall (sendfile for /dev/null) when processing all Fetch requests, I added bit optimization that is skipping the last log segment of topic partitions assuming it contains only newly written data which are still hot and has its data in page cache.
By this optimization, since most of Fetch requests are requesting data of the tail of topic-partition in our case, just few of Fetch requests less than 1% are the subject of this warmup operation. Hence we haven't seen any overhead in CPU utilization since we deployed this patch to our production system.
To test its effect in the last experiment scenario, I deployed our patch to experiment cluster and tested again.
When starting catch up read consumer, network thread got busier again, but utilization was bit lower. And per-client request-time metric showed that this time producer (normal-client) didn't affected:
The same amount of disk read was occurring.
However there was no Produce time degradation and throughput was stable:
Suggestion
The performance degradation explained in this issue can happen very easy in any Kafka deployments. Either one of following clients can cause Fetch for old data, and almost all clients connecting to the same network thread which processes that Fetch can be affected.
- Consumers delaying and processing offset far behind the latest offset.
- Consumers which starts periodically and consumes accumulated topic data at once.
- Follower broker which is attempting to restore its replica from current leaders.
This patch can fix performance degradation in this scenario. However we still have one concern. While the whole patch completes in pure Java code, its semantics heavily assumes underlying implementation in kernel. We haven't tested this in any OSes others than linux, so not sure if this effects well in other platforms as well.
Hence I think its better to add configuration which controls behavior of kafka broker wether to allow to do this or not.