Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.9.0.0
-
None
Description
I have a 3 node kafka cluster (N1,N2 and N3) with log.flush.interval.messages=1, min.insync.replicas=3 and unclean.leader.election.enable=false and a single Zookeeper node. My workload inserts few messages and on completion of the workload, the recovery-point-offset-checkpoint reflects the latest offset of the messages committed.
I have a small testing tool that drives distributed applications into corner cases by simulating possible error conditions like EIO, ENOSPC and EDQUOT that can be encountered in all modern file systems such as ext4. The tool also simulates on-disk silent data corruption.
When I introduce silent data corruption in a node (say N1) in the ISR, Kafka is able to detect corruption using checksum and ignores the log entry from that point onwards. Even though N1 has lost log entries and recovery-point-offset-checkpoint file in N1 indicates the latest offsets, N1 is allowed to become the leader because it is in the ISR. Also, the other nodes N2 and N3 crash with the following log message:
FATAL [ReplicaFetcherThread-0-1], Halting because log truncation is not allowed for topic my-topic1, Current leader 1's latest offset 0 is less than replica 3's latest offset 1 (kafka.server.ReplicaFetcherThread)
The end result is that a silent data corruption leads to data loss because querying the cluster returns only messages before the corrupted entry. Note that the cluster at this point has only N1. This situation could have been avoided if the node N1 which had to ignore the log entry wasn't allowed to become the leader. This scenario wouldn't happen in a majority based leader election as other nodes (N2 or N3) would have denied vote for N1 because N1's log is not complete compared to N2 or N3's log.
If this scenario happens in any of the followers, it ignores the log entry and copies data from the leader and there is no data loss.
Encountering an EIO thrown by the file system for a particular block results in the same consequence of data loss on querying the cluster and the remaining two nodes crash. An EIO on read could be thrown for a variety of reasons including a latent sector error of one or more sectors on disk.
Attachments
Issue Links
- relates to
-
KAFKA-4873 Investigate issues uncovered by CORDS
- Open