Hi, I have a strange problem with our UAT cluster of 3 Kafka brokers.
The __consumer_offsets topic was replicated to only two of the brokers, and our disks ran full due to a misconfigured log cleaner. We fixed the configuration and upgraded from 0.10.1.1 to 0.10.2.1.
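For context, the fix looked roughly like this (standard broker properties in server.properties; the exact values here are from memory and may not match our final config):

# server.properties - make sure the cleaner actually runs,
# so that compacted topics like __consumer_offsets get cleaned
log.cleaner.enable=true
log.cleaner.threads=1
log.cleaner.dedupe.buffer.size=134217728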
Today I increased the replication factor of the __consumer_offsets topic to 3 and triggered replication to the third broker via kafka-reassign-partitions.sh.
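The reassignment was done roughly like this (only one partition shown; the real JSON file lists all 50 partitions, and the broker IDs and ZooKeeper host are just placeholders):

$ cat increase-replication.json
{"version":1,
 "partitions":[
   {"topic":"__consumer_offsets","partition":0,"replicas":[1,2,3]}
 ]}
$ bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file increase-replication.json --execute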
That went well, but now I am getting many errors like:
I think these are due to the full-disk event.
The log cleaner threads died on these corrupt messages:
Looking at the files, I see that some are truncated and some are just empty:
$ ls -lsh 00000000000000594653.log
100M -rw-r--r-- 1 user user 100M Jun 12 11:00 00000000000000594653.log
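In case it helps to reproduce this, the suspect segments can be inspected with the stock DumpLogSegments tool (the file path here is just the one from our broker's data directory):

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration \
    --files 00000000000000594653.log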
Sadly, I no longer have the logs from the disk-full event itself.
I have three questions:
- What is the best way to clean this up? Would deleting the old log files and restarting the brokers be enough?
- Why did Kafka not handle the disk-full event gracefully? Does this only affect the cleanup, or could we also lose data?
- Could this be caused by the combination of the upgrade and the full disk?
And last but not least: keep up the good work! Kafka performs really well, is easy to administer, and has good documentation.