Kafka / KAFKA-1670

Corrupt log files for segment.bytes values close to Int.MaxInt


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.1.1
    • Fix Version/s: 0.8.2.0
    • Component/s: None
    • Labels: None

    Description

      The maximum value for the topic-level config segment.bytes is Int.MaxValue (2147483647). Using this value causes brokers to corrupt their log files, leaving them unreadable.

      We set segment.bytes to 2122317824, which is well below the maximum. One by one, the ISR of every partition shrank to 1. Brokers would crash when restarted, attempting to read from a negative offset in a log file. After discovering that many segment files had grown to 4GB or more, we were forced to shut down our entire production Kafka cluster for several hours while we split all segment files into 1GB chunks.

      In the kafka.log code, the segment.bytes parameter is used inconsistently. It is treated as a soft maximum for the size of the segment file (https://github.com/apache/kafka/blob/0.8.1.1/core/src/main/scala/kafka/log/LogConfig.scala#L26), with logs rolled only after they exceed this value (https://github.com/apache/kafka/blob/0.8.1.1/core/src/main/scala/kafka/log/Log.scala#L246). However, much of the code that deals with log files uses signed 32-bit Ints to store the size of a file and the position within it. Overflowing these Ints leads the broker to append to a segment indefinitely, and to fail to read the segment back for consuming or recovery.
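
      To make the failure mode concrete, here is a minimal Scala sketch of the arithmetic (the names are ours, not Kafka's actual internals): once a segment's size counter, stored as a signed 32-bit Int, passes Int.MaxValue, it wraps negative, and a size-based roll check can never fire again.

      object SegmentOverflowSketch {
        def main(args: Array[String]): Unit = {
          val segmentBytes: Int = 2147483647     // segment.bytes = Int.MaxValue
          var size: Int = Int.MaxValue - 100     // segment is just under the limit
          val messageSize: Int = 1000

          size += messageSize                    // 32-bit overflow: size wraps negative
          println(s"size after append: $size")   // prints -2147482749
          println(s"should roll: ${size > segmentBytes}") // false: the broker never rolls
        }
      }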

      This is trivial to reproduce:

      $ bin/kafka-topics.sh --topic segment-bytes-test --create --replication-factor 2 --partitions 1 --zookeeper zkhost:2181
      $ bin/kafka-topics.sh --topic segment-bytes-test --alter --config segment.bytes=2147483647 --zookeeper zkhost:2181
      $ yes "Int.MaxValue is a ridiculous bound on file size in 2014" | bin/kafka-console-producer.sh --broker-list localhost:6667 --topic segment-bytes-test
      

      After running for a few minutes, the log file is corrupt:

      $ ls -lh data/segment-bytes-test-0/
      total 9.7G
      -rw-r--r-- 1 root root  10M Oct  3 19:39 00000000000000000000.index
      -rw-r--r-- 1 root root 9.7G Oct  3 19:39 00000000000000000000.log
      

      We recovered the data from the log files using a simple Python script: https://gist.github.com/also/9f823d9eb9dc0a410796
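
      For context, here is a rough Scala sketch of the same recovery idea (this is not the linked gist; it assumes the 0.8-era on-disk entry layout of an 8-byte offset, a 4-byte message size, then the message bytes). The key point is that the file position is tracked as a Long, so entries past the 2GB mark remain reachable:

      import java.io.RandomAccessFile

      object SegmentWalker {
        def main(args: Array[String]): Unit = {
          val raf = new RandomAccessFile(args(0), "r")
          try {
            var pos: Long = 0L                // Long position: safe past 2GB
            val end: Long = raf.length()
            while (pos + 12 <= end) {
              raf.seek(pos)
              val offset = raf.readLong()     // 8-byte logical offset (big-endian)
              val size = raf.readInt()        // 4-byte message size
              if (size <= 0 || pos + 12 + size > end) {
                println(s"truncated or invalid entry at position $pos")
                return
              }
              println(s"offset=$offset size=$size position=$pos")
              pos += 12 + size                // skip over the message bytes
            }
          } finally raf.close()
        }
      }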

      Attachments

        1. KAFKA-1670.patch
          2 kB
          Harsha
        2. KAFKA-1670.patch
          80 kB
          Harsha
        3. KAFKA-1670_2014-10-07_18:39:31.patch
          22 kB
          Harsha
        4. KAFKA-1670_2014-10-07_13:49:10.patch
          17 kB
          Harsha
        5. KAFKA-1670_2014-10-07_13:39:13.patch
          16 kB
          Harsha
        6. KAFKA-1670_2014-10-06_09:48:25.patch
          3 kB
          Harsha
        7. KAFKA-1670_2014-10-04_20:17:46.patch
          4 kB
          Harsha


    People

      Assignee: Harsha (sriharsha)
      Reporter: Ryan Berdeen (rberdeen)
      Reviewer: Jun Rao
      Votes: 0
      Watchers: 6

    Dates

      Created:
      Updated:
      Resolved: