Affects Version/s: None
Fix Version/s: 1.0.0
The current Kafka implementation starts up slowly after an unclean shutdown: loading a partition can take 10x or more longer than necessary. Here is an explanation with an example:
- Say we have a partition of 20 segments, each holding 250 messages, with offsets starting at 0. Each message is 1 MB.
- The broker is hard-killed and the index file of the first segment is corrupted.
- When the broker starts up and loads the first segment, it detects that the segment's index is corrupted, so it calls `log.recoverSegment(...)` to recover the segment. This method calls `stateManager.truncateAndReload(...)`, which deletes the snapshot files whose offset is larger than the base offset of the first segment. Thus all snapshot files are deleted.
- To rebuild the snapshot files, `log.loadSegmentFiles(...)` has to read every message in this partition, even in segments whose log and index files are not corrupted. This increases the time to load the partition by more than an order of magnitude.
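The cost in this example can be put into numbers with a small toy model. This is only a sketch, not Kafka code: the class and method names are hypothetical, and it assumes one snapshot file at the end offset of every segment, which is a simplification of how snapshots are actually laid out.

```java
import java.util.TreeSet;

// Toy model of the scenario above. All names are hypothetical, not Kafka APIs.
final class SnapshotLossSketch {
    static final int SEGMENTS = 20;
    static final int MESSAGES_PER_SEGMENT = 250;

    // Assumption: one snapshot at the end offset of each segment (250, 500, ..., 5000).
    static TreeSet<Long> initialSnapshots() {
        TreeSet<Long> snapshots = new TreeSet<>();
        for (int i = 1; i <= SEGMENTS; i++) {
            snapshots.add((long) i * MESSAGES_PER_SEGMENT);
        }
        return snapshots;
    }

    // Stand-in for the deletion described above: drop every snapshot
    // whose offset is strictly larger than the given offset.
    static void deleteSnapshotsAbove(TreeSet<Long> snapshots, long offset) {
        snapshots.tailSet(offset, false).clear();
    }

    // Messages read while loading the partition: the first segment is always
    // re-read to rebuild its index, then producer state is replayed from the
    // latest surviving snapshot (or from the end of the first segment).
    static long messagesRead(TreeSet<Long> snapshots) {
        long logEnd = (long) SEGMENTS * MESSAGES_PER_SEGMENT;
        long recovered = MESSAGES_PER_SEGMENT;
        long resumeFrom = snapshots.isEmpty()
                ? recovered
                : Math.max(recovered, snapshots.last());
        return recovered + (logEnd - resumeFrom);
    }

    public static void main(String[] args) {
        TreeSet<Long> snapshots = initialSnapshots();
        long needed = messagesRead(snapshots);   // snapshots intact
        deleteSnapshotsAbove(snapshots, 0L);     // base offset of first segment
        long actual = messagesRead(snapshots);   // all snapshots gone
        System.out.println(needed + " vs " + actual + " messages read");
    }
}
```

Under these assumptions the model reads 250 messages when the snapshots survive and 5000 once they are deleted, a 20x difference. At 1 MB per message that is roughly 250 MB versus 5000 MB for this one partition.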
To address this issue, one simple solution is not to delete the snapshot files whose offset is larger than the given offset if only the index file needs to be rebuilt. More specifically, we should not need to rebuild the producer state snapshot files unless the log file itself is corrupted or truncated.
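A minimal sketch of that rule, assuming segment recovery can report whether it actually truncated the log. The `logTruncated` flag and the method below are hypothetical illustrations, not existing Kafka APIs.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration of the proposed rule; not Kafka's implementation.
final class RecoveryPolicySketch {
    // Returns the snapshot offsets that survive recovery of the segment
    // starting at baseOffset. Snapshots above baseOffset are dropped only
    // when the log file itself was truncated during recovery.
    static TreeSet<Long> snapshotsToKeep(TreeSet<Long> snapshots,
                                         long baseOffset,
                                         boolean logTruncated) {
        TreeSet<Long> kept = new TreeSet<>(snapshots);
        if (logTruncated) {
            kept.tailSet(baseOffset, false).clear();
        }
        return kept;
    }

    public static void main(String[] args) {
        TreeSet<Long> snapshots = new TreeSet<>(Arrays.asList(250L, 500L, 750L));
        // Index-only rebuild: every snapshot is kept.
        System.out.println(snapshotsToKeep(snapshots, 0L, false));
        // Log truncated: snapshots above the base offset are removed.
        System.out.println(snapshotsToKeep(snapshots, 0L, true));
    }
}
```

With this rule, an index-only rebuild leaves the snapshot files in place, so the later segments do not have to be re-read to rebuild producer state.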