Attached a draft patch for a first version of this for early feedback. A few details remain to be worked out.
This patch removes the per-data-directory .kafka_cleanshutdown file as well as the concept of a "clean shutdown". The concept of clean shutdown is replaced with the concept of a "recovery point". The recovery point is the offset from which the log must be recovered. Recovery points are checkpointed in a per-data-directory file called recovery-point-offset-checkpoint, which uses the normal offset checkpoint file format.
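For concreteness, the offset checkpoint format is a simple text file: a version line, a line with the number of entries, then one "topic partition offset" line per partition. A minimal sketch of a reader for such a file (class and method names here are illustrative, not the actual Kafka classes):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Illustrative reader for a recovery-point-offset-checkpoint file:
//   line 1: format version (0)
//   line 2: number of entries
//   then one "topic partition offset" line per partition
public class RecoveryPointCheckpoint {
    public static Map<String, Long> read(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file)) {
            int version = Integer.parseInt(in.readLine().trim());
            if (version != 0)
                throw new IOException("Unrecognized checkpoint version: " + version);
            int expected = Integer.parseInt(in.readLine().trim());
            Map<String, Long> recoveryPoints = new HashMap<>();
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                // key is "topic-partition", value is the recovery point offset
                recoveryPoints.put(parts[0] + "-" + parts[1], Long.parseLong(parts[2]));
            }
            if (recoveryPoints.size() != expected)
                throw new IOException("Expected " + expected + " entries, found "
                                      + recoveryPoints.size());
            return recoveryPoints;
        }
    }
}
```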
Previously we always recovered the last log segment unless a clean shutdown was recorded. Now we recover from the recovery point--which may mean recovering many segments. We do not, however, recover partial segments: if the recovery point falls in the middle of a segment we recover that segment from the beginning.
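The segment-alignment rule above can be sketched as follows: given the (sorted) base offsets of the log's segments and a recovery point, recovery starts at the base offset of the segment containing the recovery point, so a recovery point that falls mid-segment means replaying that whole segment (names are illustrative, not the patch's actual code):

```java
import java.util.*;

// Illustrative sketch of segment-aligned recovery: recovery covers every
// segment from the one containing the recovery point through the last one.
public class RecoveryPlanner {
    public static List<Long> segmentsToRecover(TreeMap<Long, ?> segmentsByBaseOffset,
                                               long recoveryPoint) {
        // base offset of the segment containing the recovery point
        Long startBase = segmentsByBaseOffset.floorKey(recoveryPoint);
        if (startBase == null)
            startBase = segmentsByBaseOffset.firstKey(); // point precedes all segments
        // that segment plus everything after it must be recovered
        return new ArrayList<>(segmentsByBaseOffset.tailMap(startBase, true).keySet());
    }
}
```

Note that a recovery point of 150 with segments based at 0, 100, and 200 recovers the segments at 100 and 200 in full, never a partial segment.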
On shutdown we force a flush and checkpoint, which has the same effect the .kafka_cleanshutdown file had before.
Deleting the recovery-point-offset-checkpoint file will cause a full recovery of the log on restart, which is a nice feature if you suspect any kind of corruption in the log.
Log.flush now takes an offset argument and flushes from the recovery point up to the given offset. This allows more granular control to avoid syncing (and hence locking) the active segment.
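A rough sketch of that flush logic (a stand-in model, not the patch itself): only segments covering offsets in [recoveryPoint, offset) are synced, and the recovery point then advances, so passing an offset at or below the active segment's base offset never touches the active segment:

```java
import java.util.*;

// Illustrative model of an offset-bounded flush: sync only the segments
// needed to make [recoveryPoint, offset) durable, then advance the
// recovery point (which would then be checkpointed to disk).
public class FlushSketch {
    private final TreeSet<Long> segmentBaseOffsets = new TreeSet<>();
    private final List<Long> fsyncedSegments = new ArrayList<>(); // stand-in for fsync calls
    private long recoveryPoint = 0L;

    public FlushSketch(long... baseOffsets) {
        for (long b : baseOffsets) segmentBaseOffsets.add(b);
    }

    public void flush(long offset) {
        if (offset <= recoveryPoint)
            return; // already durable up to here
        // segment containing the recovery point
        Long start = segmentBaseOffsets.floor(recoveryPoint);
        if (start == null)
            start = segmentBaseOffsets.first();
        // sync every segment whose base offset is below the target offset
        for (long base : segmentBaseOffsets.subSet(start, true, offset, false))
            fsyncedSegments.add(base); // stand-in for segment.fsync()
        recoveryPoint = offset;        // then checkpoint the new recovery point
    }

    public long recoveryPoint() { return recoveryPoint; }
    public List<Long> fsynced() { return fsyncedSegments; }
}
```

With segments based at 0, 100, and 200 (200 being active), flush(200) syncs only the two completed segments and leaves the active one alone.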
Log.roll() now uses the scheduler to make its flush asynchronous. This flush now only covers up to the segment that is just completed, not the newly created segment, so there should be no locking of the active segment any more.
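The shape of that change can be sketched like this (illustrative only, not the patch's code): on roll, the flush of the just-completed segments is handed to a background scheduler rather than done inline, and it covers only offsets below the new segment's base offset:

```java
import java.util.concurrent.*;

// Illustrative sketch: roll() schedules an asynchronous flush that covers
// only offsets below the new segment's base offset, so the newly created
// active segment is never synced or locked by this flush.
public class AsyncRollFlush {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private volatile long flushedUpTo = 0L;

    // called at roll time; the flush itself runs later on the scheduler thread
    public void roll(long newSegmentBaseOffset) {
        scheduler.submit(() -> {
            // stand-in for flushing all completed segments below the new base offset
            flushedUpTo = Math.max(flushedUpTo, newSegmentBaseOffset);
        });
    }

    public long flushedUpTo() { return flushedUpTo; }

    // drain pending flushes (e.g. on shutdown)
    public void awaitFlushes() {
        scheduler.shutdown();
        try {
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```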
The per-topic flush policy based on message count and time still remains, but it now defaults to off, so by default we rely only on the OS's background flushing of the pagecache.
I did some preliminary performance testing and we can indeed run with no application-level flush policy with reasonable latency which is both convenient (no tuning to do) and yields much better throughput. I will do more testing and report results.