Mike: I'd hate to add yet another file just for this purpose. Long-term it's perhaps worth it. Short-term, for the HDFS use case, it would be enough to provide a method to write a header and a trailer. Codecs that can seek/overwrite would just use the header; codecs that can't would use both.
I think that's a good plan – abstract the header write/read methods so that another codec can easily subclass to change how/where these are written. I think Lucene's default (standard) codec should continue to do what it does now? And then HDFS can take the standard codec, and subclass StandardTermsDictWriter/Reader to put the header at the end.
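To make the header/trailer split concrete, here is a minimal sketch in plain Java. Every name in it (HeaderTrailerSketch, Writer, SeekingWriter, AppendOnlyWriter, storeDirStart, readDirStart) is hypothetical and is not the StandardTermsDictWriter/Reader API; an in-memory ByteBuffer stands in for the index file. The only point is that the base class asks a hook where the directory pointer goes, and a subclass decides between "seek back into the header" and "append a trailer".

```java
import java.nio.ByteBuffer;

class HeaderTrailerSketch {
    /** Hypothetical base writer: subclasses decide where the directory pointer is stored. */
    static abstract class Writer {
        abstract int headerSize();
        abstract int trailerSize();
        abstract void storeDirStart(ByteBuffer file, long dirStart);

        byte[] write(byte[] body, byte[] dir) {
            ByteBuffer file = ByteBuffer.allocate(
                headerSize() + body.length + dir.length + trailerSize());
            file.position(headerSize());      // leave room for a header, if any
            file.put(body);
            long dirStart = file.position();  // only known after the body is written
            file.put(dir);
            storeDirStart(file, dirStart);
            return file.array();
        }
    }

    /** Seekable case: reserve 8 bytes up front, then overwrite them at the end. */
    static class SeekingWriter extends Writer {
        int headerSize() { return 8; }
        int trailerSize() { return 0; }
        void storeDirStart(ByteBuffer file, long dirStart) {
            file.putLong(0, dirStart);        // absolute put = seek back and overwrite
        }
    }

    /** Append-only case: never go back, write the pointer as the last 8 bytes. */
    static class AppendOnlyWriter extends Writer {
        int headerSize() { return 0; }
        int trailerSize() { return 8; }
        void storeDirStart(ByteBuffer file, long dirStart) {
            file.putLong(dirStart);           // relative put at the current end
        }
    }

    /** The reader must mirror the writer's choice of location. */
    static long readDirStart(byte[] file, boolean pointerInTrailer) {
        ByteBuffer b = ByteBuffer.wrap(file);
        return pointerInTrailer ? b.getLong(file.length - 8) : b.getLong(0);
    }
}
```

Note that the trailer variant only works if the reader can trust the file length, which is exactly the question raised next.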
Codecs that operate on filesystems with unreliable fileLength could write a sync marker before the trailer, and there could be a back-tracking mechanism that starts from the reported fileLength and then tries to find the sync marker (reading back, and/or ahead).
Can't we just use the current standard codec's approach by default? Back-tracking seems dangerous. E.g., what if .fileLength() is too small on such filesystems?
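A rough sketch of the backward scan, and of why a too-small length is a problem for it. This is illustrative only: SyncMarkerSketch, the 4-byte SYNC value and findSyncBackwards are made-up names, a byte array stands in for the file, and nothing here stops real data from containing the marker bytes by chance.

```java
class SyncMarkerSketch {
    // Hypothetical marker value, chosen arbitrarily for this sketch.
    static final byte[] SYNC = {(byte) 0xCA, (byte) 0xFE, (byte) 0xD0, (byte) 0x0D};

    /**
     * Scans backwards from the reported length for the last occurrence of SYNC.
     * Returns the marker's offset, or -1 if it is not found at or before that point.
     */
    static int findSyncBackwards(byte[] file, int reportedLength) {
        int limit = Math.min(reportedLength, file.length);
        for (int start = limit - SYNC.length; start >= 0; start--) {
            boolean match = true;
            for (int i = 0; i < SYNC.length; i++) {
                if (file[start + i] != SYNC[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

If the reported length is at or past the real end, the scan finds the marker. If it falls short of the marker, scanning backwards alone returns -1, so the reader would also have to read ahead past the reported length, with no obvious bound on how far.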
Does this make it possible to add a good checksum?
A codec could easily do this today – it's orthogonal to using HDFS. E.g., Lucene already has a ChecksumIndexOutput/Input, so this should be a simple cutover in the standard codec (though we would need to fix up the classes, e.g. to add a "get me the IndexOutput/Input" method that a subclass could override).
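For illustration, the same wrapping idea using only the JDK's java.util.zip classes rather than Lucene's ChecksumIndexOutput/Input, whose exact API is not shown here. ChecksumTrailerSketch, writeWithChecksum and verify are hypothetical names, and the layout (payload followed by an 8-byte CRC32 value) is an assumption made for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

class ChecksumTrailerSketch {
    /** Writes the payload through a checksumming wrapper, then appends the CRC32 as a trailer. */
    static byte[] writeWithChecksum(byte[] payload) {
        try {
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            CheckedOutputStream checked = new CheckedOutputStream(raw, new CRC32());
            checked.write(payload);
            long crc = checked.getChecksum().getValue();
            // Write the trailer to the raw stream so it is not part of the checksum.
            new DataOutputStream(raw).writeLong(crc);
            return raw.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Recomputes the CRC32 over everything except the last 8 bytes and compares. */
    static boolean verify(byte[] file) {
        if (file.length < 8) {
            return false;
        }
        CRC32 crc = new CRC32();
        crc.update(file, 0, file.length - 8);
        long stored = ByteBuffer.wrap(file).getLong(file.length - 8);
        return crc.getValue() == stored;
    }
}
```

Because the checksum is appended rather than patched into a header, this layout needs no seek on the write side, which is why it fits an append-only filesystem as well as a seekable one.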