Hey Mike, thanks for reopening this.
I actually didn't reopen it yet ... because I do think this really is
paranoia. The OS man pages make the semantics clear, and what we are
doing today (reopening the file for syncing) is correct.
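For reference, the reopen-for-sync approach can be sketched roughly like this in plain java.io (a sketch only, not Lucene's actual FSDirectory code; the class and method names are illustrative):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of reopen-then-fsync: reopen the already-written file just to
// reach a FileDescriptor, fsync, and close. Per the man pages, fsync
// flushes the file's dirty pages regardless of which descriptor wrote
// them, which is why reopening is safe.
public class ReopenSync {
  public static void syncFile(File f) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
      raf.getFD().sync();
    }
  }
}
```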
I like the fact that we get rid of the general unsynced files stuff in Directory.
given the last point, we move it to the right place inside IW, which is where it should be
Yeah I really like that.
But we could do that separately, i.e. add private tracking inside IW
of which newly written file names haven't been sync'd.
the problem with the current patch is that it holds on to the buffers in BufferedIndexOutput. I think we need to work around this; here are a couple of ideas:
introduce a SyncHandle class that we can pull from IndexOutput, which allows you to close the IndexOutput but still fsync after the fact
I think that's a good idea. For FSDir impls this is just a thin
wrapper around FileDescriptor.
this handle can be refcounted internally and we just decrement the count on IndexOutput#close() as well as on SyncHandle#close()
we can just hold on to the SyncHandle until we need to sync in IW
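If I understand the proposal, a minimal sketch might look something like this (all names hypothetical, not from the actual patch; a real FSDirectory impl would wrap the FileDescriptor):

```java
import java.io.Closeable;
import java.io.FileDescriptor;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a ref-counted handle that keeps the
// FileDescriptor reachable so IW can fsync after the IndexOutput
// itself has been closed.
public class SyncHandle implements Closeable {
  private final FileDescriptor fd;
  // One ref held by the IndexOutput, one by whoever pulled the handle;
  // both IndexOutput#close() and SyncHandle#close() decrement it.
  private final AtomicInteger refCount = new AtomicInteger(2);

  public SyncHandle(FileDescriptor fd) {
    this.fd = fd;
  }

  public void sync() throws IOException {
    fd.sync();  // fsync the underlying file
  }

  public boolean isReleased() {
    return refCount.get() <= 0;
  }

  @Override
  public void close() {
    if (refCount.decrementAndGet() == 0) {
      // Last reference: this is where the real impl would release
      // the OS file descriptor.
    }
  }
}
```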
Ref counting may be overkill? Who else will be pulling/sharing this
sync handle? Maybe we can add an "IndexOutput.closeToSyncHandle",
where the IndexOutput flushes and is unusable from then on, but
returns the sync handle which the caller must later close.
One downside of moving to this API is ... it rules out writing some
bytes, fsyncing, writing some more, fsyncing, e.g. if we wanted to add
a transaction log impl on top of Lucene. But I think that's OK
(design for today). There are other limitations in IndexOutput for
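The one-shot variant could look roughly like this, sketched against plain java.io; SketchIndexOutput, writeBytes and the nested SyncHandle are illustrative stand-ins, not Lucene's real API:

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative sketch of the one-shot "closeToSyncHandle" idea.
public class SketchIndexOutput {
  private final FileOutputStream out;
  private boolean closedForWriting;

  public SketchIndexOutput(File f) throws IOException {
    this.out = new FileOutputStream(f);
  }

  public void writeBytes(byte[] b) throws IOException {
    if (closedForWriting) {
      throw new IOException("already closed for writing");
    }
    out.write(b);
  }

  // Flushes, makes the output unusable from then on, and returns the
  // sync handle which the caller must later close.
  public SyncHandle closeToSyncHandle() throws IOException {
    out.flush();
    closedForWriting = true;
    return new SyncHandle(out);  // handle takes ownership of the FD
  }

  public static class SyncHandle implements Closeable {
    private final FileOutputStream out;

    SyncHandle(FileOutputStream out) {
      this.out = out;
    }

    public void sync() throws IOException {
      out.getFD().sync();  // fsync after the output is closed for writes
    }

    @Override
    public void close() throws IOException {
      out.close();  // now the OS file descriptor is really released
    }
  }
}
```

Note how this bakes in the limitation above: once closeToSyncHandle returns, further writes fail, so write/fsync/write-more is off the table.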
since this will basically close the underlying FD later, we might want to think about size-bounding the number of unsynced files, and maybe let indexing threads fsync them concurrently? maybe something we can do later.
if we know we are flushing for a commit we can already fsync directly, which might save resources/time since it can be concurrent
Yeah we can pursue this in "phase 2". The OS will generally move
dirty buffers to stable storage anyway over time, so the cost of
fsyncing files written (relatively) long ago (tens of seconds; on Linux
I think the default is usually 30 seconds) will usually be low. The
problem is that on some filesystems fsync can be unexpectedly costly (there
was a "famous" case in ext3
https://bugzilla.mozilla.org/show_bug.cgi?id=421482 but this has been
fixed), so we need to be careful about this.
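To make the phase-2 idea concrete, here is a rough sketch of fsyncing a batch of unsynced files from a small thread pool instead of serially on one thread (plain FileOutputStreams stand in for Lucene's newly flushed files; all names are illustrative):

```java
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: sync all pending files concurrently, then
// report how many were synced. Any IOException from a sync task
// surfaces via Future#get().
public class ConcurrentSync {
  public static int syncAll(List<FileOutputStream> unsynced, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (FileOutputStream out : unsynced) {
        futures.add(pool.submit(() -> {
          out.getFD().sync();  // fsync this file, possibly concurrently
          out.close();
          return null;
        }));
      }
      for (Future<?> f : futures) {
        f.get();  // propagate any failure from the sync tasks
      }
      return futures.size();
    } finally {
      pool.shutdown();
    }
  }
}
```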