Lucene - Core
  1. Lucene - Core
  2. LUCENE-3237

FSDirectory.fsync() may not work properly

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from LUCENE-3230. FSDirectory.fsync() opens a new RAF, sync() its FileDescriptor and closes RAF. It is not clear that this syncs whatever was written to the file by other FileDescriptors. It would be better if we do this operation on the actual RAF/FileOS which wrote the data. We can add sync() to IndexOutput and FSIndexOutput will do that.

      Directory-wise, we should stop syncing on file names, and instead sync on the IOs that performed the write operations.

      1. LUCENE-3237.patch
        30 kB
        Michael McCandless

        Activity

        Hide
        Simon Willnauer added a comment -

        Ref counting may be overkill? Who else will be pulling/sharing this
        sync handle? Maybe we can add a "IndexOutput.closeToSyncHandle", the
        IndexOutput flushes and is unusable from then on, but returns the sync
        handle which the caller must later close.

        good!

        One downside of moving to this API is ... it rules out writing some
        bytes, fsyncing, writing some more, fsyncing, e.g. if we wanted to add
        a transaction log impl on top of Lucene. But I think that's OK
        (design for today). There are other limitations in IndexOuput for
        xlog impl...

        I don't see what keeps us from adding a sync method to IndexOutput that allows us to bytes, fsyncing, writing some more, fsyncing. I think we should make this change nevertheless. This can go in today I independent from where we use it.

        Yeah we can pursue this in "phase 2".

        agreed

        Show
        Simon Willnauer added a comment - Ref counting may be overkill? Who else will be pulling/sharing this sync handle? Maybe we can add a "IndexOutput.closeToSyncHandle", the IndexOutput flushes and is unusable from then on, but returns the sync handle which the caller must later close. good! One downside of moving to this API is ... it rules out writing some bytes, fsyncing, writing some more, fsyncing, e.g. if we wanted to add a transaction log impl on top of Lucene. But I think that's OK (design for today). There are other limitations in IndexOuput for xlog impl... I don't see what keeps us from adding a sync method to IndexOutput that allows us to bytes, fsyncing, writing some more, fsyncing. I think we should make this change nevertheless. This can go in today I independent from where we use it. Yeah we can pursue this in "phase 2". agreed
        Hide
        Michael McCandless added a comment -

        Thanks Simon.

        Hey mike, thanks for reopening this.

        I actually didn't reopen yet ... because I do think this really is
        paranoia. The OS man pages make the semantics clear, and what we are
        doing today (reopen the file for syncing) is correct.

        I like the fact that we get rid of the general unsynced files stuff in Directory.

        given the last point we move it in the right place inside IW that is where it should be

        Yeah I really like that.

        But, we could do that separately, i.e. add private tracking inside IW
        of which newly written file names haven't been sync'd.

        the problem that the current patch has is that is holds on to the buffers in BufferedIndexOutput. I think we need to work around this here are a couple of ideas:

        introduce a SyncHandle class that we can pull from IndexOutput that allows to close the IndexOutput but lets you fsync after the fact

        I think that's a good idea. For FSDir impls this is just a thin
        wrapper around FileDescriptor.

        this handle can be refcounted internally and we just decrement the count on IndexOutput#close() as well as on SyncHandle#close()

        we can just hold on to the SyncHandle until we need to sync in IW

        Ref counting may be overkill? Who else will be pulling/sharing this
        sync handle? Maybe we can add a "IndexOutput.closeToSyncHandle", the
        IndexOutput flushes and is unusable from then on, but returns the sync
        handle which the caller must later close.

        One downside of moving to this API is ... it rules out writing some
        bytes, fsyncing, writing some more, fsyncing, e.g. if we wanted to add
        a transaction log impl on top of Lucene. But I think that's OK
        (design for today). There are other limitations in IndexOuput for
        xlog impl...

        since this will basically close the underlying FD later we might want to think about size-bounding the number of unsynced files and maybe let indexing threads fsync them concurrently? maybe something we can do later.

        if we know we flush for commit we can already fsync directly which might safe resources / time since it might be concurrent

        Yeah we can pursue this in "phase 2". The OS will generally move
        dirty buffers to stable storage anyway over time, so the cost of
        fsyncing files written (relatively) long ago (10s of seconds; on linux
        I think the default is usually 30 seconds) will usually be low. The
        problem is on some filesystems fsync can be unexpectedly costly (there
        was a "famous" case in ext3
        https://bugzilla.mozilla.org/show_bug.cgi?id=421482 but this has been
        fixed), so we need to be careful about this.

        Show
        Michael McCandless added a comment - Thanks Simon. Hey mike, thanks for reopening this. I actually didn't reopen yet ... because I do think this really is paranoia. The OS man pages make the semantics clear, and what we are doing today (reopen the file for syncing) is correct. I like the fact that we get rid of the general unsynced files stuff in Directory. given the last point we move it in the right place inside IW that is where it should be Yeah I really like that. But, we could do that separately, i.e. add private tracking inside IW of which newly written file names haven't been sync'd. the problem that the current patch has is that is holds on to the buffers in BufferedIndexOutput. I think we need to work around this here are a couple of ideas: introduce a SyncHandle class that we can pull from IndexOutput that allows to close the IndexOutput but lets you fsync after the fact I think that's a good idea. For FSDir impls this is just a thin wrapper around FileDescriptor. this handle can be refcounted internally and we just decrement the count on IndexOutput#close() as well as on SyncHandle#close() we can just hold on to the SyncHandle until we need to sync in IW Ref counting may be overkill? Who else will be pulling/sharing this sync handle? Maybe we can add a "IndexOutput.closeToSyncHandle", the IndexOutput flushes and is unusable from then on, but returns the sync handle which the caller must later close. One downside of moving to this API is ... it rules out writing some bytes, fsyncing, writing some more, fsyncing, e.g. if we wanted to add a transaction log impl on top of Lucene. But I think that's OK (design for today). There are other limitations in IndexOuput for xlog impl... since this will basically close the underlying FD later we might want to think about size-bounding the number of unsynced files and maybe let indexing threads fsync them concurrently? maybe something we can do later. if we know we flush for commit we can already fsync directly which might safe resources / time since it might be concurrent Yeah we can pursue this in "phase 2". The OS will generally move dirty buffers to stable storage anyway over time, so the cost of fsyncing files written (relatively) long ago (10s of seconds; on linux I think the default is usually 30 seconds) will usually be low. The problem is on some filesystems fsync can be unexpectedly costly (there was a "famous" case in ext3 https://bugzilla.mozilla.org/show_bug.cgi?id=421482 but this has been fixed), so we need to be careful about this.
        Hide
        Simon Willnauer added a comment -

        Hey mike, thanks for reopening this. I like the patch since it fixes multiple issues.

        • I like the fact that we get rid of the general unsynced files stuff in Directory.
        • given the last point we move it in the right place inside IW that is where it should be
        • the problem that the current patch has is that is holds on to the buffers in BufferedIndexOutput. I think we need to work around this here are a couple of ideas:
          • introduce a SyncHandle class that we can pull from IndexOutput that allows to close the IndexOutput but lets you fsync after the fact
          • this handle can be refcounted internally and we just decrement the count on IndexOutput#close() as well as on SyncHandle#close()
          • we can just hold on to the SyncHandle until we need to sync in IW
          • since this will basically close the underlying FD later we might want to think about size-bounding the number of unsynced files and maybe let indexing threads fsync them concurrently? maybe something we can do later.
          • if we know we flush for commit we can already fsync directly which might safe resources / time since it might be concurrent

        just a couple of ideas....

        Show
        Simon Willnauer added a comment - Hey mike, thanks for reopening this. I like the patch since it fixes multiple issues. I like the fact that we get rid of the general unsynced files stuff in Directory. given the last point we move it in the right place inside IW that is where it should be the problem that the current patch has is that is holds on to the buffers in BufferedIndexOutput. I think we need to work around this here are a couple of ideas: introduce a SyncHandle class that we can pull from IndexOutput that allows to close the IndexOutput but lets you fsync after the fact this handle can be refcounted internally and we just decrement the count on IndexOutput#close() as well as on SyncHandle#close() we can just hold on to the SyncHandle until we need to sync in IW since this will basically close the underlying FD later we might want to think about size-bounding the number of unsynced files and maybe let indexing threads fsync them concurrently? maybe something we can do later. if we know we flush for commit we can already fsync directly which might safe resources / time since it might be concurrent just a couple of ideas....
        Hide
        Michael McCandless added a comment -

        In fact, fsync syncs the whole file, because it relies on fsync() POSIX API or FlushFileBuffers() in Windows. Both really sync the file the descriptor is pointing to. Those functions don't sync the descriptor's buffers only.

        This is my impression as well, and as Yonik said, it's hard to imagine any [sane] operating system doing it differently ... so this really is paranoia.

        FSDirectory.FSIndexOutput#sync() should call flush() before syncing the underlying file.

        OK I'll move it there (I'm currently doing it in the first close "attempt").

        This does not do the for-loop we currently do to repeat the fsync 5 times if it fails.

        I'll add an IOUtils.sync that takes an fd and does the retry thing.

        Also, I would not remove Directory.sync(), we should maybe leave this for LUCENE-5588 to sync the directory itsself.

        Right, we should add it back, as a method taking no file args? Its purpose would be LUCENE-5588.

        Show
        Michael McCandless added a comment - In fact, fsync syncs the whole file, because it relies on fsync() POSIX API or FlushFileBuffers() in Windows. Both really sync the file the descriptor is pointing to. Those functions don't sync the descriptor's buffers only. This is my impression as well, and as Yonik said, it's hard to imagine any [sane] operating system doing it differently ... so this really is paranoia. FSDirectory.FSIndexOutput#sync() should call flush() before syncing the underlying file. OK I'll move it there (I'm currently doing it in the first close "attempt"). This does not do the for-loop we currently do to repeat the fsync 5 times if it fails. I'll add an IOUtils.sync that takes an fd and does the retry thing. Also, I would not remove Directory.sync(), we should maybe leave this for LUCENE-5588 to sync the directory itsself. Right, we should add it back, as a method taking no file args? Its purpose would be LUCENE-5588 .
        Hide
        Uwe Schindler added a comment -

        Hi another issue:

        In FSIndexOutput:

        file.getFD().sync();
        

        This does not do the for-loop we currently do to repeat the fsync 5 times if it fails. We should maybe add this here, too. Also, I would not remove Directory.sync(), we should maybe leave this for LUCENE-5588 to sync the directory itsself. But the method signature would change in any case.

        Show
        Uwe Schindler added a comment - Hi another issue: In FSIndexOutput: file.getFD().sync(); This does not do the for-loop we currently do to repeat the fsync 5 times if it fails. We should maybe add this here, too. Also, I would not remove Directory.sync(), we should maybe leave this for LUCENE-5588 to sync the directory itsself. But the method signature would change in any case.
        Hide
        Uwe Schindler added a comment -

        Hi Michael McCandless,
        FSDirectory.FSIndexOutput#sync() should call flush() before syncing the underlying file. Otherwise it could happen that not all buffers of BufferedIndexOutput are flushed. Maybe other code does this already, but to be safe, it should definitely call this!

        Show
        Uwe Schindler added a comment - Hi Michael McCandless , FSDirectory.FSIndexOutput#sync() should call flush() before syncing the underlying file. Otherwise it could happen that not all buffers of BufferedIndexOutput are flushed. Maybe other code does this already, but to be safe, it should definitely call this!
        Hide
        Uwe Schindler added a comment -

        Hi, I checked in the other issue already source code of OpenJDK.
        https://issues.apache.org/jira/browse/LUCENE-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964623#comment-13964623

        In fact, fsync syncs the whole file, because it relies on fsync() POSIX API or FlushFileBuffers() in Windows. Both really sync the file the descriptor is pointing to. Those functions don't sync the descriptor's buffers only.

        Show
        Uwe Schindler added a comment - Hi, I checked in the other issue already source code of OpenJDK. https://issues.apache.org/jira/browse/LUCENE-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964623#comment-13964623 In fact, fsync syncs the whole file, because it relies on fsync() POSIX API or FlushFileBuffers() in Windows. Both really sync the file the descriptor is pointing to. Those functions don't sync the descriptor's buffers only.
        Hide
        Michael McCandless added a comment -

        While it's hard to imagine an OS that would (could?) separately track
        which (of possibly many) file handles had made changes to the dirty
        buffers in the OS's IO cache for a given file, such that closing the
        open (for write) file, opening it again later, fsync'ing that, would
        actually fail to move some bytes to durable storage, it has always
        made me a bit nervous that Lucene closes a file and later reopens it
        to fsync. This may simply be paranoia

        Still, I tried a prototype patch:

        • Removed Directory.sync(Collection<String>) and added
          IndexOutput.sync. Ie, you now must sync before closing the
          IndexOutput.
        • Fixed IndexWriter to prevent .close() of all IndexOutputs (it
          wraps the incoming Directory to do this), until 1) the output is
          sync'd, or 2) the file is deleted.
        • During commit, IndexWriter retrieves the still-held-open
          IndexOutputs and asks them to sync.

        This also has the nice side effect of removing the sneaky stale file
        tracking that FSDirectory was doing (we had wanted to remove this in
        LUCENE-5570).

        One downside of this approach is it means IndexWriter holds file
        descriptors open, one per written but not sync'd/deleted file, so it
        sort of acts like an IndexReader in this regard. Though, the count
        will be lower, since it's only newly written files since open/last
        commit.

        Another big problem with this approach is: if you open near-real-time
        readers, this now means we are opening IndexInputs when there are
        still IndexOutputs open against those files, which will probably make
        some Directory impls unhappy. (I had to turn off this check in MDW).

        The patch is not well tested but I did get it to the point where
        TestDemo passes...

        There are serious nocommits too, e.g. NRTCachingDir is now broken, and
        MDW now longer crashes on close.

        Show
        Michael McCandless added a comment - While it's hard to imagine an OS that would (could?) separately track which (of possibly many) file handles had made changes to the dirty buffers in the OS's IO cache for a given file, such that closing the open (for write) file, opening it again later, fsync'ing that, would actually fail to move some bytes to durable storage, it has always made me a bit nervous that Lucene closes a file and later reopens it to fsync. This may simply be paranoia Still, I tried a prototype patch: Removed Directory.sync(Collection<String>) and added IndexOutput.sync. Ie, you now must sync before closing the IndexOutput. Fixed IndexWriter to prevent .close() of all IndexOutputs (it wraps the incoming Directory to do this), until 1) the output is sync'd, or 2) the file is deleted. During commit, IndexWriter retrieves the still-held-open IndexOutputs and asks them to sync. This also has the nice side effect of removing the sneaky stale file tracking that FSDirectory was doing (we had wanted to remove this in LUCENE-5570 ). One downside of this approach is it means IndexWriter holds file descriptors open, one per written but not sync'd/deleted file, so it sort of acts like an IndexReader in this regard. Though, the count will be lower, since it's only newly written files since open/last commit. Another big problem with this approach is: if you open near-real-time readers, this now means we are opening IndexInputs when there are still IndexOutputs open against those files, which will probably make some Directory impls unhappy. (I had to turn off this check in MDW). The patch is not well tested but I did get it to the point where TestDemo passes... There are serious nocommits too, e.g. NRTCachingDir is now broken, and MDW now longer crashes on close.
        Hide
        Shai Erera added a comment -

        Closing. If we ever see that this actually is a problem, we can reopen.

        Show
        Shai Erera added a comment - Closing. If we ever see that this actually is a problem, we can reopen.
        Hide
        Simon Willnauer added a comment -

        Shai, I think we should close this. we can still reopen if we run into issues?

        Show
        Simon Willnauer added a comment - Shai, I think we should close this. we can still reopen if we run into issues?
        Hide
        Shai Erera added a comment -

        It could be, and generally I think that it makes sense. Only the Java documentation is not so explicit about it - I wish it was. Because if Java was explicit, it would mean we wouldn't need to wonder what do the "man pages" say about this on Windows, AIX etc.

        If others think that it makes sense that sync() affects all existing buffers, then I don't mind if we close that issue. We haven't seen any problems with this implementation so far, so it kinda reassuring ...

        Show
        Shai Erera added a comment - It could be, and generally I think that it makes sense. Only the Java documentation is not so explicit about it - I wish it was. Because if Java was explicit, it would mean we wouldn't need to wonder what do the "man pages" say about this on Windows, AIX etc. If others think that it makes sense that sync() affects all existing buffers, then I don't mind if we close that issue. We haven't seen any problems with this implementation so far, so it kinda reassuring ...
        Hide
        Yonik Seeley added a comment -

        The man pages on fsync (which I assume FileDescriptor.sync() uses seem to say that it affects all in-core modified copies of buffers (as opposed to just modifications done through that specific descriptor). If you think about it, it's really the only sane behavior.

        Show
        Yonik Seeley added a comment - The man pages on fsync (which I assume FileDescriptor.sync() uses seem to say that it affects all in-core modified copies of buffers (as opposed to just modifications done through that specific descriptor). If you think about it, it's really the only sane behavior.

          People

          • Assignee:
            Unassigned
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development