Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8048

Filesystems do not guarantee order of directories updates

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently when index is written to disk the following sequence of events is taking place:

      • write segment file
      • sync segment file
      • write segment file
      • sync segment file
        ...
      • write list of segments
      • sync list of segments
      • rename list of segments
      • sync index directory

      This sequence leads to potential window of opportunity for system to crash after 'rename list of segments' but before 'sync index directory' and depending on exact filesystem implementation this may potentially lead to 'list of segments' being visible in directory while some of the segments are not.

      Solution to this is to sync index directory after all segments have been written. This commit shows idea implemented. I'm fairly certain that I didn't find all the places this may be potentially happening.

      1. LUCENE-8048.patch
        0.6 kB
        Erick Erickson
      2. Screen Shot 2017-11-22 at 12.34.51 PM.png
        86 kB
        Erick Erickson

        Activity

        Hide
        erickerickson Erick Erickson added a comment -

        But this is when I have spare time ...

        Point taken, it can wait until next week. I'm a bit worried that the test case that now fails indicate something wouldn't be right anyway, it'd be good to give people a chance to look if I can't find anything...

        Show
        erickerickson Erick Erickson added a comment - But this is when I have spare time ... Point taken, it can wait until next week. I'm a bit worried that the test case that now fails indicate something wouldn't be right anyway, it'd be good to give people a chance to look if I can't find anything...
        Hide
        rcmuir Robert Muir added a comment -

        Erick, can you hold off a bit. I don't think its a good idea to rush in changes to commit over a holiday, these changes really need review.

        Show
        rcmuir Robert Muir added a comment - Erick, can you hold off a bit. I don't think its a good idea to rush in changes to commit over a holiday, these changes really need review.
        Hide
        erickerickson Erick Erickson added a comment -

        This change causes TestIndexWriter.testExceptionsDuringCommit() to fail.

        The first two conditions fail:
        new FailOnlyInCommit(false, FailOnlyInCommit.PREPARE_STAGE), // fail during global field map is written
        new FailOnlyInCommit(true, FailOnlyInCommit.PREPARE_STAGE), // fail after global field map is written
        new FailOnlyInCommit(false, FailOnlyInCommit.FINISH_STAGE) // fail while running finishCommit

        Are these two conditions something that should fail with this change? I've attached a screen shot of the exception in question. With this change, this exception is not thrown.

        Show
        erickerickson Erick Erickson added a comment - This change causes TestIndexWriter.testExceptionsDuringCommit() to fail. The first two conditions fail: new FailOnlyInCommit(false, FailOnlyInCommit.PREPARE_STAGE), // fail during global field map is written new FailOnlyInCommit(true, FailOnlyInCommit.PREPARE_STAGE), // fail after global field map is written new FailOnlyInCommit(false, FailOnlyInCommit.FINISH_STAGE) // fail while running finishCommit Are these two conditions something that should fail with this change? I've attached a screen shot of the exception in question. With this change, this exception is not thrown.
        Hide
        erickerickson Erick Erickson added a comment -

        So it seems like the consensus is to go ahead and make this change. one-line patch attached, I'll run precommit and test on it but don't expect any issues.

        Any objections to checking this in? If not I'll take care of it over the next day or two.

        Show
        erickerickson Erick Erickson added a comment - So it seems like the consensus is to go ahead and make this change. one-line patch attached, I'll run precommit and test on it but don't expect any issues. Any objections to checking this in? If not I'll take care of it over the next day or two.
        Hide
        mikemccand Michael McCandless added a comment -

        +1 to fsync directory metadata after writing segments and before writing segments file.

        Show
        mikemccand Michael McCandless added a comment - +1 to fsync directory metadata after writing segments and before writing segments file.
        Hide
        rcmuir Robert Muir added a comment -

        I think any filesystem that behaves in this way on crash is choosing to corrupt users data by reflecting some later directory entries but not former ones.

        But linux filesystems are straight up buggy as shit around this functionality so it wouldnt surprise me. So like i said, i like the extra sync before rename to try to coerce better behavior on crash during fsync(dir).

        Show
        rcmuir Robert Muir added a comment - I think any filesystem that behaves in this way on crash is choosing to corrupt users data by reflecting some later directory entries but not former ones. But linux filesystems are straight up buggy as shit around this functionality so it wouldnt surprise me. So like i said, i like the extra sync before rename to try to coerce better behavior on crash during fsync(dir).
        Hide
        mar-kolya Nikolay Martynov added a comment - - edited

        Just to clarify:

        • On Linux directory fsync does work, but it doesn't solve the problem currently because this fsync currently happens after segment list has been created. Essentially guarantees about fsync (weak as they are) are in case of failure: before fsync you can see changes in any combination, after fsync you are guaranteed to see exactly what was written.
        • This can potentially affect any FS that uses non trivial storage for directories (which is pretty much everything these days). Word on the internet is that btrfs is capable of doing out of order directory writes.
        • 'kernel automatically detects rename pattern' - I think this only works on some FSs (ext4) and only if certain mount options are present (auto_da_alloc). And I think generally this is about syncing file data with directory, not syncing directory as a whole on rename.
        Show
        mar-kolya Nikolay Martynov added a comment - - edited Just to clarify: On Linux directory fsync does work, but it doesn't solve the problem currently because this fsync currently happens after segment list has been created. Essentially guarantees about fsync (weak as they are) are in case of failure: before fsync you can see changes in any combination, after fsync you are guaranteed to see exactly what was written. This can potentially affect any FS that uses non trivial storage for directories (which is pretty much everything these days). Word on the internet is that btrfs is capable of doing out of order directory writes. 'kernel automatically detects rename pattern' - I think this only works on some FSs (ext4) and only if certain mount options are present (auto_da_alloc). And I think generally this is about syncing file data with directory, not syncing directory as a whole on rename.
        Hide
        thetaphi Uwe Schindler added a comment - - edited

        The guarantees about fsync are weak anyways (as Robert said), and on top - fsyncing directory metadata is a hack, the Java API does not allow to do it via API, you need a hack with a file handle - but it works in our testing (Michael McCandless had/has a test computer with a remote powerswitch to stress all this for weeks). The directory sync is at least documented in Linux Man Pages, for other UNIX-ial operating systems it not defined (lack of POSIX standard for it).

        In short:

        • On Linux, the fsync on the directory really works, but we only know about usual file systems (ext4 and xfs I think)
        • In addition because the atomic rename use case is very common in Unix world to commit stuff, the kernel already does the right thing. If it sees an atomic rename, it automatically fsyncs directory under certain conditions (read source code). Robert Muir, is this right? - it's long ago when I last looked at that code!
        • On MacOSX/Solaris the same applies like for linux, although it does not have the automatism in kernel. And we don't know if fsyncing directory is really done for all file systems. The Man page does not say anything and POSIX does not define it.
        • On Windows, the fsync on directory does not work at all (it is a no-op in Lucene -> we have a try-catch around it with an assertion on Windows in the exception block). But Windows file systems guarantee that after the atomic rename the directory is in an consistent state (it's documented). Happens-before also works.
        Show
        thetaphi Uwe Schindler added a comment - - edited The guarantees about fsync are weak anyways (as Robert said), and on top - fsyncing directory metadata is a hack, the Java API does not allow to do it via API, you need a hack with a file handle - but it works in our testing ( Michael McCandless had/has a test computer with a remote powerswitch to stress all this for weeks). The directory sync is at least documented in Linux Man Pages, for other UNIX-ial operating systems it not defined (lack of POSIX standard for it). In short: On Linux, the fsync on the directory really works, but we only know about usual file systems (ext4 and xfs I think) In addition because the atomic rename use case is very common in Unix world to commit stuff, the kernel already does the right thing. If it sees an atomic rename, it automatically fsyncs directory under certain conditions (read source code). Robert Muir , is this right? - it's long ago when I last looked at that code! On MacOSX/Solaris the same applies like for linux, although it does not have the automatism in kernel. And we don't know if fsyncing directory is really done for all file systems. The Man page does not say anything and POSIX does not define it. On Windows, the fsync on directory does not work at all (it is a no-op in Lucene -> we have a try-catch around it with an assertion on Windows in the exception block). But Windows file systems guarantee that after the atomic rename the directory is in an consistent state (it's documented). Happens-before also works.
        Hide
        rcmuir Robert Muir added a comment -

        Hi Nikolay Martynov, I like the idea of the patch. Actually not much about fsync is guaranteed on any OS, anywhere.So i agree with having an extra dir sync before the rename, just to be safe, although i don't know what filesystems might be affected by this.

        Show
        rcmuir Robert Muir added a comment - Hi Nikolay Martynov , I like the idea of the patch. Actually not much about fsync is guaranteed on any OS, anywhere.So i agree with having an extra dir sync before the rename, just to be safe, although i don't know what filesystems might be affected by this.
        Hide
        erickerickson Erick Erickson added a comment -

        Not sure how related these two are but they're at least in the same vicinity.

        Show
        erickerickson Erick Erickson added a comment - Not sure how related these two are but they're at least in the same vicinity.

          People

          • Assignee:
            Unassigned
            Reporter:
            mar-kolya Nikolay Martynov
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:

              Development