Lucene - Core
LUCENE-4870

Lucene deletes entire index if an exception is thrown due to TooManyOpenFiles and OpenMode.CREATE_OR_APPEND

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.1, 4.2, 3.6.2
    • Fix Version/s: 4.3, 4.2.1, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The Lucene IndexWriter might delete an entire index if it hits a FileNotFoundException triggered by TooManyOpenFiles during IndexWriter creation. If OpenMode.CREATE_OR_APPEND is set (which is the default), we try to figure out whether the index already exists. Yet the logic in DirectoryReader#indexExists(Directory) will just return false if we are not able to open the segments file. This causes the IW to assume there is no index, so it tries to create a new index there, treating this as an OpenMode.CREATE and trashing all existing commit points.
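
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch and not Lucene's code: `IndexExistsModel`, `SegmentsOpener`, `naiveIndexExists`, and `resolve` are hypothetical names, and the "too many open files" condition is simulated by throwing a FileNotFoundException by hand.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class IndexExistsModel {
    enum OpenMode { CREATE, APPEND, CREATE_OR_APPEND }

    /** Hypothetical stand-in for the step that opens the segments file. */
    interface SegmentsOpener {
        void open() throws IOException;
    }

    // Models the problematic check: every I/O failure is read as "no index here".
    static boolean naiveIndexExists(SegmentsOpener opener) {
        try {
            opener.open();
            return true;
        } catch (IOException e) {
            return false; // also swallows a FileNotFoundException caused by exhausted descriptors
        }
    }

    // Models how CREATE_OR_APPEND is resolved once the existence check has answered.
    static OpenMode resolve(OpenMode requested, boolean indexExists) {
        if (requested == OpenMode.CREATE_OR_APPEND) {
            return indexExists ? OpenMode.APPEND : OpenMode.CREATE;
        }
        return requested;
    }

    public static void main(String[] args) {
        // Simulated condition: the file is there, but the process has no descriptors left.
        SegmentsOpener outOfHandles = () -> {
            throw new FileNotFoundException("segments_1 (Too many open files)");
        };
        boolean exists = naiveIndexExists(outOfHandles);
        // prints "CREATE": the existing index would be treated as absent
        System.out.println(resolve(OpenMode.CREATE_OR_APPEND, exists));
    }
}
```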

      1. LUCENE-4870.patch
        4 kB
        Simon Willnauer
      2. LUCENE-4870.patch
        5 kB
        Simon Willnauer
      3. LUCENE-4870.patch
        5 kB
        Simon Willnauer

        Activity

        Simon Willnauer added a comment -

        here is a testcase showing the issue

        Michael McCandless added a comment -

        Not good!

        Is there no reliable way to differentiate "file does not exist" from "I ran out of file handles" from the exception we get from trying to open RAF/FileChannel.open?

        Is File.exists() even trustworthy once you've run out of descriptors?

        Maybe ... we should fix SegmentInfos.read, where it calls dir.openInput, to throw a different exception (OpenSegmentsFailedIOException) if the open failed vs if a subsequent op (reading bytes from segments file, or opening the SegmentReaders) failed. This way we could catch this exception above and know that a segments file in fact exists yet we were unable to open it (and return "true" for indexExists in this case).
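
A rough sketch of that idea, in plain Java without any Lucene dependency. The exception name is the one proposed in the comment above and is not an existing Lucene class; `OpenFailureSketch`, `Step`, `readSegments`, and `indexExists` are hypothetical names. The sketch assumes the caller already obtained the segments file name from a directory listing, which is what would justify answering "true" when only the open step fails.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class OpenFailureSketch {
    /** Name taken from the proposal above; hypothetical, not part of Lucene. */
    static class OpenSegmentsFailedIOException extends IOException {
        OpenSegmentsFailedIOException(String message, IOException cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for one I/O step (opening the file, or reading its body). */
    interface Step {
        void run() throws IOException;
    }

    // Wraps only the open step, so callers can tell "could not open" from "could not read".
    static void readSegments(Step open, Step readBody) throws IOException {
        try {
            open.run();
        } catch (IOException e) {
            throw new OpenSegmentsFailedIOException("could not open segments file", e);
        }
        readBody.run();
    }

    static boolean indexExists(Step open, Step readBody) {
        try {
            readSegments(open, readBody);
            return true;
        } catch (OpenSegmentsFailedIOException e) {
            // Assumption: the file was seen in a directory listing, so it exists but is not openable.
            return true;
        } catch (IOException e) {
            return false; // failures after a successful open keep the previous answer
        }
    }

    public static void main(String[] args) {
        Step ok = () -> {};
        Step fails = () -> { throw new FileNotFoundException("Too many open files"); };
        System.out.println(indexExists(fails, ok)); // prints "true"
        System.out.println(indexExists(ok, fails)); // prints "false"
    }
}
```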

        Simon Willnauer added a comment -

        this affects 3.6 too

        Simon Willnauer added a comment -

        Mike, I agree we need to fix this. One way of looking at it: the fact that we already called the doBody method in SegmentInfos.read is an indicator that the file exists. If that read method still throws a FileNotFoundException, it's almost certain that we need to do something here. I think we should never just return true here; we should rather raise a real error, like a corrupt index exception or something like this?

        Simon Willnauer added a comment -

        In fact, I think Mike's suggestion is not all that bad. I think it is fair game to return true here if we can actually detect a false FileNotFoundException. Here is a patch to fix this issue.
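
One conceivable way to detect a "false" FileNotFoundException is sketched below; it is an illustration under stated assumptions, not the attached patch. The only Lucene fact it relies on is that commit files are named `segments_N`. `FalseFnfeSketch`, `Opener`, and this `indexExists` are hypothetical, and the directory listing is passed in as a plain array so the example runs without Lucene. A file that is deleted between listing and open would also land in the same branch, which a real fix would have to consider.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class FalseFnfeSketch {
    static final String SEGMENTS_PREFIX = "segments_"; // commit files are named segments_N

    /** Hypothetical stand-in for opening a file by name. */
    interface Opener {
        void open(String fileName) throws IOException;
    }

    static boolean indexExists(String[] listing, Opener opener) {
        String segmentsFile = null;
        for (String name : listing) {
            if (name.startsWith(SEGMENTS_PREFIX)) {
                segmentsFile = name;
            }
        }
        if (segmentsFile == null) {
            return false; // nothing that looks like a commit point was listed
        }
        try {
            opener.open(segmentsFile);
            return true;
        } catch (FileNotFoundException e) {
            // Listed but not openable: we may have hit a false FNFE due to
            // too many open files (LUCENE-4870), so do not report "no index".
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // other failures surface as errors
        }
    }

    public static void main(String[] args) {
        Opener outOfHandles = name -> {
            throw new FileNotFoundException(name + " (Too many open files)");
        };
        System.out.println(indexExists(new String[] {"_0.cfs", "segments_1"}, outOfHandles)); // prints "true"
        System.out.println(indexExists(new String[] {}, outOfHandles)); // prints "false"
    }
}
```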

        Michael McCandless added a comment -

        +1, thanks Simon!

        Can you reference this Jira in the comment, and e.g. say "we may have hit false FNFE due to too many open files"?

        Maybe drop the 1000 iters lower in the test? The test is a bit slowish ...

        Simon Willnauer added a comment -

        I moved the iterations to 10 (this really failed super quickly anyway) and referenced the issue in both the test and the fix. I think this is ready and I will commit shortly. I will also port this to 4.3, 4.2.1, and 3.6, since this might be worth a 3.6.3 release too. Thoughts?

        Uwe Schindler added a comment -

        I can do the 3.6.3 release if it is very urgent - it takes me a few minutes to set it up to build with Java 1.5 on my server (maybe an ES customer on 3.6 Lucene?).

        Simon Willnauer added a comment -

        I think we can resolve this. I already committed a fix for this to ES though. Unless anybody really needs this on 3.6 I am ok with just fixing on 4.2.1.

        Simon Willnauer added a comment -

        committed to 5.0 and ported to 4.3 and 4.2.1

        Uwe Schindler added a comment -

        Closed after release.

        Vassil Zorev added a comment -

        Hello, Lucene team,

        My team and I are currently running a quite old version of the lucene-core library - 3.0.0. As part of our internal memory optimization fixes, we are also planning to migrate to a newer version. Research so far leads me to think our best risk-benefit trade-off is Lucene 3.6.2. (We are planning to go to the 4.X branch - current development - but that would perhaps come with our own major version upgrade.) However, because of this issue, I would really like to know which version of the core library does not have it. 3.6.1 perhaps? Or 3.5.X?

        As I understand from your comments, you are not yet planning to provide a fix on the 3.6.X branch. Please advise which version we should migrate to where this bug is not present.

        Thank you very much!
        Vassil Zorev, web developer

        Simon Willnauer added a comment -

        This has been fixed in 4.2.1 - I'd recommend upgrading to the latest 4.x version, which is 4.4 at this point.

        simon


          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0
            Watchers:
            5
