Lucene - Core
LUCENE-2585

DirectoryReader.isCurrent might fail to see the segments file during concurrent index changes

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, Trunk
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I could reproduce the issue several times, but only by running long and stressful benchmarks; the high number of files is likely part of the scenario.
      All tests run on local disk, using ext3.

      Sample stacktrace:

      java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/home/sanne/infinispan-41/lucene-directory/tempIndexName: files:
      _2l3.frq _uz.fdt _1q4.fnm _1q0.fdx _4bc.fdt _v2.tis _4ll.fdx _2l8.tii _ux.fnm _3g7.fdx _4bb.tii _4bj.prx _uy.fdx _3g7.prx _2l7.frq _2la.fdt _3ge.nrm _2l6.prx 
      _1py.fdx _3g6.nrm _v0.prx _4bi.tii _2l2.tis _v2.fdx _2l3.nrm _2l8.fnm _4bg.tis _2la.tis _uu.fdx _3g6.fdx _1q3.frq _2la.frq _4bb.tis _3gb.tii _1pz.tis 
      _2lb.nrm _4lm.nrm _3g9.tii _v0.fdt _2l5.fnm _v2.prx _4ll.tii _4bd.nrm _2l7.fnm _2l4.nrm _1q2.tis _3gb.fdx _4bh.fdx _1pz.nrm _ux.fdx _ux.tii _1q6.nrm 
      _3gf.fdx _4lk.fdt _3gd.nrm _v3.fnm _3g8.prx _1q2.nrm _4bh.prx _1q0.frq _ux.fdt _1q7.fdt _4bb.fnm _4bf.nrm _4bc.nrm _3gb.fdt _4bh.fnm _2l5.tis 
      _1pz.fnm _1py.fnm _3gc.fnm _2l2.prx _2l4.frq _3gc.fdt _ux.tis _1q3.prx _2l7.fdx _4bj.nrm _4bj.fdx _4bi.tis _3g9.prx _1q4.prx _v3.fdt _1q3.fdx _2l9.fdt 
      _4bh.tis _3gb.nrm _v2.nrm _3gd.tii _2l7.nrm _2lb.tii _4lm.tis _3ga.fdx _1pz.fdt _3g7.fnm _2l3.fnm _4lk.fnm _uz.fnm _2l2.frq _4bd.fdx _1q2.fdt _3g7.tis 
      _4bi.frq _4bj.frq _2l7.prx _ux.prx _3gd.fnm _1q4.fdt _1q1.fdt _v1.fnm _1py.nrm _3gf.nrm _4be.fdt _1q3.tii _1q1.prx _2l3.fdt _4lk.frq _2l4.fdx _4bd.fnm 
      _uw.frq _3g8.fdx _2l6.tii _1q5.frq _1q5.tis _3g8.nrm _uw.nrm _v0.tii _v2.fdt _2l7.fdt _v0.tis _uy.tii _3ge.tii _v1.tii _3gb.tis _4lm.fdx _4bc.fnm _2lb.frq 
      _2l6.fnm _3g6.tii _3ge.prx _uu.frq _1pz.fdx _1q2.fnm _4bi.prx _3gc.frq _2l9.tis _3ge.fdt _uy.fdt _4ll.fnm _3gc.prx _1q7.tii _2l5.nrm _uy.nrm _uv.frq 
      _1q6.frq _4ba.tis _3g9.tis _4be.nrm _4bi.fnm _ux.frq _1q1.fnm _v0.fnm _2l4.fnm _4ba.fnm _4be.tis _uz.prx _1q6.fdx _uw.tii _2l6.nrm _1pz.prx _2l7.tis 
      _1q7.fdx _2l9.tii _4lk.tii _uz.frq _3g8.frq _4bb.prx _1q5.tii _1q5.prx _v2.frq _4bc.tii _1q7.prx _v2.tii _2lb.tis _4bi.fdt _uv.nrm _2l2.fnm _4bd.tii _1q7.tis 
      _4bg.fnm _3ga.frq _uu.fnm _2l9.fnm _3ga.fnm _uw.fnm _1pz.frq _1q1.fdx _3ge.fdx _2l3.prx _3ga.nrm _uv.fdt _4bb.nrm _1q7.fnm _uv.tis _3gb.fnm 
      _2l6.tis _1pz.tii _uy.fnm _3gf.fdt _3gc.nrm _4bf.tis _1q5.fnm _uu.tis _4bh.tii _2l5.fdt _1q6.tii _4bc.tis _3gc.tii _3g9.fnm _2l6.fdt _4bj.fnm _uu.tii _v3.frq 
      _3g9.fdx _v0.nrm _2l7.tii _1q0.fdt _3ge.fnm _4bf.fdt _1q6.prx _uz.nrm _4bi.fdx _3gf.fnm _4lm.frq _v0.fdx _4ba.fdt _1py.tii _4bf.tii _uw.fdx _2l5.frq 
      _3g9.nrm _v1.fdt _uw.fdt _4bd.frq _4bg.prx _3gd.tis _1q4.tis _2l9.nrm _2la.nrm _v3.tii _4bf.prx _1q1.nrm _4ba.tii _3gd.fdx _1q4.tii _4lm.tii _3ga.tis 
      _4bf.fnm write.lock _2l8.prx _2l8.fdt segments.gen _2lb.fnm _2l4.fdt _1q2.prx _4be.fnm _3gf.prx _2l6.fdx _3g6.fnm _4bb.fdt _4bd.tis _4lk.nrm _2l5.fdx 
      _2la.tii _4bd.prx _4ln.fnm _3gf.tis _4ba.nrm _v3.prx _uv.prx _1q3.fnm _3ga.tii _uz.tii _3g9.frq _v0.frq _3ge.tis _3g6.tis _4ln.prx _3g7.tii _3g8.fdt 
      _3g7.nrm _3ga.prx _2l2.fdx _2l8.fdx _4ba.prx _1py.frq _uz.fdx _2l3.tii _3g6.prx _v3.fdx _1q6.fdt _v1.nrm _2l2.tii _1q0.tis _4ba.fdx _4be.tii _4ba.frq 
      _4ll.fdt _4bh.nrm _4lm.fdt _1q7.frq _4lk.tis _4bc.frq _1q6.fnm _3g7.frq _uw.tis _3g8.tis _2l9.fdx _2l4.tii _1q4.fdx _4be.prx _1q3.nrm _1q0.tii _1q0.fnm 
      _v3.nrm _1py.tis _3g9.fdt _4bh.fdt _4ll.nrm _4lk.prx _3gd.prx _1q3.tis _1q2.tii _2l2.nrm _3gd.fdt _2l3.fdx _3g6.fdt _3gd.frq _1q1.tis _4bb.fdx _1q2.frq 
      _1q3.fdt _v1.tis _2l8.frq _3gc.fdx _1q1.frq _4bg.frq _4bb.frq _2la.fdx _2l9.frq _uy.tis _uy.prx _4bg.fdx _3gb.prx _uy.frq _1q2.fdx _4lm.prx _2la.prx 
      _2l4.prx _4bg.fdt _4be.frq _1q7.nrm _2l5.prx _4bf.frq _v1.prx _4bd.fdt _2l9.prx _1q6.tis _3g8.fnm _4ln.tis _2l3.tis _4bc.fdx _2lb.prx _3gb.frq _3gf.frq 
      _2la.fnm _3ga.fdt _uz.tis _4bg.nrm _uv.tii _4bg.tii _3g8.tii _4ll.frq _uv.fnm _2l8.tis _2l8.nrm _2l2.fdt _4bj.tis _4lk.fdx _uw.prx _4bc.prx _4bj.fdt _4be.fdx 
      _1q4.frq _uu.fdt _1q1.tii _2l5.tii _2lb.fdt _4bh.frq _3ge.frq _1py.prx _1q5.nrm _v1.fdx _3g7.fdt _4ln.fdt _1q4.nrm _1py.fdt _3gc.tis _4ll.prx _v3.tis _4bf.fdx 
      _1q5.fdx _1q0.prx _4bi.nrm _4ll.tis _2l4.tis _3gf.tii _v2.fnm _uu.nrm _1q0.nrm _4lm.fnm _uu.prx _2l6.frq _4ln.nrm _ux.nrm _3g6.frq _1q5.fdt _4bj.tii 
      _2lb.fdx _uv.fdx _v1.frq
              at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:634)
              at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:517)
              at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:306)
              at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:408)
              at org.apache.lucene.index.DirectoryReader.isCurrent(DirectoryReader.java:797)
              at org.apache.lucene.index.DirectoryReader.doReopenNoWriter(DirectoryReader.java:407)
              at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:386)
              at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:348)
              at org.infinispan.lucene.profiling.LuceneReaderThread.refreshIndexReader(LuceneReaderThread.java:79)
              at org.infinispan.lucene.profiling.LuceneReaderThread.testLoop(LuceneReaderThread.java:60)
              at org.infinispan.lucene.profiling.LuceneUserThread.run(LuceneUserThread.java:60)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:619)
      

        Issue Links

          Activity

          Sanne Grinovero added a comment -

          I'm going to see if I can contribute a patch myself, but I don't think I'll be able to provide a unit test.

          Yonik Seeley added a comment -

          Background: via irc we brainstormed that the most likely cause of the exception was that listing the files
          in a directory is probably not atomic (at the JVM level) - hence it's possible to miss the segments file in a rapidly changing index.
          The simplest fix would seem to be to retry the directory listing a few times if the segments file isn't found.
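          The retry idea above can be sketched roughly as follows (hypothetical helper names, not actual Lucene code): re-list the directory a bounded number of times when no segments_N file shows up, on the assumption that a single listing may have raced with a concurrent commit.

```java
// Hedged sketch of the proposed fix (hypothetical code, not Lucene's): if a
// directory listing contains no segments_N file, re-list a few times before
// giving up, since a single listing may race with a concurrent commit.
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RetryListing {

    // True if the listing shows at least one segments_N file.
    static boolean containsSegmentsFile(String[] listing) {
        for (String f : listing) {
            if (f.startsWith("segments_")) {
                return true;
            }
        }
        return false;
    }

    // 'lister' simulates successive Directory.listAll() calls; retry up to
    // maxRetries additional times when the segments file is missing.
    static boolean segmentsVisibleWithRetry(Iterator<String[]> lister, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries && lister.hasNext(); attempt++) {
            if (containsSegmentsFile(lister.next())) {
                return true;
            }
            // Missing segments file: probably raced with a commit, so retry.
        }
        return false;
    }

    public static void main(String[] args) {
        // First listing races with a commit and misses segments_N entirely;
        // the second listing sees the newly written segments_3.
        List<String[]> listings = Arrays.asList(
            new String[] {"_1.fdt", "_1.frq", "segments.gen"},
            new String[] {"_1.fdt", "_1.frq", "segments_3"});
        System.out.println(segmentsVisibleWithRetry(listings.iterator(), 3));
    }
}
```

          A real patch would of course need a small backoff between attempts and would still throw FileNotFoundException once the retries are exhausted.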

          Michael McCandless added a comment -

          Man it's hard to add a comment here. Had to scroll way to the right... silly Jira.

          The best guess here is that this is due to non-atomicity of listing a directory right? Ie, Lucene, in order to find the most recent segments_N file, lists the directory. But if, as the listing is happening, a commit is done from IndexWriter, writing a new segments_N+1 and removing the old one, it's possible that the directory listing would show no segments file.

          Lucene then falls back to reading segments.gen, but somehow this is also stale/unusable (probably because another commit kicked off after the dir listing and before we could read segments.gen).

          I'm not sure offhand how we can fix this...

          Can you describe the stress test you're running?
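          For illustration, the "find the most recent segments_N file by listing the directory" step described above can be sketched like this (assumed helper names, not Lucene's actual code). Each segments_N file encodes its generation N in base 36 and the highest generation wins; when the listing races with a commit, no candidate is found and the lookup falls through, matching the FileNotFoundException in the stack trace.

```java
// Illustrative sketch (assumed helper names, not Lucene's actual code) of how
// the newest commit point is chosen from a directory listing: each segments_N
// file encodes its generation N in base 36, and the highest generation wins.
public class LatestSegments {

    // Parse the generation out of a "segments_N" file name (N is base 36).
    static long generation(String fileName) {
        return Long.parseLong(fileName.substring("segments_".length()), 36);
    }

    // Scan a listing for the segments_N file with the highest generation;
    // returns null when no segments_N file is visible, which is exactly the
    // "no segments* file found" failure mode from the stack trace above.
    static String latestSegmentsFile(String[] listing) {
        String best = null;
        for (String f : listing) {
            if (f.startsWith("segments_")
                    && (best == null || generation(f) > generation(best))) {
                best = f;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "segments_a" is generation 10 (base 36), so it beats "segments_9".
        String[] listing = {"_2l3.frq", "segments.gen", "segments_9", "segments_a", "write.lock"};
        System.out.println(latestSegmentsFile(listing));
    }
}
```

          Note that "segments.gen" does not match the "segments_" prefix; it is the separate fallback file mentioned above.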

          Sanne Grinovero added a comment - - edited

          sure, the test is fully open source; the directory implementation based on Infinispan is hosted as a submodule of Infinispan:
          http://anonsvn.jboss.org/repos/infinispan/branches/4.1.x/lucene-directory/

          The test is
          org.infinispan.lucene.profiling.PerformanceCompareStressTest

          it is included in the default test suite but disabled in Maven's configuration, so you should run it manually
          mvn clean test -Dtest=PerformanceCompareStressTest
          (running it requires the jboss.org repositories to be enabled in maven settings)

          To describe it at a higher level: there are 5 index-reading threads using reopen() before each search, 2 threads writing to the index, and 1 additional thread acting as a coordinator and asserting that readers find what they expect to see in the index.
          Exactly the same test scenario is then applied in sequence to RAMDirectory (not having issues), NIOFSDirectory, and 4 differently configured Infinispan directories.
          Only the FSDirectory is affected by the issue, and it can never complete the full hour of stress test successfully, while all other implementations behave fine.

          IndexWriter is set to MaxMergeDocs(5000) and setUseCompoundFile(false); the issue is revealed both when using SerialMergeScheduler and when using the default merge scheduler.

          During the last execution the test managed to perform 22,192,006 searches and 26,875 writes before hitting the exceptional case.

          If you deem it useful I'd be happy to contribute a similar testcase to Lucene, but I assume you won't be excited about having such a long-running test. Open to ideas to build a simpler one.

          Sanne Grinovero added a comment -

          reformatted the description: all filenames were on the same line, making this page hard to use.

          Michael McCandless added a comment -

          Thanks for the details, Sanne! Your Infinispan directories sound interesting.

          http://fixunix.com/linux/356378-opendir-readdir-atomicity.html is relevant, assuming the JVM is using opendir/readdir on Linux (which I assume it is?).

          Basically POSIX makes no guarantee that opendir/readdir will see a "point in time" directory listing. Ie, file adds/deletes can be seen out of order, much like write operations in different threads in Java if you don't sync.

          So maybe we should add an additional retry cycle in the case that we don't find a segments file? (We already have various retries if we do see a segments file in the listing, but, we hit an IOExc when trying to load it).

          Sanne do you want to work out a patch?

          Sanne Grinovero added a comment -

          Hello, sorry for the late answer, for some reason I didn't see the notification.

          Sure I'm very interested in providing a patch for this.
          I have to say I was only able to reproduce this issue in synthetic benchmarks, so I thought it might be very unlikely in real-world scenarios, but now I have actually received reports of people having issues with this during real use cases, so I'll definitely take another look.

          Thanks for the pointers!

          Robert Muir added a comment -

          moving out... there is no patch

          Shai Erera added a comment -

          There is no patch, moving to 3.2

          Robert Muir added a comment -

          bulk move 3.2 -> 3.3

          Hoss Man added a comment -

          Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

          Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Ashutosh Deshpande added a comment -

          I seem to be hitting this issue in my application. The simplified use case: about 1000 new documents are processed simultaneously by multiple threads, and multiple threads may try to store/read the same document, but with different data. One thread searches, finds nothing, and updates the document; at the same time another thread searches for it before the update and ends up not finding it, thus creating a duplicate document with different data instead of a single merged document. This happens very frequently and we end up with over 50% duplicates at times. This causes an issue in search.

          Is there any fix available for this issue?

          Thanks and Regards,
          Ashutosh.


            People

            • Assignee:
              Unassigned
              Reporter:
              Sanne Grinovero
            • Votes:
              0
              Watchers:
              2
