Lucene - Core
LUCENE-565

Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      Today, applications have to open/close an IndexWriter and open/close an
      IndexReader directly or indirectly (via IndexModifier) in order to handle a
      mix of inserts and deletes. This performs well when inserts and deletes
      come in fairly large batches. However, the performance can degrade
      dramatically when inserts and deletes are interleaved in small batches.
      This is because the ramDirectory is flushed to disk whenever an IndexWriter
      is closed, causing a lot of small segments to be created on disk, which
      eventually need to be merged.

      We would like to propose a small API change to eliminate this problem. We
      are aware that this kind of change has come up in discussions before (see
      http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049).
      The difference this time is that we have implemented the change and
      tested its performance, as described below.

      API Changes
      -----------
      We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
      Using this method, inserts and deletes can be interleaved using the same
      IndexWriter.

      Note that, with this change, it would be very easy to add another method to
      IndexWriter for updating documents, allowing applications to avoid a
      separate delete and insert when updating a document.
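
      For illustration, here is a minimal sketch of how an application might use
      the proposed API (the makeDoc() helper and the field layout are our
      hypothetical illustration, not part of the patch; the "update" shown is
      simply a delete followed by an insert):

      import org.apache.lucene.analysis.WhitespaceAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.Term;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.RAMDirectory;

      public class InterleavedUpdates {
        public static void main(String[] args) throws Exception {
          Directory dir = new RAMDirectory();
          IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);

          // Inserts and deletes are interleaved on the same writer;
          // no IndexReader needs to be opened for the deletes.
          writer.addDocument(makeDoc("1", "old contents"));
          writer.deleteDocuments(new Term("id", "1"));      // proposed method
          writer.addDocument(makeDoc("1", "new contents")); // "update" = delete + insert

          writer.close(); // buffered deletes are applied when the writer flushes
        }

        // Hypothetical helper: a document keyed by a unique "id" field.
        static Document makeDoc(String id, String text) {
          Document doc = new Document();
          doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
          doc.add(new Field("text", text, Field.Store.NO, Field.Index.TOKENIZED));
          return doc;
        }
      }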

      Also note that this change can co-exist with the existing APIs for deleting
      documents using an IndexReader. But if our proposal is accepted, we think
      those APIs should probably be deprecated.

      Coding Changes
      --------------
      Coding changes are localized to IndexWriter. Internally, the new
      deleteDocuments() method works by buffering the terms to be deleted.
      Deletes are deferred until the ramDirectory is flushed to disk, either
      because it becomes full or because the IndexWriter is closed. Using Java
      synchronization, care is taken to ensure that an interleaved sequence of
      inserts and deletes for the same document is properly serialized.
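
      To make the buffering concrete, here is an illustrative sketch of one way
      such bookkeeping could look (the names follow the discussion later in this
      issue; this is our illustration, not the patch's actual code):

      // Illustrative sketch of the bookkeeping inside IndexWriter
      // (java.util.HashMap assumed imported; not the patch's actual code).
      private final HashMap bufferedDeleteTerms = new HashMap(); // Term -> Integer

      // Buffer a delete term instead of applying it immediately. The value
      // records how many ram documents precede this delete, so that at flush
      // time the term is applied only to documents buffered before it.
      public synchronized void deleteDocuments(Term term) {
        bufferedDeleteTerms.put(term, new Integer(ramSegmentInfos.size()));
      }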

      We have attached a modified version of IndexWriter in Release 1.9.1 with
      these changes. Only a few hundred lines of coding changes are needed. All
      changes are commented by "CHANGE". We have also attached a modified version
      of an example from Chapter 2.2 of Lucene in Action.

      Performance Results
      -------------------
      To test the performance of our proposed changes, we ran some experiments
      using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz
      Intel Xeon server running Linux. The disk storage was configured as a
      RAID0 array with 5 drives. Before indexes were built, the input documents
      were parsed to remove the HTML (i.e., only the text was indexed). This was
      done to minimize the impact of parsing on performance. A simple
      WhitespaceAnalyzer was used during index build.

      We experimented with three workloads:

      • Insert only. 1.6M documents were inserted and the final
        index size was 2.3GB.
      • Insert/delete (big batches). The same documents were
        inserted, but 25% were deleted. 1000 documents were
        deleted for every 4000 inserted.
      • Insert/delete (small batches). In this case, 5 documents
        were deleted for every 20 inserted.

                                         current        current         new
      Workload                           IndexWriter    IndexModifier   IndexWriter
      ------------------------------------------------------------------------------
      Insert only                        116 min        119 min         116 min
      Insert/delete (big batches)        –              135 min         125 min
      Insert/delete (small batches)      –              338 min         134 min

      As the experiments show, with the proposed changes, indexing time improved
      by 60% (338 min down to 134 min) when inserts and deletes were interleaved
      in small batches.

      Regards,
      Ning

      Ning Li
      Search Technologies
      IBM Almaden Research Center
      650 Harry Road
      San Jose, CA 95120

      1. LUCENE-565.Feb2007.patch
        60 kB
        Michael McCandless
      2. NewIndexModifier.Jan2007.take3.patch
        33 kB
        Michael McCandless
      3. NewIndexModifier.Jan2007.take2.patch
        33 kB
        Michael McCandless
      4. NewIndexModifier.Jan2007.patch
        33 kB
        Ning Li
      5. NewIndexModifier.Sept21.patch
        18 kB
        Ning Li
      6. perf-test-res2.JPG
        103 kB
        Doron Cohen
      7. TestBufferedDeletesPerf.java
        10 kB
        Doron Cohen
      8. perf-test-res.JPG
        73 kB
        Doron Cohen
      9. perfres.log
        3 kB
        Doron Cohen

          Activity

          Doug Cutting added a comment -

          Can you please attach diffs rather than complete files? The diffs should not contain CHANGE comments. To generate diffs, check Lucene out of Subversion, make your changes, then, from the Lucene trunk, run something like 'svn diff > my.patch'. New files should first be added with 'svn add' so that they're included in the diff. Thanks!

          Ning Li added a comment -

          Here is the diff file of IndexWriter.java.

          Otis Gospodnetic added a comment -

          I took a look at the patch and it looks good to me (has anyone else had a look?).
          Unfortunately, I couldn't get the patch to apply:

          $ patch -F3 < IndexWriter.patch
          (Stripping trailing CRs from patch.)
          patching file IndexWriter.java
          Hunk #1 succeeded at 58 with fuzz 1.
          Hunk #2 succeeded at 112 (offset 2 lines).
          Hunk #4 succeeded at 504 (offset 33 lines).
          Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines).
          missing header for unified diff at line 259 of patch
          (Stripping trailing CRs from patch.)
          can't find file to patch at input line 259
          Perhaps you should have used the -p or --strip option?
          The text leading up to this was:
          ...
          ...
          ...
          File to patch: IndexWriter.java
          patching file IndexWriter.java
          Hunk #1 FAILED at 802.
          Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines).
          1 out of 2 hunks FAILED – saving rejects to file IndexWriter.java.rej

          Would it be possible for you to regenerate the patch against IndexWriter in HEAD?

          Also, I noticed ^Ms in the patch, but I can take care of those easily (dos2unix).

          Finally, I noticed in 2-3 places that the simple logging via the "infoStream" variable was removed, for example:

          • if (infoStream != null) infoStream.print("merging segments");

          Perhaps this was just an oversight?

          Looking forward to the new patch. Thanks!

          Ning Li added a comment -

          For an overview of my changes, I'll repeat some of what I said in
          my earlier e-mail (see http://www.gossamer-threads.com/lists/lucene/java-dev/35143),
          then add more detail about specific coding changes.

          Overview
          --------
          Today, applications have to open/close an IndexWriter and
          open/close an IndexReader directly or indirectly (via IndexModifier)
          in order to handle a mix of inserts and deletes. This performs well
          when inserts and deletes come in fairly large batches. However, the
          performance can degrade dramatically when inserts and deletes are
          interleaved in small batches. This is because the ramDirectory is
          flushed to disk whenever an IndexWriter is closed, causing a lot of
          small segments to be created on disk, which eventually need to be
          merged.

          API Changes
          -----------
          We propose adding a "deleteDocuments(Term term)" method to
          IndexWriter. Using this method, inserts and deletes can be
          interleaved using the same IndexWriter.

          Coding Changes
          --------------
          Coding changes are localized to IndexWriter. Internally, the new
          deleteDocuments() method works by buffering the terms to be deleted.
          Deletes are deferred until the ramDirectory is flushed to disk,
          either because it becomes full or because the IndexWriter is closed.
          Using Java synchronization, care is taken to ensure that an
          interleaved sequence of inserts and deletes for the same document
          are properly serialized.

          For simplicity of explanation, let's assume the index resides in a
          disk-based directory.

          Changes to the IndexWriter variables:

          • segmentInfos used to store the info of all segments (on disk
            or in ram). Now it only stores the info of segments on disk.
          • ramSegmentInfos is a new variable which stores the info of just
            ram segments.
          • bufferedDeleteTerms is a new variable which buffers delete terms
            before they are applied.
          • maxBufferedDeleteTerms is similar to maxBufferedDocs. It controls
            the max number of delete terms that can be buffered before they
            must be flushed to disk.

          Changes to IndexWriter methods:

          • addDocument()
            The info of the new ram segment is added to ramSegmentInfos.
          • deleteDocuments(), batchDeleteDocuments()
            The terms are added to bufferedDeleteTerms. bufferedDeleteTerms
            also records the current number of documents buffered in ram, so
            the delete terms can be applied to ram segments as well as
            the segments on disk.
          • flushRamSegments() (a sketch follows this list)
            Step 1: Apply buffered delete terms to all the segments on disk.
            Step 2: Merge all the ram segments into one segment on disk.
            Step 3: Apply buffered delete terms to the new segment appropriately,
            so that a delete term is only applied to the documents
            buffered before it, but not to those buffered after it.
            Step 4: Clean up and commit the change to the index (both the new
            segment and the .del files, if applicable).
          • maybeMergeSegments()
            Before, a flush would be triggered only if enough documents were
            buffered. Now a flush is triggered if enough documents are
            buffered OR if enough delete terms are buffered.
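
          As a rough illustration of the flushRamSegments() sequence above
          (pseudocode; applyDeletes(), mergeRamSegments(), applyDeletesPartially(),
          and commitChanges() are hypothetical helpers, not the patch's code):

          // Illustrative pseudocode of the four flush steps.
          private synchronized void flushRamSegments() throws IOException {
            // Step 1: apply buffered delete terms to the existing disk segments.
            applyDeletes(segmentInfos, bufferedDeleteTerms);

            // Step 2: merge all ram segments into one new segment on disk.
            SegmentInfo newSegment = mergeRamSegments(ramSegmentInfos);

            // Step 3: apply each buffered term to the new segment, but only to
            // the documents that were buffered before that delete was issued.
            applyDeletesPartially(newSegment, bufferedDeleteTerms);

            // Step 4: commit the new segment (and any .del files) to the index,
            // then clear the buffers.
            commitChanges();
            bufferedDeleteTerms.clear();
            ramSegmentInfos.clear();
          }
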
          Otis Gospodnetic added a comment -

          Thanks for all the information about coding changes; that makes it easier to understand the diff.
          Ideally this will become comments in the new diff, which I can commit.

          Yonik mentioned this in email: it does sound like a better place for this might be a higher-level class. IndexWriter would really not be just a writer/appender once delete functionality is added to it, even if it's the IndexReaders behind the scenes doing the work. So if you are going to be redoing the patch, consider this.

          Perhaps IndexModifier methods should be deprecated and it should get a new/your API?

          Ning Li added a comment -

          Hi Otis,

          I've attached two patch files:

          • IndexWriter.July09.patch is an updated version of the old patch.
          • NewIndexModifier.July09.patch makes minimal changes to IndexWriter and puts the new functionality in a new class called NewIndexModifier. I didn't name it IndexModifier because the two are unrelated and I don't want a diff of the two.

          All unit tests succeeded except the following one:
          [junit] Testcase: testIndex(org.apache.lucene.index.TestIndexModifier): FAILED
          [junit] expected:<3> but was:<4>
          [junit] junit.framework.AssertionFailedError: expected:<3> but was:<4>
          [junit] at org.apache.lucene.index.TestIndexModifier.testIndex(TestIndexModifier.java:67)

          However, the unit test has a problem, not the patch: IndexWriter's docCount() does not tell the actual number of documents in an index, only IndexReader's numDocs() does. For example, in a similar test below, where 10 documents are added, then 1 deleted, then 2 added, the last call to docCount() returns 12, not 11, with or without the patch.

          public void testIndexSimple() throws IOException {
            Directory ramDir = new RAMDirectory();
            IndexModifier i = new IndexModifier(ramDir, new StandardAnalyzer(), true);
            // add 10 documents initially
            for (int count = 0; count < 10; count++) {
              i.addDocument(getDoc());
            }
            i.flush();
            i.optimize();
            assertEquals(10, i.docCount());
            i.deleteDocument(0);
            i.flush();
            assertEquals(9, i.docCount());
            i.addDocument(getDoc());
            i.addDocument(getDoc());
            i.flush();
            assertEquals(12, i.docCount());
          }

          The reason for the docCount() difference in the unit test (which does not affect the correctness of the patch) is that flushRamSegments() in the patch merges all and only the segments in ram and writes them to disk, whereas the original flushRamSegments() merges not only the segments in ram but sometimes also one segment from disk (see in that function the comment "// add one FS segment?").

          Regards,
          Ning

          Ning Li added a comment -

          Hopefully, third time's a charm.

          I rewrote IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass. NewIndexModifier is such a subclass that supports delete-by-term.

          Here is an overview of the changes:

          Changes to IndexWriter
          Changes to IndexWriter variables:

          • segmentInfos used to store the info of all segments (on disk or in ram). Now it
            only stores the info of segments on disk.
          • ramSegmentInfos is a new variable which stores the info of just ram segments.

          Changes to IndexWriter methods:
          • addDocument()
            The info of the new ram segment is added to ramSegmentInfos.
          • maybeMergeSegments()
            toFlushRamSegments() is called at the beginning to decide whether a flush should take place.
          • flushRamSegments()
            doAfterFlushRamSegments() is called after all ram segments are merged and flushed to disk.

          NewIndexModifier
          New variables:

          • bufferedDeleteTerms is a new variable which buffers delete terms
            before they are applied.
          • maxBufferedDeleteTerms is similar to maxBufferedDocs. It controls
            the max number of delete terms that can be buffered before they
            must be flushed to disk.

          Overloaded/new methods:
          • deleteDocuments(), batchDeleteDocuments()
            The terms are added to bufferedDeleteTerms. bufferedDeleteTerms
            also records the current number of documents buffered in ram,
            so the delete terms can be applied to ram segments as well as
            the segments on disk.
          • toFlushRamSegments()
            In IndexWriter, a flush would be triggered only if enough documents were
            buffered. Now a flush is triggered if enough documents are
            buffered OR if enough delete terms are buffered.
          • doAfterFlushRamSegments() (see the sketch after this list)
            Step 1: Apply buffered delete terms to all the segments on disk.
            Step 2: Apply buffered delete terms to the new segment appropriately,
            so that a delete term is only applied to the documents
            buffered before it, but not to those buffered after it.
            Step 3: Clean up the buffered delete terms.
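
          A minimal sketch of this subclassing pattern (signatures of
          toFlushRamSegments()/doAfterFlushRamSegments() follow the description
          above; the bodies, the default value, and getRamSegmentCount() are
          placeholders, not the patch's code):

          import java.io.IOException;
          import java.util.HashMap;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.store.Directory;

          // Simplified sketch of NewIndexModifier's extension points.
          public class NewIndexModifier extends IndexWriter {
            private HashMap bufferedDeleteTerms = new HashMap(); // Term -> Integer
            private int maxBufferedDeleteTerms = 1000; // hypothetical default

            public NewIndexModifier(Directory dir, Analyzer a, boolean create)
                throws IOException {
              super(dir, a, create);
            }

            // Buffer a delete term, remembering how many ram docs precede it
            // (getRamSegmentCount() is a hypothetical accessor).
            public synchronized void deleteDocuments(Term term) {
              bufferedDeleteTerms.put(term, new Integer(getRamSegmentCount()));
            }

            // Extension point 1: also flush when the delete buffer is full.
            protected boolean toFlushRamSegments() {
              return super.toFlushRamSegments()
                  || bufferedDeleteTerms.size() >= maxBufferedDeleteTerms;
            }

            // Extension point 2: apply the buffered deletes after the ram
            // segments have been merged and written to disk (steps 1-3 above).
            protected void doAfterFlushRamSegments() throws IOException {
              // ... apply bufferedDeleteTerms to old and new segments ...
              bufferedDeleteTerms.clear();
            }

            private int getRamSegmentCount() {
              return 0; // placeholder; the patch uses ramSegmentInfos.size()
            }
          }
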
          Doron Cohen added a comment -

          I tried out this patch (July18), and have a few comments...

          First, it is nice to be able to add/remove documents with no need to care about switching between readers and writers, and without worrying about performance issues as a result of that switching. I did not test for performance yet.

          This post is quite long, so here is an outline...
          (1) Compile error in test code
          (2) Failing tests - is this patch actually fixing a bug in current flushRamSegments()?
          (3) Additional tests I ran
          (4) Javadocs remarks
          (5) deleteDocument(int doc) not implemented
          (6) flush() not implemented
          (7) Method name - batchDeleteDocuments(Term[])
          (8) Class name and placement + What's Next for this patch

          ------------ (1) Compile error in test code
          The new TestWriterDelete does not reflect the recent name change from IndexWriter to NewIndexModifier. Easily fixed by renaming accordingly in that file.

          ------------ (2) Failing tests - does this patch also fix a bug in current flushRamSegments()?
          "ant test" has one failure: TestIndexModifier.testIndex().
          This is the same issue that Ning described above. However, I think it exposes a bug in the current flushRamSegments(): when an index with 1 segment on disk that has 2 documents (one of which is deleted) and 1 segment in memory is closed, this method decides to merge the two segments into one - prematurely. This wrong behavior (if I understand things correctly) is - by "mistake" - causing TestIndexModifier.testIndex() to pass in the current implementation of flushRamSegments(). But this comes with the cost of too many merges. If one is interleaving adds and deletes, this bug would become costly. I will post a separate question on this to the dev forum, to discuss whether this is indeed a bug.

          ------------ (3) Additional tests I ran
          I wanted to verify that all existing functions (well, at least the tested ones) are working with the new class (NewIndexModifier). So I temporarily renamed the existing IndexWriter to IndexWriter0, and renamed NewIndexModifier to IndexWriter (now extending IndexWriter0). For compiling, I also had to temporarily modify args from IndexWriter to IndexWriter0 in 3 classes - DocumentWriter, SegmentMerger, and also from NewIndexModifier to IndexWriter in the new TestWriteDelete. (Note again: these modifications are temporary, just for the sake of testing this new class as if it was the new IndexWriter, which it is not.) Now all the tests were using the new class instead of the original IndexWriter.

          All tests passed, except for TestIndexModifier.testIndex() - the same failure as above - so no problem was detected in the new class.

          ------------ (4) Javadocs Remarks
          Current Javadocs for the new class focus on changes to the implementation. I think this description of implementation changes should be made regular Java comments (for developers), and instead a shorter javadoc should be added that describes the API for users, and the behavioral implications of buffering deletes.

          ------------ (5) deleteDocument(int doc) not implemented
          Original IndexModifier has a delete(int doc); the new class doesn't. At first this seems ok, since internal doc IDs are not accessible through an index writer (unlike an index reader). But IndexModifier also does not provide access to doc-ids. So why was delete-by-id enabled in IndexModifier? Perhaps there's a good reason for it that I fail to see - if so, it should probably be added to the new class as well. Adding this is required if the new class is eventually to replace the implementation of the current IndexModifier.

          ------------ (6) flush() not implemented
          Original IndexModifier has a flush() method, allowing callers to commit any pending changes. I think it would be nice to have this feature here as well, for forcing any pending changes (without the caller having to modify the max-buffered value). This would allow more control when using this class. Again, adding this is required if the new class is eventually to replace the implementation of the current IndexModifier.

          ------------ (7) Method name - batchDeleteDocuments(Term[])
          I would prefer it to be called deleteDocuments(Term[]), and let Java decide which overload to call. The main reason is that developers expect methods with similar semantics to be named similarly, especially when using IDEs like Eclipse, where users type "i.del" and the IDE lets them select from all the methods that start with "del".
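
          In other words, the batch form would be a plain overload (a sketch;
          bufferDeleteTerm() and maybeFlushRamSegments() are hypothetical
          helpers, not the patch's code):

          // Overloads resolved by argument type, instead of a separate name.
          public synchronized void deleteDocuments(Term term) throws IOException {
            deleteDocuments(new Term[] { term }); // delegate to the batch form
          }

          public synchronized void deleteDocuments(Term[] terms) throws IOException {
            for (int i = 0; i < terms.length; i++) {
              bufferDeleteTerm(terms[i]);   // hypothetical helper
            }
            maybeFlushRamSegments();        // hypothetical helper
          }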

          ------------ (8) Class name and placement + What's Next for this patch
          A performance test should be added for this new class. Also, I did not code-review the actual code changes to IndexWriter or the code of NewIndexModifier itself.

          It seems to me that this class would be very useful for users, either as a new class or as a replacement for the current implementation of IndexModifier. The latter would be possible only if the 2 missing methods mentioned above are added. In that case, the "immediate delete" behavior of the current IndexModifier should be achievable by users by setting maxBufferedDeleteTerms to 1.

          One disadvantage of this class vs. the current IndexModifier is in adding access to further methods of IndexReader. With the current IndexModifier this is very simple (though not always efficient) - just follow the code template in existing methods, i.e. close writer/reader and open reader/writer as required. With the new class, exposing further methods of IndexReader would be more of a challenge. Perhaps a multiReader over all segment readers could do it. I am not sure to what extent this should be a consideration, so I'm just bringing it up.

          So, if this class replaces IndexModifier - fine. If not, how about calling it BufferedIndexWriter?

          • Doron
          Jason Rutherglen added a comment -

          I tested just the IndexWriter from this code base; it does not seem to work. NewIndexModifier does work. I simply used IndexWriter to create several documents and then searched for them. Nothing came back, even though it seems something was written to disk.

          Ning Li added a comment -

          > Yes I am including this patch as it is very useful for increasing
          > the efficiency of updates as you described. I will be conducting
          > more tests and will post any results. Yes a patch for IndexWriter
          > will be useful so that the entirety of this build will work.
          > Thanks!

          I've attached a patch that works with the current code. The
          implementation of IndexWriter and NewIndexModifier is the same as
          the last patch. I removed the "singleDocSegmentsCount" optimization
          from this patch since my IndexWriter checks singleDocSegmentsCount
          by simply calling ramSegmentInfos.size().

          This patch has evolved with the help of many good discussions
          (thanks!) since it came out in May. Here is the current state of
          the patch:

          • This patch aims at enabling users to do inserts and general
            deletes (delete-by-term, and later delete-by-query) without
            switching between writers and readers.
          • The goal is achieved by rewriting IndexWriter in such a way
            that semantically it's the same as before, but it provides
            extension points so that delete-by-term, delete-by-query, and
            more functionalities can be easily supported in a subclass.
          • NewIndexModifier extends IndexWriter and supports delete-by-term
            by simply overriding two methods: toFlushRamSegments(), which
            decides if a flush should happen, and doAfterFlushRamSegments(),
            which does the proper work after a flush is done.

          Suggestions are welcome! Especially those that may help it get
          committed.

          Ning Li added a comment -

          Doron, thank you very much for the review! I want to briefly comment
          on one of your comments:

          > (5) deleteDocument(int doc) not implemented

          I deliberately left that one out. This is because document ids
          change as documents are deleted and segments are merged. Users
          of IndexModifier don't know exactly when segments are merged and
          thus when ids change. So I don't think it should be supported
          in IndexModifier at all.

          Jason Rutherglen added a comment -

          This IndexWriter seems to work. Thanks. Great work!

          Doron Cohen added a comment -

          I ran a performance test for interleaved adds and removes, comparing IndexModifier and NewIndexModifier.

          A few setups were tested, with a few combinations of "consecutive adds before a delete takes place", maxBufferedDocs, and "number of total test iterations", where each iteration does the consecutive adds and then does the deletes.

          Each setup ran in this order - original IndexModifier, new one, original, new one - and the best time of the two runs was used.

          Results indicate that NewIndexModifier is far faster for most setups.

          Attached are the performance test, the performance results, and the log of the run. The performance test is written as a JUnit test, and it fails in case the original IndexModifier is faster than the new one by more than 1 second (a difference smaller than 1 sec is considered noise).

          Test was run on XP (SP1) with IBM JDK 1.5.

          Test was first failing with "access denied" errors due to what seems to be an XP issue. So in order to run this test on XP (and probably other Windows platforms) the patch from http://issues.apache.org/jira/browse/LUCENE-665 should be applied first.

          It is interesting to notice that, in addition to the performance gain, NewIndexModifier seems less sensitive to "access denied" XP problems, because it closes/reopens readers and writers less frequently. Indeed, at least in my runs, these errors had to be bypassed (by the "retry" patch) only for the current index modifier.

          • Doron
          Jason Rutherglen added a comment -

          It seems this writer works, but then something mysterious happens to the index and the searcher can no longer read it. I am using this in conjunction with Solr. The index files look ok, but a search will return nothing. I have seen this repeatedly over about a week's time.

          Doron Cohen added a comment -

          Is it that results that were returned are suddenly (say after updates) not returned anymore (indicating something bad happened to existing index)?

          Or is it that the search does not reflect recent changes?

          I don't remember how often Solr closes and re-opens the writer/modifier... with this patch a delete does not immediately cause a "flush to disk" - so flushes are controlled by closing the NewIndexModifier (and re-opening, since there is no flush() method) and by the limits for max-buffered-docs and max-buffered-deletes. If this seems relevant to your case, what limits are in effect?
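
          For concreteness, a small sketch of that visibility behavior (assuming
          a Directory dir and Analyzer analyzer are in scope, and that
          NewIndexModifier's constructor mirrors IndexWriter's):

          // Buffered deletes are not visible to searchers until a flush,
          // which in this sketch happens only at close().
          NewIndexModifier modifier = new NewIndexModifier(dir, analyzer, false);
          modifier.deleteDocuments(new Term("id", "42"));

          IndexReader before = IndexReader.open(dir);
          // The delete is only buffered: before.numDocs() is unchanged here.
          before.close();

          modifier.close(); // flush - the buffered delete is applied now

          IndexReader after = IndexReader.open(dir);
          // after.numDocs() now reflects the delete.
          after.close();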

          Jason Rutherglen added a comment -

          I started to flush the deletes after making them, which opens a new NewIndexModifier afterwards. I still see the same thing. I am starting off by deleting all documents by matching on a Term that all of them have, committing (reopening), then performing a batch of addDocuments. Then when a search is executed nothing is returned, and after an optimize the index goes down to 1K. It seems like some peculiarity in NewIndexModifier: the new documents are deleted even after they are added.

          Doron Cohen added a comment -

          Just to make sure on the scenario - are you
          (1) using NewIndexModifier at all, or
          (2) just letting Solr use this IndexWriter (with the code changes introduced to enable NewIndexModifier) instead of Lucene's svn-head (or a certain release) IndexModifier?

          As is, Solr would not use NewIndexModifier or IndexModifier at all.

          For case (2) above, the buffered-deletes logic is not in effect at all.

          I wonder if it is possible to re-create this with a simple stand-alone Lucene (test) program rather than with Solr - it would be easier to analyze.

          Jason Rutherglen added a comment -

          Good points... I've actually used both NewIndexModifier and the parent. I've tried writing a new UpdateHandler, and incorporating the new IndexWriter into DirectUpdateHandler2. I will create a non-Solr reproduction of the issue. I guess it has something to do with the doc ids being reused, so the new documents that are added are also marked as deleted, as the number of documents would match almost exactly after the rebuild. I am not an expert in regards to that aspect of Lucene.

          Jason Rutherglen added a comment -

          Having trouble reproducing this - probably something in the other code. Thanks for the help and the patch; I feel more confident in it now.

          Jason Rutherglen added a comment -

          I figured out the problem: the Solr DirectUpdateHandler2 expects to delete only a certain number of documents - specifically, in some cases, the oldest first - by using TermDocs and deleting by doc id. NewIndexModifier, however, deletes at the level of the SegmentReader. Any good way to do this?

          Doron Cohen added a comment -

          Updated performance test results - perf-test-res2.JPG - on average, the new code is 9 times faster!

          What has changed? In the previous test I forgot to set max-buffered-deletes.

          After fixing that, I removed the test cases with a max-buffer of 5,000 and up, because they consumed too much memory, and added more practical (I think) cases of 2000 and 3000.

          Here is a textual summary of the data in the attached image:

          max buf add/del        10      10     100    1000    2000    3000
          iterations              1      10     100     100     200     300
          adds/iteration         10      10      10      10      10      10
          dels/iteration          5       5       5       5       5       5
          orig time (sec)      0.13    0.86    9.57    8.88   22.74   44.01
          new time (sec)       0.20    0.95    1.74    1.30    2.16    3.08
          Improvement (sec)   -0.07   -0.09    7.83    7.58   20.58   40.94
          Improvement (%)      -55%    -11%     82%     85%     90%     93%

          Note: for the first two cases the new code is slower, by 55% and 11%, but these are very short test cases - the absolute difference here is less than 100 ms, compared to the other cases, where the difference is measured in seconds and tens of seconds.

          Yonik Seeley added a comment -

          > the new code is 9 times faster!

          That's a bit apples-and-oranges. I don't think people use IndexModifier when they need performance... they buffer things and do them in batches.

          IMO, the performance of existing code using IndexWriter is more important, and that is what I would be interested in. Say, indexing 10000 documents to a RAMDirectory with default settings (no deletes at all). I haven't had a chance to review the new code, so I don't know if it's something to worry about or not.
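
          For reference, a minimal sketch of the add-only measurement Yonik
          describes (standard Lucene API, imports assumed; the document contents
          are arbitrary placeholders):

          // Time 10000 plain adds into a RAMDirectory with default settings.
          Directory dir = new RAMDirectory();
          IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
          long start = System.currentTimeMillis();
          for (int i = 0; i < 10000; i++) {
            Document doc = new Document();
            doc.add(new Field("text", "body " + i, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
          }
          writer.close();
          System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));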

          Yonik Seeley added a comment -

          I believe this patch probably also changes the merge behavior.
          I think we need to discuss what exactly the new merge behavior is, if it's OK, what we think the index invariants should be (no more than x segments of y size, etc), and I'd like to see some code to test those invariants.

          Keep in mind the difficulty of getting the last IndexWriter patch correct (and that was a very minor patch to keep track of the number of buffered docs!)

          Doron Cohen added a comment -

          I agree - I also suspected it might change the merge behavior (and I also recalled the repeated trials it took to get that simple IndexWriter buffered-docs patch correct...).

          Guess I just wanted to get a feel for whether there is interest in including this patch before I delve into it too much - and the perf test was meant to see for myself if it really helps. I was a bit surprised that it is 9 times faster in an interleaved add/delete scenario. This by itself now justifies delving into this patch and analyzing the merge behavior as you suggest - will do. I think ideally this patch should not modify the merge behavior.

          About the test - I was trying to test what I thought is a realistic use scenario (max-buf, etc.). I have a fixed version of the perf test that is easier to modify for different scenarios - I can upload it here if there is interest.

          Ning Li added a comment -

          This patch features the new, more robust merge policy. A reference for the new policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147

          • The patch passes all the tests except one in TestIndexModifier (see an earlier comment on this issue).
          • Since that test itself has a problem, it is fixed (a one-line change) and the patch passes the fixed test.
          • A new test called TestIndexWriterMergePolicy is included, which shows the robustness of the new merge policy.

          The following is a detailed description of the new merge policy and its properties.

          Overview of merge policy:

          A flush is triggered either by close() or by the number of ram segments
          reaching maxBufferedDocs. After a disk segment is created by the flush,
          further merges may be triggered.

          lowerBound and upperBound set the limits on the doc count of a segment
          that may be merged. Initially, lowerBound is set to 0 and upperBound
          to maxBufferedDocs. Starting from the rightmost* segment whose doc count
          is > lowerBound and <= upperBound, count the number of consecutive segments
          whose doc count is <= upperBound.

          Case 1: number of worthy segments < mergeFactor, no merge, done.
          Case 2: number of worthy segments == mergeFactor, merge these segments.
          If the doc count of the merged segment <= upperBound, done.
          Otherwise, set lowerBound to upperBound, and multiply upperBound
          by mergeFactor, go through the process again.
          Case 3: number of worthy segments > mergeFactor (in the case mergeFactor
          M changes), merge the leftmost* M segments. If the doc count of
          the merged segment <= upperBound, consider the merged segment for
          further merges on this same level. Merge the now leftmost* M
          segments, and so on, until number of worthy segments < mergeFactor.
          If the doc count of all the merged segments <= upperBound, done.
          Otherwise, set lowerBound to upperBound, and multiply upperBound
          by mergeFactor, go through the process again.
          Note that case 2 can be considered as a special case of case 3.

          This merge policy guarantees two invariants if M does not change and
          segment doc count is not reaching maxMergeDocs. With B for maxBufferedDocs,
          define the level of a segment with doc count n as f(n) = ceil(log_M(ceil(n/B))):
          1: If i (left*) and i+1 (right*) are two consecutive segments of doc
          counts x and y, then f(x) >= f(y).
          2: The number of committed segments on the same level f(n) is <= M.
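
          For concreteness, here is a small hypothetical helper that computes the level function f(n) and checks the two invariants above. This is an illustration only, not code from the patch:

          import java.util.List;

          // Illustrative helper: level f(n) = ceil(log_M(ceil(n/B))) and the two
          // invariants of the merge policy described above. Not patch code.
          class MergePolicyInvariants {

            // level of a segment with docCount docs, given mergeFactor M and maxBufferedDocs B
            static int level(int docCount, int M, int B) {
              int buffers = (docCount + B - 1) / B; // ceil(docCount / B)
              int f = 0;
              long size = 1;
              while (size < buffers) {              // smallest f with M^f >= buffers
                size *= M;
                f++;
              }
              return f;
            }

            // check both invariants over segment doc counts listed left-to-right;
            // invariant 1 makes levels non-increasing, so same-level segments are consecutive
            static boolean invariantsHold(List<Integer> docCounts, int M, int B) {
              int prevLevel = Integer.MAX_VALUE;
              int sameLevelCount = 0;
              for (int dc : docCounts) {
                int f = level(dc, M, B);
                if (f > prevLevel) return false;      // invariant 1 violated
                sameLevelCount = (f == prevLevel) ? sameLevelCount + 1 : 1;
                if (sameLevelCount > M) return false; // invariant 2 violated
                prevLevel = f;
              }
              return true;
            }
          }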

          Yonik Seeley added a comment -

          Thanks for separating out the new merge policy Ning! I'm reviewing the patch now...
          Assuming everything looks good (it does so far), I'm inclined to commit it. I'm just giving a heads up to other lucene developers as this is a change in behavior to core lucene.

          I think the new merge policy is a positive change because:

          • flushing all ram segments separately from disk segments allows more efficient implementations of combination reader/writers (like buffered deletes) because docids won't change from the flush alone (a merge is needed to change ids)
          • flushing all buffered docs together leaves more optimization possibilities... something other than single-doc segments could be used to buffer in-mem docs in the future.
          • increases indexing performance in the presence of deleted documents or partially full segments (merges are minimized while the number of segments is maximized).
          • fixes worst-case behavior that can cause the number of segments to grow too large (way more than mergeFactor)

          Are there any concerns?

          Yonik Seeley added a comment -

          I also did a quick indexing performance test w/ Solr:

          maxBufferedDocs=100, mergeFactor=4, did 100K random overwriting adds in batches of 75 (75 docs added, dups deleted).
          It was 12% faster with this new merge policy.

          Ning Li added a comment -

          This is to update the delete-support patch after the commit of the new merge policy.

          • Very few changes to IndexWriter.
          • The patch passes all tests.
          • A new test called TestNewIndexModifierDelete is added to show different scenarios when using the delete methods in NewIndexModifier.

          Ning Li added a comment -

          [[ Old comment, sent by email on Thu, 6 Jul 2006 07:53:35 -0700 ]]

          Hi Otis,

          I will regenerate the patch and add more comments.

          Regards,
          Ning

          "Otis Gospodnetic
          (JIRA)"
          <jira@apache.org> To
          ningli@almaden.ibm.com
          07/05/2006 11:25 cc
          PM
          Subject
          [jira] Commented: (LUCENE-565)
          Supporting deleteDocuments in
          IndexWriter (Code and Performance
          Results Provided)

          [
          http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12419396
          ]

          Otis Gospodnetic commented on LUCENE-565:
          -----------------------------------------

          I took a look at the patch and it looks good to me (anyone else had a look?).
          Unfortunately, I couldn't get the patch to apply:

          $ patch -F3 < IndexWriter.patch
          (Stripping trailing CRs from patch.)
          patching file IndexWriter.java
          Hunk #1 succeeded at 58 with fuzz 1.
          Hunk #2 succeeded at 112 (offset 2 lines).
          Hunk #4 succeeded at 504 (offset 33 lines).
          Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines).
          missing header for unified diff at line 259 of patch
          (Stripping trailing CRs from patch.)
          can't find file to patch at input line 259
          Perhaps you should have used the -p or --strip option?
          The text leading up to this was:
          ...
          ...
          ...
          File to patch: IndexWriter.java
          patching file IndexWriter.java
          Hunk #1 FAILED at 802.
          Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines).
          1 out of 2 hunks FAILED -- saving rejects to file IndexWriter.java.rej

          Would it be possible for you to regenerate the patch against IndexWriter in
          HEAD?

          Also, I noticed ^Ms in the patch, but I can take care of those easily
          (dos2unix).

          Finally, I noticed in 2-3 places that the simple logging via the "infoStream"
          variable was removed, for example:

          • if (infoStream != null) infoStream.print("merging segments");

          Perhaps this was just an oversight?

          Looking forward to the new patch. Thanks!


          Ning Li added a comment -

          With the recent commits to IndexWriter, this patch no longer applies cleanly. The 5 votes for this issue encourage
          me to submit yet another patch. But before I do that, I'd like to briefly describe the design again and welcome all
          suggestions that help improve it and help get it committed.

          With the new merge policy committed, the change to IndexWriter is minimal: three zero-or-one-line functions are
          added and used.
          1 timeToFlushRam(): returns true if the number of ram segments >= maxBufferedDocs; used in maybeFlushRamSegments()
          2 anythingToFlushRam(): returns true if the number of ram segments > 0; used in flushRamSegments()
          3 doAfterFlushRamSegments(): does nothing; called in mergeSegments() if the merge is on ram segments

          The new IndexModifier is a subclass of IndexWriter and only overrides the three functions described above (a sketch follows the list).
          1 timeToFlushRam(): returns true if the number of ram segments >= maxBufferedDocs OR the number of buffered
          deletes >= maxBufferedDeletes
          2 anythingToFlushRam(): returns true if the number of ram segments > 0 OR the number of buffered deletes > 0
          3 doAfterFlushRamSegments(): properly flushes the buffered deletes
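
          For illustration, a minimal sketch of this overriding pattern. The field names (ramSegmentCount, bufferedDeleteTerms) and the thresholds are stand-ins for the patch's internal state, not its actual identifiers:

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          // Illustrative sketch only - not the patch code.
          class NewIndexModifierSketch {
            private int ramSegmentCount;                 // buffered ram segments (stand-in)
            private final List<String> bufferedDeleteTerms = new ArrayList<String>(); // stand-in
            private int maxBufferedDocs = 10;
            private int maxBufferedDeleteTerms = 10;

            // 1: flush when either buffer crosses its threshold
            protected boolean timeToFlushRam() {
              return ramSegmentCount >= maxBufferedDocs
                  || bufferedDeleteTerms.size() >= maxBufferedDeleteTerms;
            }

            // 2: there is something to flush if either buffer is non-empty
            protected boolean anythingToFlushRam() {
              return ramSegmentCount > 0 || !bufferedDeleteTerms.isEmpty();
            }

            // 3: after the ram segments are flushed (and before the commit),
            //    apply the buffered deletes; details omitted here
            protected void doAfterFlushRamSegments(boolean flushedRamSegments) throws IOException {
              bufferedDeleteTerms.clear();
            }
          }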

          The new IndexModifier supports all APIs from the current IndexModifier except one: deleteDocument(int doc).
          I had commented on this before: "I deliberately left that one out. This is because document ids are changing
          as documents are deleted and segments are merged. Users don't know exactly when segments are merged
          thus ids are changed when using IndexModifier."

          This behaviour is true for both the new IndexModifier and the current IndexModifier. If this is preventing the
          patch from getting accepted, I'm willing to add the method, but I will detail this behaviour in the Javadoc so
          users of the function are aware of it.

          Michael Busch added a comment -

          What are the reasons not to add the NewIndexModifier to Lucene? This issue already has 6 votes, so it seems to be very popular amongst users (there is only one issue that has more votes). I can say that I have been using it for a couple of months already; it works flawlessly and has made my life a lot easier.

          I think the main objections were that too many changes to IndexWriter were made in the earliest versions of this patch, but with the new merge policy committed, most of the new code is in the new class NewIndexModifier whereas the changes to IndexWriter are minimal.

          So I would like to encourage committer(s) to take another look, I think this would be a nice feature for the next Lucene release.

          Yonik Seeley added a comment -

          Lack of committer time... I've been busy enough that I've shied away from complexity and gravitated toward issues that I can handle in a single bite. I'm on PTO until the end of the year, so I expect my time to be more compressed.

          To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge()

          It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this?

          Ning, please do produce another patch to the latest trunk (but you might want to wait until after LUCENE-702 is sorted out).

          Yonik Seeley added a comment -

          > It would be nice to not have to continually open and close readers on segments
          > that aren't involved in a merge. Is there a way to do this?

          Hmmm, and what about segments that are involved in a merge?
          I assume it's a different reader that is used for deleting docs than used for merging, but it doesn't have to be...

          If SegmentInfos had a cached reader, that seems like it would solve both problems.
          I haven't thought about it enough to figure out how doable it is though.

          Michael McCandless added a comment -

          > If SegmentInfos had a cached reader, that seems like it would solve both problems.
          > I haven't thought about it enough to figure out how doable it is though.

          Good idea! I think this could also be used by reopen (LUCENE-743) to re-use readers.

          Yonik Seeley added a comment -

          > Good idea! I think this could also be used by reopen (LUCENE-743) to re-use readers.

          Yes, although reopen() needs more support than what would be needed for this (namely reference counting).
          One thing to probably watch out for is to avoid making the single-doc ram segments more expensive.

          Yonik Seeley added a comment -

          On 12/12/06, Ning Li <ning.li.li@gmail.com> wrote:
          > > To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge()
          >
          > I'm not sure I get this. Buffered deletes are only applied(flushed)
          > during ram flush. No buffered deletes are applied in the merges of
          > on-disk segments.

          What is important is to be able to apply deletes before any ids change.
          You could do it after every new lowest-level segment is written to the index (the flush), or you could choose to do it before a merge of the lowest-level on-disk segments. If none of the lowest-level segments have deletes, you could even defer the deletes until after all the lowest-level segments have been merged. This makes the deletes more efficient, since the cost goes from O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor * maxBufferedDocs)).

          If we can't reuse IndexReaders, this becomes more important.

          One could perhaps choose to defer deletes until a segment with deleted docs is involved in a merge.

          > > It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this?
          > > If SegmentInfos had a cached reader, that seems like it would solve both problems.
          > > I haven't thought about it enough to figure out how doable it is though.
          >
          > This is a good idea! One concern, however, is that caching readers
          > will cause a larger memory footprint. Is it acceptable?

          As I said, I haven't had time to think about it at all, but at the lowest level of reuse, it wouldn't increase the footprint at all in the event that deletes are deferred until a merge:

          The specific scenario I'm thinking of is, instead of:

          doAfterFlushRamSegments()
              open readers
              delete docs
              close readers
          segmentMerger()
              open readers
              merge segments
              close readers

          it would be:

          doAfterFlushRamSegments()
              open readers
              delete docs
          segmentMerger()
              merge segments
              close readers

          This cuts out an additional open-close cycle.
          You are right that other forms of reader caching could increase the footprint, but it's nice to have the option of trading some memory for performance.

          Yet another strategy a subclass of IndexWriter could choose is to only apply deletes to segments actually involved in a merge. Then the bigger segments in the index wouldn't continually have a reader opened and closed on them... it could all be deferred until a close, or until there are too many deletes buffered.

          Of course NewIndexModifier doesn't have to implement all these options to start with, but it would be nice if the extension hooks in IndexWriter could support them.

          Whew, this is why I was slow to get involved in this again

          Ning Li added a comment -

          > or you could choose to do it before a merge of the lowest level on-disk
          > segments. If none of the lowest level segments have deletes, you could
          > even defer the deletes until after all the lowest-level segments have been
          > merged. This makes the deletes more efficient since it goes from
          > O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor*maxBufferedDocs))

          I don't think I like these semantics, though. With the semantics in the patch,
          an update can be easily supported. With these semantics, an insert may be flushed
          while a delete that came before the insert may or may not have been flushed.

          > You are right that other forms of reader caching could increase the footprint,
          > but it's nice to have the option of trading some memory for performance.

          Agree. It'd be nice to cache all readers...

          Thanks again for your comments. Enjoy your PTO!

          Yonik Seeley added a comment -

          Hmmm, I see your point... If deletes are deferred, a different reader could go and open the index and see the additions but not the deletions.

          Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation)

          Anyway, it's less of a problem if opening a new reader is coordinated with the writing. That still does leave the crash scenario though.

          Ning Li added a comment -

          > Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation)

          No, it does not happen with the patch, no matter what the window size is.
          This is because the results of flushing ram - both inserts and deletes - are committed in the same transaction.

          Yonik Seeley added a comment -

          > both inserts and deletes - are committed in the same transaction.

          OK, cool. I agree that's the ideal default behavior.

          Yonik Seeley added a comment -

          Minor question... in the places that you use Vector, is there a reason you aren't using ArrayList?
          And in methods that pass a Vector, that could be changed to a List.

          Ning Li added a comment -

          > Minor question... in the places that you use Vector, is there a reason you aren't using ArrayList?
          > And in methods that pass a Vector, that could be changed to a List.

          ArrayList and List can be used, respectively.

          Paul Elschot added a comment -

          I'd like to give this a try over the upcoming holidays.
          Would it be possible to post a single patch?
          A single patch can be made by locally svn add'ing all new files
          and then doing an svn diff on all files involved from the top directory.

          Regards,
          Paul Elschot

          Ning Li added a comment -

          Many versions of the patch were submitted as new code was committed to IndexWriter.java. For each version, all changes made were included in a single patch file.

          I removed all but the latest version of the patch. Even this one is outdated by the commit of LUCENE-701 (lock-less commits). I was waiting for the commit of LUCENE-702 before submitting another patch. LUCENE-702 was committed this morning. So I'll submit an up-to-date patch over the holidays.

          On 12/18/06, Paul Elschot (JIRA) <jira@apache.org> wrote:
          > I'd like to give this a try over the upcoming holidays.

          That's great! We can discuss/compare the designs then. Or, we can discuss/compare the designs before submitting new patches.

          Ning Li added a comment -

          Here is the design overview. Minor changes were made because of lock-less commits.

          In the current IndexWriter, newly added documents are buffered in ram in the form of one-doc segments.
          When a flush is triggered, all ram documents are merged into a single segment and written to disk.
          Further merges of disk segments may be triggered.

          NewIndexModifier extends IndexWriter and supports document deletion in addition to document addition.
          NewIndexModifier not only buffers newly added documents in ram, but also buffers deletes in ram.
          The following describes what happens when a flush is triggered:

          1 merge ram documents into one segment and write it to disk;
          do not commit - segmentInfos is updated in memory, but not written to disk

          2 for each disk segment to which a delete may apply
          open reader
          delete docs*, write new .delN file (* Care is taken to ensure that an interleaved sequence of
          inserts and deletes for the same document is properly serialized.)
          close reader, but do not commit - segmentInfos is updated in memory, but not written to disk

          3 commit - write new segments_N to disk

          Further merges for disk segments work the same as before.

          As an option, we can cache readers to minimize the number of reader opens/closes. In other words,
          we can trade memory for better performance. The design would be modified as follows:

          1 same as above

          2 for each disk segment to which a delete may apply
          open reader and cache it if not already opened/cached
          delete docs*, write new .delN file

          3 commit - write new segments_N to disk

          The logic for disk segment merge changes accordingly: open reader if not already opened/cached;
          after a merge is complete, close readers for the segments that have been merged.
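
          For illustration, a condensed outline of this flush sequence. The method names are illustrative stand-ins, not the patch's identifiers:

          import java.io.IOException;

          // Illustrative outline of the flush-with-deletes transaction above.
          abstract class FlushSequenceSketch {

            void flushWithDeletes() throws IOException {
              // 1: merge the buffered ram documents into one on-disk segment;
              //    segmentInfos is updated in memory only - no commit yet
              mergeRamSegmentsToDisk();

              // 2: for each disk segment a buffered delete may apply to: open a
              //    reader, apply the deletes (writing a new .delN file), close
              //    the reader - still without committing
              applyBufferedDeletes();

              // 3: commit - write the new segments_N, making this flush's adds
              //    and deletes visible atomically, in one transaction
              commitSegmentInfos();
            }

            abstract void mergeRamSegmentsToDisk() throws IOException;
            abstract void applyBufferedDeletes() throws IOException;
            abstract void commitSegmentInfos() throws IOException;
          }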

          Jeremy F. Kassis added a comment -

          Happy New Year everyone. I'm personally very excited about this improvement to Lucene. It really begins to open Lucene up to service highly mutable data, important for the application I'm developing. Following the thread, it looks like quite a few people have favorably reviewed the patch. Perhaps it's time for a blessing and commit?

          Ning Li added a comment -

          The patch is updated because of the code committed to IndexWriter since the last patch. The high-level design is the same as before. See comments on 18/Dec/06.

          Care has been taken to make sure that if the writer/modifier tries to commit but hits a disk-full condition, the writer/modifier remains consistent and usable. A test case is added to TestNewIndexModifierDelete to test this.

          All tests pass.

          Michael McCandless added a comment -

          Thanks for redoing the patch Ning! I like the added test case for
          disk full.

          I've reviewed this and it looks great. I fixed a few small typos and
          whitespace issues (looks like a line-wrapper had jumped in at some
          point) and attached NewIndexModifier.Jan2007.take2.patch.

          I think this is the only issue holding up a Lucene 2.1 release (so
          far?). Yonik (or anyone) do you have any objections / questions about
          this patch? It's basically unchanged from before, just modified to
          accommodate recent fixes to IndexWriter.

          Yonik Seeley added a comment -

          I just reviewed this, and it looks good to me.
          I like how you managed to enable parallel analysis in updateDocument() too.

          Michael Busch added a comment -

          I tried the new patch out and everything looks good to me. One comment though: the public method NewIndexModifier.flush() just calls the public method flushRamSegments(). Might it be confusing to have two public methods that do exactly the same thing?

          Besides this minor question I'm all for committing this patch.

          Michael McCandless added a comment -

          The flush() was added to better match the current IndexModifier, based
          on feedback (bullet 6) above:

          https://issues.apache.org/jira/browse/LUCENE-565#action_12428035

          Actually, back when that feedback was given, flushRamSegments() was
          still private. I agree it's awkward now to have two separate methods
          that do the same thing.

          But, I prefer "flush" over "flushRamSegments" because flush() is more
          generic so it reveals less about how the IndexWriter makes use of its
          RAM and leaves freedom in the future to have more interesting use of
          RAM (like KinoSearch as one example).

          So I think the right fix would be to add a public IndexWriter.flush()
          that just calls flushRamSegments, and then make flushRamSegments
          private again, then remove the flush() method from NewIndexModifier?
          (The public flushRamSegments() has not yet been released so making it
          private again before we release 2.1 is OK).
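
          For illustration, the proposed shape (a sketch, not the committed code):

          import java.io.IOException;

          // Sketch of the proposed refactor: flush() as the public, generic entry
          // point, with flushRamSegments() private again. Illustrative only.
          class FlushRefactorSketch {

            // public entry point - says nothing about how RAM is used internally
            public void flush() throws IOException {
              flushRamSegments();
            }

            // private again, as proposed; the existing flush logic would live here
            private void flushRamSegments() throws IOException {
              // merge ram segments to disk, trigger further merges, etc.
            }
          }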

          Any objections to this approach? I will re-work the last patch &
          attach it.

          Michael Busch added a comment -

          Thanks for the explanation, Mike. I'd prefer flush() too and the changes you suggest look good to me!

          Michael McCandless added a comment -

          OK I've attached NewIndexModifier.Jan2007.take3.patch with that approach.

          I plan on committing this in the next day or two if there are no more questions/feedback....

          Thank you Ning for this great addition, and for persisting through this long process!

          Michael McCandless added a comment -

          I just committed this.

          Thank you Ning. Keep the patches coming!

          Michael McCandless added a comment -

          Reopening based on recent discussions on java-dev:

          http://www.gossamer-threads.com/lists/lucene/java-dev/45099

          Michael McCandless added a comment -

          OK I moved NewIndexModifier's methods into IndexWriter and did some
          small refactoring, tightening up protections, fixed javadocs,
          indentation, etc. NewIndexModifier is now removed.

          I like this solution much better!

          I also increased the default number of buffered delete terms before a flush is
          triggered from 10 to 1000. These buffered terms use very little
          memory, so I think it makes sense to have a larger default?

          So, this adds these public methods to IndexWriter:

          public void updateDocument(Term term, Document doc, Analyzer analyzer)
          public void updateDocument(Term term, Document doc)
          public synchronized void deleteDocuments(Term[] terms)
          public synchronized void deleteDocuments(Term term)
          public void setMaxBufferedDeleteTerms(int maxBufferedDeleteTerms)
          public int getMaxBufferedDeleteTerms()

          And this public field:

          public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = 1000;
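
          For illustration, a minimal usage sketch of the new methods (the constructor and Field flags are standard Lucene of this era; the "id"/"body" fields and the index path are made-up examples):

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.Term;

          // Illustrative usage of the new delete/update API on IndexWriter.
          public class DeleteViaWriterExample {
            public static void main(String[] args) throws Exception {
              IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

              Document doc = new Document();
              doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
              doc.add(new Field("body", "first version", Field.Store.NO, Field.Index.TOKENIZED));
              writer.addDocument(doc);

              // buffered delete by term, interleaved with adds on the same writer
              writer.deleteDocuments(new Term("id", "42"));

              // or delete-then-add atomically in a single call
              writer.updateDocument(new Term("id", "42"), doc);

              writer.close();
            }
          }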

          On the extension points, we had previously added these 4:

          protected void doAfterFlushRamSegments(boolean flushedRamSegments)
          protected boolean timeToFlushRam()
          protected boolean anythingToFlushRam()
          protected boolean onlyRamDocsToFlush()

          I would propose that instead we add only the first one above, but
          rename it to "doAfterFlush()". This is basically a callback that a
          subclass could use to do its own thing after a flush but before a
          commit.

          But then I don't think we should add any of the others. The
          "timeToFlushRam()" callback isn't really needed now that we have a
          public "flush()" method. And the other two are very specific to how
          IndexWriter implements RAM buffering/flushing, so unless/until we
          can think of a use case that needs them I'm inclined not to include
          them?

          Yonik, is there something in Solr that would need these last 2
          callbacks?

          I've attached the patch (LUCENE-565.Feb2007.patch) with these
          changes!

          Yonik Seeley added a comment -

          > OK I moved NewIndexModifier's methods into IndexWriter and did some
          > small refactoring, tightening up protections,

          > I would propose that instead we add only the first one above, but rename it to "doAfterFlush()".

          Yes, that sounds fine.

          The problem is that we wouldn't be able to take advantage of the hook because of the "tightening up protections". Access to the segments is key.

          + private SegmentInfos segmentInfos = new SegmentInfos(); // the segments
          + private SegmentInfos ramSegmentInfos = new SegmentInfos(); // the segments in ramDirectory
          + final private SegmentInfo buildSingleDocSegment(Document doc, Analyzer analyzer)

          So instead of changing these to private, how about package protected?

          Michael McCandless added a comment -

          OK, got it. I will change those 3 to package protection and then commit. Thanks Yonik.

          Michael McCandless added a comment -

          Closing all issues that were resolved for 2.1.


            People

            • Assignee: Michael McCandless
            • Reporter: Ning Li
            • Votes: 8
            • Watchers: 7