|
[
Permlink
| « Hide
]
Doug Cutting added a comment - 09/May/06 09:24 AM
Can you please attach diffs rather than complete files? The diffs should not not contain CHANGE comments. To generate diffs, check Lucene out of Subversion, make your changes, then, from the Lucene trunk, run something like 'svn diff > my.patch'. New files should first be added with 'svn add' so that they're included in the diff. Thanks!
I took a look at the patch and it looks good to me (anyone else had a look)?
Unfortunately, I couldn't get the patch to apply $ patch -F3 < IndexWriter.patch Would it be possible for you to regenerate the patch against IndexWriter in HEAD? Also, I noticed ^Ms in the patch, but I can take care of those easily (dos2unix). Finally, I noticed in 2-3 places that the simple logging via "infoStream" variable was removed, for example:
Perhaps this was just an oversight? Looking forward to the new patch. Thanks! For an overview of my changes, I'll repeat some of what I said in
my earlier e-mail (see http://www.gossamer-threads.com/lists/lucene/java-dev/35143 then add more detail about specific coding changes. Overview API Changes Coding Changes For simplicity of explanation, let's assume the index resides in a Changes to the IndexWriter variables:
Changes to IndexWriter methods:
Thanks for all the information about coding changes, that makes it easier to understand the diff.
Ideally this will become comments in the new diff, which I can commit. Yonik mentioned this in email. It does sound like a better place for this might be in a higher level class. IndexWriter would really not be just a writer/appender once delete functionality is added to it, even if it's the IndexReaders behind the scenes doing the work. So if you are going to be redoing the patch, consider this. Perhaps IndexModifier methods should be deprecated and it should get a new/your API? Hi Otis,
I've attached two patch files:
All unit test succeeded except the following one: However, the unit test has a problem, not the patch: IndexWriter's docCount() does not tell the actual number of documents in an index, only IndexReader's numDocs() does. For example, in a similar test below, where 10 documents are added, then 1 deleted, then 2 added, the last call to docCount() returns 12, not 11, with or without the patch. public void testIndexSimple() throws IOException { The reason for the docCount() difference in the unit test (which does not affect the correctness of the patch) is that flushRamSegments() in the patch merges all and only the segments in ram and write to disk, whereas the original flushRamSegments() merges not only the segments in ram but sometimes also one segment from disk (see in that function the comment "// add one FS segment?"). Regards, Hopefully, third time's a charm.
I rewrote IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass. NewIndexModifier is such a subclass that supports delete-by-term. Here is an overview of the changes: Changes to IndexWriter
NewIndexModifier
I tried out this patch (July18), and have a few comments...
First, it is nice to be able to add/remove documents with no need to care for switching between readers and writers, and without worrying for performance issues as result of that switching. I did not test for performance yet. This post is quite long, so here is an outline... ------------ (1) Compile error in test code ------------ (2) Failing tests - does this patch also fix a bug in current flushRamSegments()? ------------ (3) Additional tests I ran All tests passed, except for TestIndexModifier.testIndex() - this is the same failure as above - so, no problem detected in new class. ------------ (4) Javadocs Remarks ------------ (5) deleteDocument(int doc) not implemented ------------ (6) flush() not implemented ------------ (7) Method name - batchDeleteDocuments(Term[]) ------------ (8) Class name and placement + What's Next for this patch It seems to me that this class would be very useful for users, either as a new class or if it replaces the current implementation of IndexModifier. Latter would be possible only if the 2 missing methods mentioned above are added. In this case, the "immediate delete" behavior of current IndexModifier should be possible to achieve by users, by setting maxBefferedDeleteTerms to 1. One disadvantage of this class vs. current IndexModifier is the ability to add access to further methods of IndexReader. With current IndexModifier this is very simple (though not always efficient) - just follow code template in existing methods, i.e. close writer/reader and open reader/writer as required. With the new class, exposing further methods of IndexReader would be more of a challenge. Perhaps having a multiReader on all segment readers can do. I am not sure to what extent this should be a consideration, so just bringing it up. So, If this class replaces IndexModifier - fine. If not, how about calling it BufferredIndexWriter?
I tested just the IndexWriter from this code base, it does not seem to work. NewIndexModifier does work. I simply used IndexWriter to create several documents and then search for them. Nothing came back even though it seems something was written to disk.
> Yes I am including this patch as it is very useful for increasing
> the efficiency of updates as you described. I will be conducting > more tests and will post any results. Yes a patch for IndexWriter > will be useful so that the entirety of this build will work. > Thanks! I've attached a patch that works with the current code. The This patch had evolved with the help of many good discussions
Suggestions are welcome! Especially those that may help it get Doron, thank you very much for the review! I want to briefly comment
on one of your comments: > (5) deleteDocument(int doc) not implemented I deliberately left that one out. This is because document ids are This IndexWriter seems to work. Thanks. Great work!
I ran a performance test for interleaved adds and removes - and compared between IndexModifier and NewIndexModifier.
Few setups were tested, with a few combinations of "consecutive adds before a delete takes place", maxBufferredDocs, and "number of total test iterations", where each iteration does the conseutive adds and then does the deletes. Each setup ran in this order - orig indexModifier, new one, orig, new one, and the best time out of the two runs was used. Results indicate that NewIndexModifier is far faster for most setups. Attached is the performance test, the performance results, and the log of the run. The performance test is written as a Junit test, and it fails in case the original IndexModfier is faster than the new one by more than 1 second (smaller than 1 sec difference is considered noise). Test was run on XP (SP1) with IBM JDK 1.5. Test was first failing with "access denied" errors due to what seems to be an XP issue. So in order to run this test on XP (and probably other Windows platforms) the patch from http://issues.apache.org/jira/browse/LUCENE-665 It is interesting to notice that in addition to preformance gain, NewIndexModifier seems less sensitive to "access denied" XP problems, because it closes/reopens readers and writers less frequently, and indeed, at least in my runs, these errors had to be bypassed (by the "retry" patch) only for the current index-modifier.
It seems this writer works, but then some mysterious happens to the index and the searcher can no longer read it. I am using this in conjunction with Solr. The index files look ok, however a search will return nothing. I have seen this repeatedly over about 1 weeks time.
Is it that results that were returned are suddenly (say after updates) not returned anymore (indicating something bad happened to existing index)?
Or is it that the search does not reflect recent changes? I don't remember how often Solr closes and re-opens the writer/modifier... with this patch a delete does not immediately cause a "flush to disk" - so flushes are controlled by closing the NewIndexModifier (and re-opening, since there no flush() method) and by the limits for max-bufferred-docs and max-bufferred-deletes. If this seems relevant to your case, what limits are in effect? I started to flush the deletes after making them, which opens a new NewIndexModifier afterwards. I still see the same thing. I am starting off by deleting all documents by matching on a Term that all of them have. Commit (reopen), then perform a batch addDocuments. Then when a search is executed nothing is returned, and after an optimize the index goes down to 1K. Seems like some peculiarity in NewIndexModifier. Seems like the new documents are deleted even after they are added.
Just to make sure on the scenario - are you -
(1) using NewIndexModifier at all, or (2) just letting Solr use this IndexWriter (with the code changes introduced to ebable NewIndexModifier) instead of the Lucene's svn-head (or cetrain release) IndexModifier. As is, Solr would not use NewIndexModifier or IndexModifier at all. For case (2) above the bufferred deletes logic is not in effect at all. I wonder if it possibe to re-create this with a simple Lucene stand-alone (test) program rather than with Solr - it would be easier to analyze. Good points... I've actually used both NewIndexModifier and the parent. I've tried writing a new UpdateHandler, and incorporating the new IndexWriter into DirectUpdateHandler2. I will create a non-Solr reproduction of the issue. I guess it has something to do with ths doc ids being reused and so the new documents that are added are also marked as deleted as the number of documents would match almost exactly after the rebuild. I am not an expert in regards to that aspect of Lucene.
Having trouble reproducing this. Probably something in the other code. Thanks for the help and the patch, I feel more confident in it now.
I figured out the problem, the Solr DirectUpdateHandler2 expects to delete only a certain number of documents specifically the oldest first in some cases by using TermDocs and deleting by the doc id. NewIndexModifier deletes at the level of the SegmentReader however. Any good way to do this?
Updated performance test results - perf-test-res2.JPG - in avarage, the new code is 9 times faster!
What have changed? - in previous test I forgot to set max-buffered-deletes. After fixing so, I removed the test cases with max-buffer of 5,000 and up, because they consumed too much memory, and added more practical (I think) cases of 2000 and 3000. Here is a textual summary of the data in the attached image: max buf add/del 10 10 100 1000 2000 3000 Note: for the first two cases new code is slower by 11% and 55%, but this is a very short test case, - the absolute difference here is less than 100ms, comparing to the other cases, where the difference is measured in seconds and 10's of seconds. > the new code is 9 times faster!
That's a bit apples-and-oranges IMO, performance of existing code using IndexWriter is more important, and what I would be interested in. Say indexing 10000 documents to a RamDirectory with default settings (no deletes at all). I haven't had a chance to review the new code, so I don't know if it's something to worry about or not. I believe this patch probably also changes the merge behavior.
I think we need to discuss what exactly the new merge behavior is, if it's OK, what we think the index invariants should be (no more than x segments of y size, etc), and I'd like to see some code to test those invariants. Keep in mind the difficulty of getting the last IndexWriter patch correct (and that was a very minor patch to keep track of the number of buffered docs!) I agree - I also suspected it might change the merge behavior (and also had reflections from the repeated trials to have that simple Indexwriter buffered-docs patch correct...
Guess I just wanted to get a feeling if there is interest to include this patch before I delve into it too much - and the perf test was meant to see for my self if it really helps. I was a bit surprised that it speeds 9 times in an interleaving add/delete scenario. Guess this by itself now justifies delving into this patch, analyzing merge behavior as you suggest - will do - I think idealy should try this patch not to modify the merge behavior. About the test - l was trying to test what I thought is a realistic use scenario (max-buf, etc.) - I have a fixed version of the perf test that is easier to modify for different scenarios - can upload it here if there is interest. This patch features the new more robust merge policy. Reference on the new policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147
The following is a detailed description of the new merge policy and its properties. Overview of merge policy: A flush is triggered either by close() or by the number of ram segments LowerBound and upperBound set the limits on the doc count of a segment Case 1: number of worthy segments < mergeFactor, no merge, done. This merge policy guarantees two invariants if M does not change and Thanks for separating out the new merge policy Ning! I'm reviewing the patch now...
Assuming everything looks good (it does so far), I'm inclined to commit it. I'm just giving a heads up to other lucene developers as this is a change in behavior to core lucene. I think the new merge policy is a positive change because:
Are there any concerns? I also did a quick indexing performance test w/ Solr:
maxBufferedDocs=100, mergeFactor=4, did 100K random overwriting adds in batches of 75 (75 docs added, dups deleted). This is to update the delete-support patch after the commit of the new merge policy.
[[ Old comment, sent by email on Thu, 6 Jul 2006 07:53:35 -0700 ]] Hi Otis, I will regenerate the patch and add more comments. Regards, "Otis Gospodnetic [ Otis Gospodnetic commented on I took a look at the patch and it looks good to me (anyone else had a $ patch -F3 < IndexWriter.patch Would it be possible for you to regenerate the patch against IndexWriter in Also, I noticed ^Ms in the patch, but I can take care of those easily Finally, I noticed in 2-3 places that the simple logging via "infoStream"
Perhaps this was just an oversight? Looking forward to the new patch. Thanks! Provided) a to – With the recent commits to IndexWriter, this patch no longer applies cleanly. The 5 votes for this issue encourages
me to submit yet another patch. suggestions that help improve it and help get it committed. With the new merge policy committed, the change to IndexWriter is minimal: three zero-or-one-line functions are The new IndexModifier is a subclass of IndexWriter and only overwrites the three functions described above. The new IndexModifier supports all APIs from the current IndexModifier except one: deleteDocument(int doc). This behaviour is true for both the new IndexModifier and the current IndexModifier. If this is preventing this What are the reasons to not add the NewIndexModifier to Lucene? This issue has already 6 votes, so it seems to be very popular amongst users (there is only one issue that has more votes). I can say that I'm using it for a couple of months already, it works flawlessly and made my life a lot easier
I think the main objections were that too many changes to IndexWriter were made in the earliest versions of this patch, but with the new merge policy committed, most of the new code is in the new class NewIndexModifier whereas the changes to IndexWriter are minimal. So I would like to encourage committer(s) to take another look, I think this would be a nice feature for the next Lucene release. Lack of committer time... I've been busy enough that I've shied away from complexity and gravitated toward issues that I can handle in a single bite. I'm on PTO until the end of the year, so I expect my time to be more compressed.
To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge() It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this? Ning, please do produce another patch to the latest trunk (but you might want to wait until after > It would be nice to not have to continually open and close readers on segments
> that aren't involved in a merge. Is there a way to do this? Hmmm, and what about segments that are involved in a merge? If SegmentInfos had a cached reader, that seems like it would solve both problems. > If SegmentInfos had a cached reader, that seems like it would solve both problems.
> I haven't thought about it enough to figure out how doable it is though. Good idea! I think this could also be used by reopen ( > Good idea! I think this could also be used by reopen (
Yes, although reopen() needs more support than what would be needed for this though (namely reference counting). On 12/12/06, Ning Li <ning.li.li@gmail.com> wrote:
> > To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge() > > I'm not sure I get this. Buffered deletes are only applied(flushed) > during ram flush. No buffered deletes are applied in the merges of > on-disk segments. What is important is to be able to apply deletes before any ids change. If we can't reuse IndexReaders, this becomes more important. One could perhaps choose to defer deletes until a segment with deleted docs is involved in a merge. > > It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this? As I said, I haven't had time to think about it at all, but at the lowest level of reuse, it wouldn't increase the footprint at all in the event that deletes are deferred until a merge: The specific scenario I'm thinking of is instead of It would be: This cutting out an additional open-close cycle. Yet another strategy a subclass of IndexWriter could choose is to only apply deletes to segments actually involved in a merge. Then the bigger segments in the index wouldn't continually have an reader opened and closed on them.... it could all be deferred until a close, or until there are too many deletes buffered. Of course NewIndexModifier doesn't have to impliment all these options to start with, but it would be nice if the extension hooks in IndexWriter could support them. Whew, this is why I was slow to get involved in this again > or you could choose to do it before a merge of the lowest level on-disk
> segments. If none of the lowest level segments have deletes, you could > even defer the deletes until after all the lowest-level segments have been > merged. This makes the deletes more efficient since it goes from > O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor*maxBufferedDocs)) I don't think I like this semantics, though. With the semantics in the patch, > You are right that other forms of reader caching could increase the footprint, Agree. It'd be nice to cache all readers... Thanks again for your comments. Enjoy your PTO! Hmmm, I see your point... If deletes are deferred, a different reader could go and open the index and see the additions but not the deletions.
Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation) Anyway, it's less of a problem if opening a new reader is coordinated with the writing. That still does leave the crash scenario though. > Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation)
No, it does not happen with the patch, no matter what the window size is. > both inserts and deletes - are committed in the same transaction.
OK, cool. I agree that's the ideal default behavior. Minor question... in the places that you use Vector, is there a reason you aren't using ArrayList?
And in methods that pass a Vector, that could be changed to a List . I'd like to give this a try over the upcoming holidays.
Would it be possible to post a single patch? A single patch can be made by locally svn add'ing all new files and then doing an svn diff on all files involved from the top directory. Regards, Many versions of the patch were submitted as new code was committed to IndexWriter.java. For each version, all changes made were included in a single patch file.
I removed all but the latest version of the patch. Even this one is outdated by the commit of On 12/18/06, Paul Elschot (JIRA) <jira@apache.org> wrote: That's great! We can discuss/compare the designs then. Or, we can discuss/compare the designs before submitting new patches. Here is the design overview. Minor changes were made because of lock-less commits.
In the current IndexWriter, newly added documents are buffered in ram in the form of one-doc segments. NewIndexModifier extends IndexWriter and supports document deletion in addition to document addition. 1 merge ram documents into one segment and written to disk 2 for each disk segment to which a delete may apply 3 commit - write new segments_N to disk Further merges for disk segments work the same as before. As an option, we can cache readers to minimize the number of reader opens/closes. In other words, 1 same as above 2 for each disk segment to which a delete may apply 3 commit - write new segments_N to disk The logic for disk segment merge changes accordingly: open reader if not already opened/cached; Happy New Year everyone. I'm personally very excited about this improvement to Lucene. It really begins to open Lucene up to service highly mutable data, important for the application I'm developing. Following the thread, it looks like quite a few people have favorably reviewed the patch. Perhaps it's time for a blessing and commit?
The patch is updated because of the code committed to IndexWriter since the last patch. The high-level design is the same as before. See comments on 18/Dec/06.
Care has been taken to make sure if writer/modifier tries to commit but hits disk full that writer/modifier remains consistent and usable. A test case is added to TestNewIndexModifierDelete to test this. All tests pass. Thanks for redoing the patch Ning! I like the added test case for I've reviewed this and it looks great. I fixed a few small typos and I think this is the only issue holding up a Lucene 2.1 release (so I just reviewed this, and it looks good to me.
I like how you managed to enable parallel analysis in updateDocument() too. I tried the new patch out and everything looks good to me. One comment though: The public method NewIndexModifier.flush() just calls the public method flushRamSegments(). It might be confusing to have two public methods that do exactly the same?
Besides this minor question I'm all for committing this patch. The flush() was added to better match the current IndexModifier, based
on feedback (bullet 6) above: https://issues.apache.org/jira/browse/LUCENE-565#action_12428035 Actually, back when that feedback was given, flushRamSegments() was But, I prefer "flush" over "flushRamSegments" because flush() is more So I think the right fix would be to add a public IndexWriter.flush() Any objections to this approach? I will re-work the last patch & Thanks for the explanation, Mike. I'd prefer flush() too and the changes you suggest look good to me!
OK I've attached NewIndexModifier.Jan2007.take3.patch with that approach.
I plan on committing this in the next day or two if there are no more questions/feedback.... Thank you Ning for this great addition, and for persisting through this long process! I just committed this.
Thank you Ning. Keep the patches coming! Reopening based on recent discussions on java-dev:
OK I moved NewIndexModifier's methods into IndexWriter and did some
small refactoring, tightening up protections, fixed javadocs, indentation, etc. NewIndexModifier is now removed. I like this solution much better! I also increased the default number of deleted terms before a flush is So, this adds these public methods to IndexWriter: public void updateDocument(Term term, Document doc, Analyzer analyzer) And this public field: public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = 10; On the extensions points, we had previously added these 4: protected void doAfterFlushRamSegments(boolean flushedRamSegments) I would propose that instead we add only the first one above, but But then I don't think we should add any of the others. The Yonik, is there something in Solr that would need these last 2 I've attached the patch ( OK I moved NewIndexModifier's methods into IndexWriter and did some
small refactoring, tightening up protections, > I would propose that instead we add only the first one above, but rename it to "doAfterFlush()". Yes, that sounds fine. The problem is that we wouldn't be able to take advantage of the hook because of the "tightening up protections". Access to the segments is key. + private SegmentInfos segmentInfos = new SegmentInfos(); // the segments So instead of changing these to private, how about package protected? OK, got it. I will change those 3 to package protection and then commit. Thanks Yonik.
Closing all issues that were resolved for 2.1.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||