Grant Ingersoll added a comment - 30/Sep/07 04:58 PM
I think we should wait a bit more on this, as there still are a fair number of issues related to these changes in Lucene that need to be ironed out.
We could probably just wait for Lucene 2.3 to be released before releasing 1.3. I wouldn't be averse to pre-integrating the changes, though.
Yep, I agree with pre-integrating. Just from watching the Lucene discussions going on lately, I think it is worth letting a few more things be worked out before using a nightly build.
Updated to cover the broader scope of changes that affect upgrading to Lucene trunk.
Plan to implement:
1. Add a <mergeScheduler> tag that can have two values: concurrent or serial. Or would it be better to take in a class name? Doing the latter would mean we would have to have a no-arg constructor, right?
2. Add a <mergePolicy> tag that defines the merge policy and can have two values: byteSize or docCount. Or would it be better to take a class name?
NOTE: I am not proposing to handle the new reusable Document/Field/Token mechanism in Lucene, which should also be considered.
Perhaps for now mergePolicy should be aligned with the buffering limit (RAM or docs)?
If/when we do add <mergePolicy>, it seems like it should be a class name.
OK, this would imply it is set to LogByteSizeMergePolicy when setting <ramBufferSizeMB> and LogDocMergePolicy when setting <maxBufferedDocs>.
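For illustration only, the class-name form of the two proposed tags might look roughly like the fragment below. This is a sketch of the proposal as discussed, not something the patch is known to implement: the tag names come from the plan above, placing them under <indexDefaults> is an assumption, and the values are the Lucene 2.3 classes in org.apache.lucene.index, each of which would need a public no-arg constructor to be instantiated by name.

```xml
<!-- Hypothetical: class-name form of the proposed tags (placement assumed) -->
<indexDefaults>
  <!-- "concurrent" in the two-value form; SerialMergeScheduler would be "serial" -->
  <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
  <!-- "byteSize" in the two-value form; LogDocMergePolicy would be "docCount" -->
  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
</indexDefaults>
```

Taking a class name keeps the door open for custom policies, at the cost of requiring that no-arg constructor.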
First crack at implementing this. All tests pass on OS X except SolrJ's SolrExceptionTest, but for some reason that is failing on a clean version, too, so I am convinced it is not due to anything I did.
My personal benchmarking of just the Lucene side of things (see indexing.alg in Lucene contrib/benchmark) shows pretty significant performance gains. This is also anecdotally confirmed by my basic testing in Solr. I set the default to 16 MB, per Mike McCandless's defaults in Lucene, but this is probably too low given the server nature of Solr, where a lot more memory is likely to be available.
There are 4 new configuration possibilities:
Patch is inside the tar file, along with a bundle of the Lucene jars (not technically the latest, but only a couple of days old).
> Default is the maxBufferedDocs way, but this could be changed to be the other way around (and probably should be)
+1
For Solr, you never would want to use it. Trying to catch a glimpse of new segments as they are flushed leads to an inconsistent view of the index, since docs haven't been deleted yet.
We do need to document recommended solrconfig.xml changes in CHANGES.txt (in the migration section we normally have at the top) so that people with existing configs can get these performance gains.
The other thing is, Lucene actually supports setting both, and flushes based on whichever one is hit first. Is this worth supporting?
Should we even expose this, then? It seems like we should just make it false. I will write up more in the changes.
> The other thing is, Lucene actually supports setting both, and flushes based on whichever one is hit first. Is this worth supporting?
Since it takes no extra work in Solr, I guess we should just allow it.
> Should we even expose this [lucene autocommit], then? It seems like we should just make it false.
I think so... some people use Solr in some quite advanced ways.
Changes:
1. Updated CHANGES.txt with recommendations on settings.
2. Changed SolrIndexWriter from the last patch to allow for setting both maxBufferedDocs and ramBufferSizeMB.
3. Updated the various sample solrconfig.xml files to have a default of 32 MB for ramBufferSizeMB. Commented out maxBufferedDocs, but did not deprecate it.
4. Added a note to the various solrconfig.xml files explaining what Lucene does if BOTH ramBufferSizeMB and maxBufferedDocs are set.
The Lucene libraries are bundled with the previous patch, but are still needed.
Just a comment to say that we added this patch and saw rather significant improvements, on the order of 10-25% for different index tests.
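As a rough sketch of what the updated sample configuration described above could look like: the tag names and the 32 MB default come from this thread, while the maxBufferedDocs value of 1000 and the exact placement under <indexDefaults> are assumptions made for illustration.

```xml
<indexDefaults>
  <!-- Flush by RAM usage; 32 MB is the sample default mentioned above -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <!-- Commented out but not deprecated. If both limits are set, Lucene
       flushes on whichever one is hit first. The value is illustrative. -->
  <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```

Existing configs that only set maxBufferedDocs would presumably keep flushing by document count until ramBufferSizeMB is added.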
Dumb little script to copy over the required Lucene jars from a built Lucene directory. Takes in two parameters, the location of Lucene Home and the version to copy over. Requires Lucene to be built.
Belongs in the lib directory. For example,
Is there any update on getting this patch committed? We needed to be able to set some of the buffer sizes, so the script wouldn't help. Have other people experienced troubles with 2.3 and/or this patch that I should be wary of?
In my mind, there are still some issues w/ 2.3 dev that are being worked on. Personally, I think we should wait until 2.3 is released, but it would be good for people to get some running-time with this patch, if they can, before then, as that will help work out any issues remaining in 2.3.
Looks like Lucene 2.3 is shaping up to be released fairly soon (~2 weeks) and that many of the indexing/thread-safety concerns have been worked out. Might as well wait for the official release at this point, although I have been using 2.3-dev for a fairly long time at this point.
I don't think it's necessary to wait for the official Lucene 2.3 release, especially since there is still a lot of Solr work to be done (tokenizer upgrades to use char[], reusable tokenizers, reusable analyzers?, etc). We could upgrade to the latest snapshot when someone is willing to tackle those issues.
Updated to work against trunk.
As always, let me know if there is anything I can do to help get this committed.
Now mergeFactor can be effective as long as mergePolicy is an instance of LogMergePolicy.
How about <mergePolicy mergeFactor="10"/> or <mergePolicy> instead of <indexDefaults>?
I did some benchmarking of the autocommit functionality in Lucene (as opposed to in Solr, which is different). Currently, autocommit in Lucene is true by default, meaning that every flush is also committed. Solr adds its own layer on top of this with its commit semantics. There is a noticeable difference in Lucene's memory use and speed between autocommit = false and autocommit = true.
Some rough numbers using the autocommit.alg in Lucene's benchmark contrib (from trunk):
The first row has autocommit = true, the second is false, and then alternating. The key value is the rec/s, which is:
Notice also the diff in avgUsedMem. Adding this functionality may, perhaps, be more important to Solr's performance than the flush-by-RAM capability.
Update of patch to account for the fact that mergeFactor is only for Log-based merges. I left it as the <mergeFactor> tag, but put an instanceof clause in the init method of SolrIndexWriter to check whether mergeFactor is settable.
I think we're running into a very serious issue with trunk + this patch. Either the document summaries are not matched or the overall matching is 'wrong'. I did find this in the Lucene JIRA:
"Note that these changes will break users of ParallelReader because the
We're seeing rather consistent bad results, but only after 20-30k documents and multiple commits, and wondering if anyone else is seeing anything. I've verified that the results are bad even through Luke, which would seem to remove the search side of the Solr equation. The basic test case is to search for title:foo and get back documents that only have title:bar. We're going to start on a unit test, but given the document counts and the corpus we're testing against it may be a while, so I thought I'd ask to see if anyone had any hints. Removing this patch seems to remove the issue, so it doesn't appear to be a Lucene problem.
Yikes! Thanks for the report, Will. It certainly sounds like a Lucene issue to me (esp. because removal of this patch fixes things... that means it only happens under certain Lucene settings). Could you perhaps try the very latest Lucene trunk (there were some seemingly unrelated fixes recently)?
Will, are you using term vectors anywhere, or any customizations to Solr (at the lucene level)?
When you say "document summaries are not matched", do you mean that the incorrect documents are matched, or that the correct documents are matched but just highlighting is wrong?
Patched Solr + Lucene trunk is still broken. If anyone has any pointers for ways to coax this problem to happen before we get 20-30k large docs in the system, let me know and we can start working on a unit test; otherwise it's going to take a while to reproduce anything.
Thanks, Will. My guess at this point is a merging bug in Lucene, so you might be able to reproduce it by forcing more merges. Make mergeFactor=2 and lower how many docs it takes to do a merge (set maxBufferedDocs to 2, or set ramBufferSizeMB to 1).
Can you share your settings? (solrconfig.xml), or at least the relevant sections.
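The merge-forcing settings suggested above could be written along these lines. This is a sketch for reproducing the problem, not a recommended production config; it assumes the values sit under <indexDefaults> and are not overridden elsewhere in solrconfig.xml (for example in a <mainIndex> section).

```xml
<indexDefaults>
  <!-- Merge as soon as two segments exist at a level -->
  <mergeFactor>2</mergeFactor>
  <!-- Flush a new segment every 2 documents -->
  <maxBufferedDocs>2</maxBufferedDocs>
  <!-- Alternative trigger: flush after 1 MB of buffered RAM -->
  <!-- <ramBufferSizeMB>1</ramBufferSizeMB> -->
</indexDefaults>
```

With flushes and merges this frequent, indexing will be slow, but a merge-related bug should surface with far fewer documents.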
We have:
<mergeFactor>10</mergeFactor>
and I'm working on a unit test, but just adding a few terms per doc doesn't seem to trigger it, at least not 'quickly.'
You mentioned ParallelReader, are you using that, or any other patches? What is "large" in your terms?
We're not using ParallelReader, but we are using direct core access instead of going over HTTP. As for doc size, we're indexing Wikipedia but creating a number of extra fields. They are only large in comparison to any of the 'large volume' tests I've seen in most of the Solr and Lucene tests.
Direct core meaning embedded, right? It's interesting, b/c I have done a fair amount of Lucene 2.3 testing w/ Wikipedia (nothing like a free, fairly large dataset)
Can you reproduce the problem using Lucene directly? (Have a look at contrib/benchmark for a way to get Lucene/Wikipedia up and running quickly.) Also, are there any associated exceptions anywhere in the chain? Or is it just that your index is bad? Are you starting from a clean index or updating an existing one?
We're using SolrCore in terms of:
core = new SolrCore("foo", dataDir, solrConfig, solrSchema);
which is a bit more low level than normal. However, when we flipped back to Solr trunk + Lucene 2.3 everything was fine, so it leads me to believe that we are OK in that respect. I was going to try to reproduce with Lucene directly also, but that too is a bit outside the scope of what I have time for at the moment. And we're not getting any exceptions, just bad search results.
Also, are you doing multi-threaded indexing or are you indexing while searching?
We are doing multi-threaded indexing and searching while indexing; however, the 'bad' results come back after the indexing run is finished and the index itself is static.
OK, I've managed to reproduce this in a straight Lucene test case.
I'm doing further verification and will open up a Lucene bug shortly. Sounds like we will have a Lucene 2.3.1 release in the next week or so with the fixes in place. Will, in the meantime, when
I switched to the Lucene 2.3 branch, updated (and confirmed that Yonik's 1 line change was in place), reran the tests and still saw the same problem. If I missed something, please let me know.
Will, did you create a new index in your test?
Also make sure you are using this URL to check out the 2.3 branch sources: https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3
You should see 7 issues listed in the CHANGES.txt under "Bug fixes"?
Will, one more thing to try is to turn on assertions for org.apache.lucene.*; this may catch an issue sooner.
Will, you should be able to verify the Lucene version with this link:
http://localhost:8983/solr/admin/registry.jsp
It should be different from this:
Lucene Specification Version: 2.3.0
Lucene Implementation Version: 2.3.0 613715 - buschmi - 2008-01-21 01:30:48
The new Solr with the new Lucene did the trick. I made the mistake of using the 2.3 tag instead of the branch before, which was why I still saw the problem.
Super, I'm glad to hear that!
I'm going to upgrade to 2.3.1 and then double check this and commit, unless I hear any objections in the next day or two.
oops, we are already on 2.3.1, so then I will just commit in a day or two.
Committed revision 634016.