Issue Details (XML | Word | Printable)

Key: SOLR-342
Type: Improvement Improvement
Status: Resolved Resolved
Resolution: Fixed
Priority: Minor Minor
Assignee: Grant Ingersoll
Reporter: Grant Ingersoll
Votes: 1
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Solr

Add support for Lucene's new Indexing and merge features (excluding Document/Field/Token reuse)

Created: 20/Aug/07 08:52 PM   Updated: 05/Mar/08 08:23 PM
Return to search
Component/s: update
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works copyLucene.sh 2007-12-03 05:48 PM Grant Ingersoll 1 kB
Text File Licensed for inclusion in ASF works SOLR-342.patch 2008-02-07 03:53 PM Grant Ingersoll 22 kB
Text File Licensed for inclusion in ASF works SOLR-342.patch 2008-02-02 03:32 AM Grant Ingersoll 22 kB
Text File Licensed for inclusion in ASF works SOLR-342.patch 2007-10-26 05:10 PM Grant Ingersoll 33 kB
GZip Archive Licensed for inclusion in ASF works SOLR-342.tar.gz 2007-10-25 09:19 PM Grant Ingersoll 781 kB
Issue Links:
Blocker
 
Reference
 

Resolution Date: 05/Mar/08 08:23 PM


 Description  « Hide
LUCENE-843 adds support for new indexing capabilities using the setRAMBufferSizeMB() method that should significantly speed up indexing for many applications. To fix this, we will need trunk version of Lucene (or wait for the next official release of Lucene)

Side effect of this is that Lucene's new, faster StandardTokenizer will also be incorporated.

Also need to think about how we want to incorporate the new merge scheduling functionality (new default in Lucene is to do merges in a background thread)



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Grant Ingersoll added a comment - 30/Sep/07 04:58 PM
I think we should wait a bit more on this, as there still are a fair number of issues related to these changes in Lucene that need to be ironed out.

Mike Klaas added a comment - 01/Oct/07 06:19 PM
We could probably just wait for lucene 2.3 to be released before releasing 1.3. I wouldn't be averse to pre-integrating the changes, though.

Grant Ingersoll added a comment - 01/Oct/07 06:28 PM
yep, I agree with pre-integrating, just from watching the Lucene discussions going on lately, I think it is worth letting a few more things be worked out before using a nightly build.

Grant Ingersoll added a comment - 22/Oct/07 08:14 PM
Updated to cover the broader scope of changes that effect upgrading to Lucene trunk.

Plan to implement:
Add <ramBufferSizeMB> tag to specify the number of megabytes to give Lucene. Setting this value will override all other related settings (maxBufferedDocs, etc.) related to IndexWriter configuration

Add <mergeScheduler> tag that can have two values: concurrent or serial. Or would it be better to take in a classname? Doing the latter would mean we would have to have a no-arg constructor, right?

Add <mergePolicy> tag that defines the merge policy that can have two values: byteSize or docCount. Or would it be better to take a classname?

NOTE: I am not proposing to handle the new reusable Document/Field/Token mechanism in Lucene, which should also be considered.


Yonik Seeley added a comment - 23/Oct/07 11:25 PM
Perhaps for now mergePolicy should be aligned with the buffering limit (ram or docs)?
If/when we do add <mergePolicy> it seems like it should be a class name.

Grant Ingersoll added a comment - 25/Oct/07 02:39 PM
OK, this would imply it is set to LogByteSizeMergePolicy when setting <ramBufferSizeMB> and LogDocMergePolicy when setting <maxBufferedDocs>

Grant Ingersoll added a comment - 25/Oct/07 09:19 PM
First crack at implementing this. All tests pass on OS X except SolrJ's SolrExceptionTest, but for some reason that is failing on a clean version, too, so I am convinced it is not due to anything I did.

My personal benchmarking of just the Lucene side of things (see indexing.alg in Lucene contrib/benchmark) show pretty significant performance gains. This is also anecdotally confirmed by my basic testing in Solr.

I set the default to be 16MB, per Mike McCandless defaults in Lucene, but this is probably too low given the server nature of Solr where a lot more memory is likely to be available.

There are 4 new configuration possibilities:
<ramBufferSizeMB> - When set, <maxBufferedDocs> is set to DISABLE_AUTO_FLUSH. Default is the maxBufferedDocs way, but this could be changed to be the other way around (and probably should be)
<mergePolicy> - Set the MergePolicy, default is the new Lucene LogByteSizeMergePolicy. Old Lucene policy is LogDocMergePolicy. LogByteSizeMergePolicy by default.
<mergeScheduler> - Set the way merges are performed. New way is ConcurrentMergeScheduler which runs the merges in separate background threads. Old way was SerialMergeScheduler. Concurrent by default.
<luceneAutoCommit> - Specify whether Lucene IndexWriter should autoCommit flushes. false is the best for performance. Still need to develop recommendations for when to change this. Named it this way to avoid confusion with Solr's version. false by default.

Patch is inside the tar file, as well as a bundling of the Lucene jars (not technically the latest, but only a couple days old)


Yonik Seeley added a comment - 25/Oct/07 09:30 PM
> Default is the maxBufferedDocs way, but this could be changed to be the other way around (and probably should be)

+1
ramBufferSizeMB is now the default in Lucene AFAIK.
I think perhaps 32MB might be a good default.

luceneAutoCommit [...] Still need to develop recommendations for when to change this.

For Solr, you never would want to use it. trying to catch a glimpse of new segments as they are flushed leads to an inconsistent view of the index since docs haven't been deleted yet.

We do need to document recommended solrconfig.xml changes in CHANGES.txt (at the top in the migration section we normally have) for people to get these performance gains with existing configs.


Grant Ingersoll added a comment - 26/Oct/07 11:13 AM

+1
ramBufferSizeMB is now the default in Lucene AFAIK.
I think perhaps 32MB might be a good default.

The other thing is, Lucene actually supports setting both, and flushes based on whichever one is hit first. Is this worth supporting?

For Solr, you never would want to use it. trying to catch a glimpse of new segments as they are flushed leads to an inconsistent view of the index since docs haven't been deleted yet.

Should we even expose this, then? It seems like we should just make it false.

I will write up more in the changes.


Yonik Seeley added a comment - 26/Oct/07 03:43 PM
> The other thing is, Lucene actually supports setting both, and flushes based on whichever one is hit first. Is this worth supporting?

Since it takes no extra work in Solr, I guess we should just allow it.

> Should we even expose this [lucene autocommit], then? It seems like we should just make it false.

I think so... some people use Solr in some quite advanced ways.


Grant Ingersoll added a comment - 26/Oct/07 05:10 PM
Changes:

1. Updated changes.txt with recommendations on settings.

2. Changed SolrIndexWriter from last patch to allow for setting both maxBufferedDocs and ramBufferSizeMB.

3. Updated the various sample solrconfig.xml to have a default of 32 MB for ramBufferSizeMB. Commented out maxBufferedDocs, but did not deprecate it.

4. Added a note to the various solrconfig.xml explaining what Lucene does if BOTH ramBufferSizeMB and maxBufferedDocs is set.

The Lucene libraries are bundled with the previous patch, but are still needed.


Will Johnson added a comment - 03/Dec/07 02:26 PM
just a comment to say that we added this patch and saw rather signifigant improvements, on the order of 10-25% for different index tests.

Grant Ingersoll added a comment - 03/Dec/07 05:48 PM
Dumb little script to copy over the required Lucene jars from a built Lucene directory. Takes in two parameters, the location of Lucene Home and the version to copy over. Requires Lucene to be built.

Belongs in the lib directory.

For example,
./copyLucene.sh <path to Lucene> 2.3-dev


Will Johnson added a comment - 03/Dec/07 06:16 PM
is there any update on getting this patch committed? we needed to be able to set some of the buffer sizes so the script wouldn't help. have other people experienced tourbles with 2.3 and/or this patch that i should be wary of?

Grant Ingersoll added a comment - 03/Dec/07 06:24 PM
In my mind, there are still some issues w/ 2.3 dev that are being worked on. Personally, I think we should wait until 2.3 is released, but it would be good for people to get some running-time with this patch, if they can, before then, as that will help work out any issues remaining in 2.3.

Grant Ingersoll added a comment - 31/Dec/07 07:47 PM
Looks like Lucene 2.3 is shaping up to be released fairly soon (~2 weeks) and that many of the indexing/thread-safety concerns have been worked out. Might as well wait for the official release at this point, although I have been using 2.3-dev for a fairly long time at this point.

Yonik Seeley added a comment - 04/Jan/08 03:20 AM
I don't think it's necessary to wait for the official lucene 2.3 release, esp since there is still a lot of solr work to be done (tokenizer upgrades to use char[], reusable tokenizers, reusable analyzers?, etc). We could upgrade to the latest snapshot when someone is willing to tackle those issues.

Grant Ingersoll added a comment - 02/Feb/08 03:32 AM
Updated to work against trunk.

As always, let me know if there is anything I can do to help get this committed.


Koji Sekiguchi added a comment - 03/Feb/08 12:06 AM
Now mergeFactor can be effective as long as mergePolicy is an instance of LogMergePolicy.
How about

<mergePolicy mergeFactor="10"/>

or

<mergePolicy>
<mergeFactor>10</mergeFactor>
</mergePolicy>

instead of

<indexDefaults>
<mergeFactor>10</mergeFactor>
</indexDefaults>


Grant Ingersoll added a comment - 07/Feb/08 03:46 PM
I did some benchmarking of the autocommit functionality in Lucene (as opposed to in Solr, which is different). Currently, in Lucene autocommit is true by default, meaning that every time there is a flush, it is also committed. Solr adds its own layer on top of this with its commit semantics. There is a noticeable difference in memory used and speed in Lucene performance between autocommit = false and autocommit = true.

Some rough numbers using the autocommit.alg in Lucene's benchmark contrib (from trunk):
Operation round ac ram runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
[java] MAddDocs_200000 0rue2.00 1 200000 400.1 499.90 61,322,608 68,780,032
[java] MAddDocs_200000 - 1lse2.00 - - 1 - - 200000 - - 499.9 - - 400.08 - 49,373,632 - 75,018,240
[java] MAddDocs_200000 2rue2.00 1 200000 383.7 521.27 70,716,096 75,018,240
[java] MAddDocs_200000 - 3lse2.00 - - 1 - - 200000 - - 552.7 - - 361.89 - 68,069,464 - 75,018,240

The first row has autocommit = true, second is false, and then alternating. The key value is the rec/s, which is:
1. ac = true 400.1
2. ac = false 499.9
3. ac = true 383.7
4. ac = false 552.7

Notice also the diff in avgUsedMem. Adding this functionality may, perhaps, be more important to Solr's performance than the flush by RAM capability.


Grant Ingersoll added a comment - 07/Feb/08 03:53 PM
Update of patch to account for the fact that mergeFactor is only for Log based merges. I left it as the <mergeFactor> tag, but put in an instanceof clause in the init method of the SolrIndexWriter to check to see if the mergeFactor is settable.

Will Johnson added a comment - 08/Feb/08 05:10 PM
I think we're running into a very serious issue with trunk + this patch. either the document summaries are not matched or the overall matching is 'wrong'. i did find this in the lucene jira: LUCENE-994

"Note that these changes will break users of ParallelReader because the
parallel indices will no longer have matching docIDs. Such users need
to switch IndexWriter back to flushing by doc count, and switch the
MergePolicy back to LogDocMergePolicy. It's likely also necessary to
switch the MergeScheduler back to SerialMergeScheduler to ensure
deterministic docID assignment."

we're seeing rather consistent bad results but only after 20-30k documents and multiple commits and wondering if anyone else is seeing anything. i've verified that the results are bad even though luke which would seem to remove the search side of hte solr equation. the basic test case is to search for title:foo and get back documents that only have title:bar. we're going to start on a unit test but give the document counts and the corpus we're testing against it may be a while so i thought i'd ask to see if anyone had any hints.

removing this patch seems to remove the issue so i doesn't appear to be a lucene problem


Yonik Seeley added a comment - 08/Feb/08 05:20 PM
Yikes! Thanks for the report Will. It certainly sounds like a Lucene issue to me (esp because removal of this patch fixes things... that means it only happens under certain lucene settings). Could you perhaps try the very latest Lucene trunk (there were some seemingly unrelated fixes recently).

Yonik Seeley added a comment - 08/Feb/08 06:06 PM
Will, are you using term vectors anywhere, or any customizations to Solr (at the lucene level)?
When you say "document summaries are not matched", you you mean that the incorrect documents are matched, or that the correct documents are matched but just highlighting is wrong?

Will Johnson added a comment - 08/Feb/08 06:25 PM
patched solr + lucene trunk is stil broken. if anyone has any pointers for ways to coax this problem to happen before we get 20-30k large docs in the system let me know and we can start working on a unit test, otherwise it's going to take a while to reproduce anything.

Yonik Seeley added a comment - 08/Feb/08 06:34 PM
Thanks Will. My guess at this point is a merging bug in Lucene, so you might be able to reproduce by forcing more merges. Make mergeFacor=2 and lower how many docs it takes to do a merge (set maxBufferedDocs to 2, or set ramBufferSizeMB to 1).

Grant Ingersoll added a comment - 08/Feb/08 08:23 PM
Can you share your settings? (solrconfig.xml), or at least the relevant sections.

Will Johnson added a comment - 08/Feb/08 08:46 PM
we have:

<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>64</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>

and i'm working on a unit test but just adding a few terms per doc doesnt seem to trigger it, at least not 'quickly.'


Grant Ingersoll added a comment - 08/Feb/08 09:20 PM
You mentioned ParallelReader, are you using that, or any other patches?

problem to happen before we get 20-30k large docs

what is "large" in your terms?


Will Johnson added a comment - 08/Feb/08 09:49 PM
we're not using parallel reader but we are using direct core access instead of going over http. as for doc size, we're indexing wikipedia but creating anumber of extra fields. they are only large in comparison to any of the 'large volume' tests i've seen in most of the solr and lucene tests.
  • will

Grant Ingersoll added a comment - 08/Feb/08 10:01 PM
Direct core meaning embedded, right? It's interesting, b/c I have done a fair amount of Lucene 2.3 testing w/ Wikipedia (nothing like a free, fairly large dataset)

Can you reproduce the problem using Lucene directly? (have a look at contrib/benchmark for a way to get Lucene/Wikipedia up and running quickly)

Also, are there any associated exceptions anywhere in the chain? Or is it just that your index is bad? Are you starting from a clean index or updating an existing one?


Will Johnson added a comment - 08/Feb/08 10:48 PM
we're using SolrCore in terms of:

core = new SolrCore("foo", dataDir, solrConfig, solrSchema);
UpdateHandler handler = core.getUpdateHandler();
updateHandler.addDoc(command);

which is a bit more low level than normal however when we flipped back to solr trunk + lucene 2.3 everything was fine so it leads me to belive that we are ok in that respect.

i was going to try and reproduce with lucene directly also but that too is a bit outside the scope of what i have time for at the moment.

and we're not getting any exceptions, just bad search results.


Grant Ingersoll added a comment - 11/Feb/08 02:15 AM
Also, are you doing multi-threaded indexing or are you indexing while searching?

Will Johnson added a comment - 11/Feb/08 03:08 AM
we are doing multi-threaded indexing and searching while indexing however the 'bad' results come back after the indexing run is finished and the index itself is static.

Yonik Seeley added a comment - 11/Feb/08 08:37 PM
OK, I've managed to reproduce this in a straight lucene testcase.
I'm doing further verification and will open up a Lucene bug shortly.

Grant Ingersoll added a comment - 14/Feb/08 02:12 PM
Sounds like we will have a Lucene 2.3.1 release in the next week or so with the fixes in place. Will, in the meantime, when LUCENE-1177 and LUCENE-1173 get committed,maybe you could try either building the LUCENE 2.3 branch or trying out with the Lucene nightly build to see if it solves your issue and let us know.

Will Johnson added a comment - 15/Feb/08 08:52 PM
i switched to the lucene 2.3 branch, updated (and confirmed that yonik's 1 line change was in place), reran the tests and still saw the same problem. if i missed something please let me know.

Michael McCandless added a comment - 16/Feb/08 11:17 AM
Will, did you create a new index in your test?

Also make sure you are using this URL to checkout the 2.3 branch sources:

https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3

You should see 7 issues listed in the CHANGES.txt under "Bug fixes"?


Michael McCandless added a comment - 18/Feb/08 10:49 AM
Will, one more thing to try is to on assertions for org.apache.lucene.*; this may catch an issue sooner.

Yonik Seeley added a comment - 18/Feb/08 05:01 PM
Will, you should be able to verify the lucene version with this link:
http://localhost:8983/solr/admin/registry.jsp
It should be different from this:
Lucene Specification Version: 2.3.0
Lucene Implementation Version: 2.3.0 613715 - buschmi - 2008-01-21 01:30:48

Will Johnson added a comment - 19/Feb/08 05:23 PM
the new solr with the new lucene did the trick. i was made the mistake of using the 2.3 tag instead of the branch before which was why i still saw the problem.

Michael McCandless added a comment - 19/Feb/08 05:37 PM
Super, I'm glad to hear that!

Grant Ingersoll added a comment - 03/Mar/08 01:48 PM
I'm going to upgrade to 2.3.1 and then double check this and commit, unless I hear any objections in the next day or two.

Grant Ingersoll added a comment - 03/Mar/08 01:58 PM
oops, we are already on 2.3.1, so then I will just commit in a day or two.

Grant Ingersoll added a comment - 05/Mar/08 08:23 PM
Committed revision 634016.