Issue Details (XML | Word | Printable)

Key: LUCENE-1120
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Use bulk-byte-copy when merging term vectors

Created: 07/Jan/08 02:52 PM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: Index
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1120.patch 2008-01-07 02:55 PM Michael McCandless 32 kB
Text File Licensed for inclusion in ASF works LUCENE-1120.take2.patch 2008-01-21 02:53 AM Michael McCandless 34 kB

Lucene Fields: New
Resolution Date: 25/Jan/08 11:33 AM


 Description  « Hide
Indexing all of Wikipedia, with term vectors on, under the YourKit
profiler, shows that 26% of the time (!!) was spent merging the
vectors. This was without offsets & positions, which would make
matters even worse.

Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
fact keep up with the flushing of new segments in this test, and this
is on a strong IO system (Mac Pro with 4 drive RAID 0 array, 4 CPU
cores).

So, just like Robert's idea to merge stored fields with bulk copying
whenever the field name->number mapping is "congruent" (LUCENE-1043),
we can do the same with term vectors.

It's a little trickier because the term vectors format doesn't quite
make it easy to bulk-copy because it doesn't directly encode the
offset into the tvf file.

I worked out a patch that changes the tvx format slightly, by storing
the absolute position in the tvf file for the start of each document
into the tvx file, just like it does for tvd now. This adds an extra
8 bytes (long) in the tvx file, per document.

Then, I removed a vLong (the first "position" stored inside the tvd
file), which makes tvd contents fully position independent (so you can
just copy the bytes).

This adds up to 7 bytes per document (less for larger indices) that
have term vectors enabled, but I think this small increase in index
size is acceptable for the gains in indexing performance?

With this change, the time spent merging term vectors dropped from 26%
to 3%. Of course, this only applies if your documents are "regular".
I think in the future we could have Lucene try hard to assign the same
field number for a given field name, if it had been seen before in the
index...

Merging terms now dominates the merge cost (~20% over overall time
building the Wikipedia index).

I also beefed up TestBackwardsCompatibility unit test: test a non-CFS
and a CFS of versions 1.9, 2.0, 2.1, 2.2 index formats, and added some
term vector fields to these indices.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Michael McCandless added a comment - 07/Jan/08 02:55 PM
Attached patch. All tests pass.

(Note that the TestBackwardsCompatibility test will fail if you apply the patch because the new *.zip files I added aren't in the patch).

I think we should commit this for 2.3? It's a sizable gain in merging
performance.


Michael Busch added a comment - 07/Jan/08 03:21 PM

I think we should commit this for 2.3? It's a sizable gain in merging

HI Mike,

I'm planning to branch the trunk today. Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think?

-Michael


Michael McCandless added a comment - 07/Jan/08 03:36 PM

Considering the file format changes, it might be a bit risky to apply this patch last minute. I think we should commit this for 2.4. What do you think?

OK, I agree, it is somewhat risky, so let's wait. (Though it is a sizable gain in performance!).


Michael Busch added a comment - 07/Jan/08 03:36 PM

Indexing all of Wikipedia, with term vectors on, under the YourKit
profiler, shows that 26% of the time (!!) was spent merging the
vectors.

I wonder how accurate these profiling numbers are? Java profiling slows
down (non-native) method calls, but not I/O operations. Did you measure
the merge time before and after applying this patch without a profiling
tool?

-Michael


Michael McCandless added a comment - 07/Jan/08 03:39 PM

I wonder how accurate these profiling numbers are? Java profiling slows
down (non-native) method calls, but not I/O operations. Did you measure
the merge time before and after applying this patch without a profiling
tool?

Good point – I haven't measured outside of profiling. I plan to build a full Wiki index with and without this change to test....


Michael McCandless added a comment - 21/Jan/08 02:51 AM
Attached patch updated to current trunk. All tests pass. I plan to
commit after 2.3 is out...

OK I ran a performance test with this patch, indexing the first 200K
docs of Wikipedia using this alg:

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

doc.stored = true
doc.term.vector = true
doc.term.vector.positions = true
doc.term.vector.offsets = true

docs.file=/Volumes/External/lucene/wiki.txt

directory=FSDirectory

merge.scheduler=org.apache.lucene.index.SerialMergeScheduler

{ "Rounds"
ResetSystemErase
{ "BuildIndex"
CreateIndex

{ "AddDocs" AddDoc > : 200000 CloseIndex }

NewRound
} : 3

RepSumByPrefRound BuildIndex

I used SerialMergeScheduler so that I could measure time saved due to
faster merging.

Without the patch, best of 3 was 509.0 sec; with patch, best of 3 was
448.8 sec = 11.8% overall speedup.


Michael McCandless added a comment - 21/Jan/08 02:53 AM
Attaching the right patch this time...

Michael McCandless added a comment - 25/Jan/08 11:33 AM
I just committed this. Note that this is a [small] change to the index format, so if you use trunk to build an index, 2.3 won't be able to read it!