[LUCENE-629] Performance improvement for merging stored, compressed fields - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: core/index
Labels:
None

Description

Hello everyone,

currently the merging of stored, compressed fields is not optimal for the following reason: every time a stored, compressed field is being merged, the FieldsReader uncompresses the data, hence the FieldsWriter has to compress it again when it writes the merged fields data (.fdt) file. The uncompress/compress step is unneccessary and slows down the merge performance significantly.

This patch improves the merge performance by avoiding the uncompress/compress step. In the following I give an overview of the changes I made:

Added a new FieldSelectorResult constant named "LOAD_FOR_MERGE" to org.apache.lucene.document.FieldSelectorResult
SegmentMerger now uses an FieldSelector to get stored fields from the FieldsReader. This FieldSelector's accept() method returns the FieldSelectorResult "LOAD_FOR_MERGE" for every field.
Added a new inner class to FieldsReader named "FieldForMerge", which extends org.apache.lucene.document.AbstractField. This class holds the field properties and its data. If a field has the FieldSelectorResult "LOAD_FOR_MERGE", then the FieldsReader creates an instance of "FieldForMerge" and does not uncompress the field's data.
FieldsWriter checks if the field it is about to write is an instanceof FieldsReader.FieldForMerge. If true, then it does not compress the field data.

To test the performance I index about 350,000 text files and store the raw text in a stored, compressed field in the lucene index. I use a merge factor of 10. The final index has a size of 366MB. After building the index, I optimize it to measure the pure merge performance.

Here are the performance results:

old version:

Time for Indexing: 36.7 minutes
Time for Optimizing: 4.6 minutes

patched version:

Time for Indexing: 20.8 minutes
Time for Optimizing: 0.5 minutes

The results show that the index build time improved by about 43%, and the optimizing step is more than 8x faster.

A diff of the final indexes (old and patched version) shows, that they are identical. Furthermore, all junit testcases succeeded with the patched version.

Regards,
Michael Busch

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

optimized_merge_compressed_fields.patch
17/Jul/06 21:51
8 kB
Michael Busch

Activity

People

Assignee:: Unassigned

Reporter:: Michael Busch

Votes:: 2 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 17/Jul/06 21:48

Updated:: 28/Aug/22 11:29

Resolved:: 13/Aug/06 06:10