  1. Lucene - Core
  2. LUCENE-629

Performance improvement for merging stored, compressed fields

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None

      Description

      Hello everyone,

      Currently, merging stored, compressed fields is not optimal for the following reason: every time a stored, compressed field is merged, the FieldsReader uncompresses the data, so the FieldsWriter has to compress it again when it writes the merged fields data (.fdt) file. This uncompress/compress step is unnecessary and slows down merging significantly.

      This patch improves merge performance by avoiding the uncompress/compress step. Here is an overview of the changes I made:

      • Added a new FieldSelectorResult constant named "LOAD_FOR_MERGE" to org.apache.lucene.document.FieldSelectorResult
      • SegmentMerger now uses a FieldSelector to get stored fields from the FieldsReader. This FieldSelector's accept() method returns the FieldSelectorResult "LOAD_FOR_MERGE" for every field.
      • Added a new inner class to FieldsReader named "FieldForMerge", which extends org.apache.lucene.document.AbstractField. This class holds the field properties and its data. If a field has the FieldSelectorResult "LOAD_FOR_MERGE", then the FieldsReader creates an instance of "FieldForMerge" and does not uncompress the field's data.
      • FieldsWriter checks if the field it is about to write is an instanceof FieldsReader.FieldForMerge. If true, then it does not compress the field data.
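
The idea behind these changes can be illustrated with a small, self-contained sketch. It is not the patch and does not use Lucene classes: StoredField, mergeWithRecompress and mergePassThrough are hypothetical names chosen for illustration, and java.util.zip stands in for whatever compression the index actually uses. The first merge method mimics the old behavior (uncompress on read, compress again on write); the second hands the compressed bytes through unchanged, which is roughly what LOAD_FOR_MERGE and FieldForMerge are meant to achieve.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressedMergeSketch {

    /** Hypothetical stand-in for a stored field whose value is kept compressed on disk. */
    public static final class StoredField {
        public final String name;
        public final byte[] compressed;

        public StoredField(String name, byte[] compressed) {
            this.name = name;
            this.compressed = compressed;
        }
    }

    /** Deflates the raw bytes; stands in for the writer's compression step. */
    public static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates the packed bytes; stands in for the reader's uncompression step. */
    public static byte[] uncompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Old behavior: every field is uncompressed on read and compressed again on write. */
    public static List<StoredField> mergeWithRecompress(List<List<StoredField>> segments) {
        List<StoredField> merged = new ArrayList<>();
        for (List<StoredField> segment : segments) {
            for (StoredField field : segment) {
                byte[] raw = uncompress(field.compressed);
                merged.add(new StoredField(field.name, compress(raw)));
            }
        }
        return merged;
    }

    /** Patched behavior (sketch): compressed bytes are handed to the writer untouched. */
    public static List<StoredField> mergePassThrough(List<List<StoredField>> segments) {
        List<StoredField> merged = new ArrayList<>();
        for (List<StoredField> segment : segments) {
            merged.addAll(segment);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] text = "raw text of a document, raw text of a document"
                .getBytes(StandardCharsets.UTF_8);
        List<List<StoredField>> segments = new ArrayList<>();
        segments.add(Arrays.asList(new StoredField("body", compress(text))));
        segments.add(Arrays.asList(new StoredField("body", compress(text))));

        List<StoredField> slow = mergeWithRecompress(segments);
        List<StoredField> fast = mergePassThrough(segments);
        boolean identical = slow.size() == fast.size();
        for (int i = 0; identical && i < slow.size(); i++) {
            identical = Arrays.equals(slow.get(i).compressed, fast.get(i).compressed);
        }
        System.out.println("merged output identical: " + identical);
    }
}
```

Both paths are expected to produce the same merged bytes as long as the compressor is deterministic for a given input and level; the pass-through path simply skips the CPU work of inflating and deflating every field.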

      To test the performance, I indexed about 350,000 text files and stored the raw text in a stored, compressed field in the Lucene index, using a merge factor of 10. The final index has a size of 366MB. After building the index, I optimized it to measure the pure merge performance.

      Here are the performance results:

      old version:

      • Time for Indexing: 36.7 minutes
      • Time for Optimizing: 4.6 minutes

      patched version:

      • Time for Indexing: 20.8 minutes
      • Time for Optimizing: 0.5 minutes

      The results show that the index build time dropped by about 43% (36.7 to 20.8 minutes), and the optimizing step is roughly 9x faster (4.6 to 0.5 minutes).
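
For reference, the percentages follow directly from the timings listed above. The small helper below only redoes that arithmetic; the class and method names are made up for illustration.

```java
import java.util.Locale;

public class MergeSpeedup {

    /** Relative time saved, in percent, when going from 'before' to 'after'. */
    public static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    /** How many times faster 'after' is compared to 'before'. */
    public static double speedupFactor(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // Timings in minutes, taken from the results listed above.
        System.out.println(String.format(Locale.ROOT,
                "indexing time reduced by %.1f%%", percentReduction(36.7, 20.8)));
        System.out.println(String.format(Locale.ROOT,
                "optimize speedup: %.1fx", speedupFactor(4.6, 0.5)));
    }
}
```

With the reported numbers this gives a reduction of about 43.3% for indexing and a factor of about 9.2 for optimizing. Since the timings are rounded to one decimal place, the optimize factor is only approximate.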

      A diff of the final indexes (old and patched version) shows that they are identical. Furthermore, all JUnit test cases succeeded with the patched version.

      Regards,
      Michael Busch

            People

            • Assignee:
              Unassigned
              Reporter:
              buschmic Michael Busch
            • Votes:
              2
              Watchers:
              1
