Lucene - Core / LUCENE-1634

LogMergePolicy should use the number of deleted docs when deciding which segments to merge

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      I found that the IndexWriter.optimize(int) method does not pick up large segments with many deletes, even when most of their docs are deleted. The existence of such segments degraded query performance significantly.

      I created an index with 1 million docs, then went over all the docs and updated a few thousand at a time, running optimize(20) occasionally. What I saw were large segments with most of their docs deleted. Although these segments held few, if any, valid docs, they remained in the directory for a very long time, until more segments of comparable or larger size were created.

      This is because LogMergePolicy.findMergesForOptimize uses the size of each segment but does not take the number of deleted documents into account when it decides which segments to merge. A simple fix is to use the delete count to calibrate the segment size. I can create a patch for this.
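
      The calibration idea can be sketched roughly as follows. This is an illustration only, not the attached patch: the class and method names are hypothetical and do not exist in Lucene, and the segment statistics are passed in as plain numbers. It assumes a segment's effective size is either its live document count or its byte size scaled by the fraction of documents that are still live.

```java
// Hypothetical helper for illustration; not part of Lucene and not the attached patch.
final class CalibratedSegmentSize {
    private CalibratedSegmentSize() {}

    /** Returns the number of documents in a segment that are not marked deleted. */
    static int liveDocs(int docCount, int delCount) {
        if (docCount < 0 || delCount < 0 || delCount > docCount) {
            throw new IllegalArgumentException(
                "invalid counts: docCount=" + docCount + ", delCount=" + delCount);
        }
        return docCount - delCount;
    }

    /**
     * Returns the byte size scaled by the fraction of live documents, so that a
     * large segment whose docs are mostly deleted is treated as a small one.
     */
    static long calibratedBytes(long byteSize, int docCount, int delCount) {
        int live = liveDocs(docCount, delCount);
        if (docCount == 0) {
            return byteSize; // empty segment: nothing to scale by
        }
        double liveRatio = (double) live / docCount;
        return Math.round(byteSize * liveRatio);
    }
}
```

      With sizes calibrated this way, a segment of 1,000,000 docs with 900,000 deletes would be weighed like a 100,000-doc segment, so a size-based policy would no longer treat it as too large to merge.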

      1. LUCENE-1634.patch
        4 kB
        Yasuhiro Matsuda
      2. LUCENE-1634.patch
        2 kB
        Yasuhiro Matsuda

        Activity

        Yasuhiro Matsuda created issue
        Yasuhiro Matsuda made changes
          Field: Original Value -> New Value
          Attachment: (none) -> LUCENE-1634.patch [ 12407891 ]
        Michael McCandless made changes
          Assignee: (none) -> Michael McCandless [ mikemccand ]
        Michael McCandless made changes
          Fix Version/s: (none) -> 2.9 [ 12312682 ]
          Priority: Major [ 3 ] -> Minor [ 4 ]
        Yasuhiro Matsuda made changes
          Attachment: (none) -> LUCENE-1634.patch [ 12408065 ]
        Michael McCandless made changes
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Resolution: (none) -> Fixed [ 1 ]
        Mark Miller made changes
          Status: Resolved [ 5 ] -> Closed [ 6 ]
        Mark Thomas made changes
          Workflow: jira [ 12463229 ] -> Default workflow, editable Closed status [ 12563944 ]
        Mark Thomas made changes
          Workflow: Default workflow, editable Closed status [ 12563944 ] -> jira [ 12584577 ]

          People

          • Assignee: Michael McCandless
          • Reporter: Yasuhiro Matsuda
          • Votes: 1
          • Watchers: 1
