Lucene - Core / LUCENE-2506

A Stateful Filter That Works Across Index Segments

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: core/index
    • Lucene Fields: New, Patch Available

    Description

      By design, Lucene applies a Filter once for every segment in the index during searching. In particular, the reader passed to its #getDocIdSet method does not represent the whole underlying index: if the index has more than one segment, the given reader represents only a single segment. As a result, a filter cannot permit or prohibit documents in the search results based on terms that reside in segments preceding the current one.

      To address this limitation, we introduce a StatefulFilter, which builds on the Filter class so that it can remember terms across all segments of the underlying index. The need for stateful filters stems from the fact that some filters, although not most, care about the terms they have come across in prior segments. The StatefulFilter keeps track of the terms from prior segments in a cache that is maintained, on a per-thread basis, in a StatefulTermsEnum instance.
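
      The per-thread cache described above can be pictured with a small, Lucene-free sketch. Everything here is hypothetical and is not taken from the patch: the class name SeenTermsSketch, the method names, and the choice of "accept only the first occurrence of a term" as the filtering rule are assumptions made purely for illustration. Segments are modeled as plain lists of terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a per-thread term cache; not Lucene API
// and not the code in LUCENE-2506.patch.
class SeenTermsSketch {
    // One set of remembered terms per searching thread.
    private static final ThreadLocal<Set<String>> SEEN =
            ThreadLocal.withInitial(HashSet::new);

    // Called once per segment, in segment order. Returns the terms of this
    // segment that have not appeared in any earlier segment on this thread.
    static List<String> acceptFirstOccurrences(List<String> segmentTerms) {
        Set<String> seen = SEEN.get();
        List<String> accepted = new ArrayList<>();
        for (String term : segmentTerms) {
            if (seen.add(term)) { // add() returns false if already remembered
                accepted.add(term);
            }
        }
        return accepted;
    }

    // Drops this thread's state once the search has concluded.
    static void reset() {
        SEEN.remove();
    }

    public static void main(String[] args) {
        // Two "segments" filtered one after the other on the same thread.
        System.out.println(acceptFirstOccurrences(Arrays.asList("apple", "pear")));
        System.out.println(acceptFirstOccurrences(Arrays.asList("pear", "plum")));
        reset();
    }
}
```

      In this sketch the second call rejects "pear" because the first call already remembered it, which a filter that sees only one segment at a time could not do.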

      Additionally, to handle the case where a filter wants to accept the last matching term, we keep track of the TermsEnum#docFreq of each term in the segments filtered thus far. By comparing the sum of these per-segment values with the term's docFreq in the top-level reader, we can tell whether the current segment is the last one in which the term appears. For this to work correctly, the user must explicitly set the top-level reader on the StatefulFilter. Knowing the top-level reader also helps the StatefulFilter clean up after itself once the search has concluded.
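
      The docFreq comparison above amounts to a running sum per term. The following is a minimal sketch under stated assumptions: the class LastSegmentSketch and its method are hypothetical names, the top-level docFreq values are supplied as a plain map rather than read from a reader, and no Lucene classes are involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the docFreq bookkeeping; not Lucene API
// and not the code in LUCENE-2506.patch.
class LastSegmentSketch {
    // docFreq of each term as reported by the top-level reader (assumed given).
    private final Map<String, Integer> topLevelDocFreq;
    // Running sum of the per-segment docFreq values seen so far.
    private final Map<String, Integer> seenDocFreq = new HashMap<>();

    LastSegmentSketch(Map<String, Integer> topLevelDocFreq) {
        this.topLevelDocFreq = topLevelDocFreq;
    }

    // Records the term's docFreq in the current segment and reports whether
    // the running sum has reached the top-level docFreq, i.e. whether no
    // later segment can still contain the term.
    boolean isLastSegmentFor(String term, int segmentDocFreq) {
        int sum = seenDocFreq.merge(term, segmentDocFreq, Integer::sum);
        Integer total = topLevelDocFreq.get(term);
        return total != null && sum >= total;
    }

    public static void main(String[] args) {
        Map<String, Integer> top = new HashMap<>();
        top.put("lucene", 5); // the term occurs in 5 documents index-wide

        LastSegmentSketch sketch = new LastSegmentSketch(top);
        System.out.println(sketch.isLastSegmentFor("lucene", 2)); // false: 2 of 5
        System.out.println(sketch.isLastSegmentFor("lucene", 3)); // true: 5 of 5
    }
}
```

      The sketch also shows why the top-level reader has to be known up front: without the index-wide total there is nothing to compare the running sum against.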

      Note that we leave it to each concrete subclass of the stateful filter to decide what to remember in its state: it can retain as much or as little from prior segments as it deems necessary. In keeping with the TermsEnum interface, which the StatefulTermsEnum class extends, the filter decides which terms to accept based on the state of the search as a whole.
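
      To illustrate how subclasses might differ in what they remember, here is a rough sketch with two hypothetical implementations: one that retains every term it has seen, and one that retains only a counter. The names StatefulAcceptorSketch, FirstOccurrenceAcceptor, and LimitAcceptor are invented for this example and do not mirror the classes in the patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical base type: subclasses choose what state to keep across
// segments. Not Lucene API and not the code in LUCENE-2506.patch.
abstract class StatefulAcceptorSketch {
    // Decides whether to accept a term, given whatever the subclass remembers.
    abstract boolean accept(String term);

    public static void main(String[] args) {
        StatefulAcceptorSketch first = new FirstOccurrenceAcceptor();
        System.out.println(first.accept("apple")); // true: not seen before
        System.out.println(first.accept("apple")); // false: already remembered

        StatefulAcceptorSketch limited = new LimitAcceptor(1);
        System.out.println(limited.accept("apple")); // true: under the limit
        System.out.println(limited.accept("pear"));  // false: limit reached
    }
}

// Remembers every term seen so far and accepts only first occurrences.
class FirstOccurrenceAcceptor extends StatefulAcceptorSketch {
    private final Set<String> seen = new HashSet<>();

    @Override
    boolean accept(String term) {
        return seen.add(term);
    }
}

// Remembers only a counter and accepts at most `limit` terms overall.
class LimitAcceptor extends StatefulAcceptorSketch {
    private final int limit;
    private int accepted;

    LimitAcceptor(int limit) {
        this.limit = limit;
    }

    @Override
    boolean accept(String term) {
        if (accepted >= limit) {
            return false;
        }
        accepted++;
        return true;
    }
}
```

      The two subclasses share the same accept decision point but carry very different amounts of state, which is the flexibility the description leaves to each concrete filter.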

      Attachments

        1. LUCENE-2506.patch
          28 kB
          Karthick Sankarachary

            People

              Assignee: Unassigned
              Reporter: Karthick Sankarachary
              Votes: 1
              Watchers: 3