Lucene - Core · LUCENE-2506

A Stateful Filter That Works Across Index Segments

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
    • Lucene Fields:
      New, Patch Available

      Description

      By design, Lucene's Filter abstraction is applied once for every segment of the index during searching. In particular, the reader provided to its #getDocIdSet method does not represent the whole underlying index: if the index has more than one segment, the given reader covers only a single segment. As a result, a filter defined this way cannot permit or prohibit documents in the search results based on terms that reside in segments preceding the current one.

      To address this limitation, we introduce a StatefulFilter, which builds on the Filter class to make it capable of remembering terms across all segments of the underlying index. The need for making filters stateful stems from the fact that some filters, although not most, care about terms they may have come across in prior segments. The StatefulFilter keeps track of terms from prior segments in a cache maintained in a StatefulTermsEnum instance on a per-thread basis.
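The per-thread cache could be sketched roughly as follows (the class and method names here are illustrative, not the actual patch code):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: each searching thread accumulates the terms it has
// already seen in earlier segments, so a later segment can consult them
// without interference from other threads.
class PerThreadTermState {
    // One term cache per thread.
    private static final ThreadLocal<Set<String>> SEEN =
        ThreadLocal.withInitial(HashSet::new);

    /** Record a term encountered in the current segment. */
    static void remember(String term) {
        SEEN.get().add(term);
    }

    /** Was this term already seen in a previously visited segment? */
    static boolean seenBefore(String term) {
        return SEEN.get().contains(term);
    }

    /** Clear the state once the search over the top-level reader concludes. */
    static void reset() {
        SEEN.get().clear();
    }
}
```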

      Additionally, to address the case where a filter might want to accept the last matching term, we keep track of the TermsEnum#docFreq of the terms in the segments filtered so far. By comparing the running sum of those TermsEnum#docFreq values with the docFreq of the top-level reader, we can tell whether the current segment is the last one in which the current term appears. For this to work correctly, the user must explicitly set the top-level reader on the StatefulFilter. Knowing the top-level reader also helps the StatefulFilter clean up after itself once the search has concluded.
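The last-segment check reduces to simple arithmetic: keep a running sum of the per-segment docFreq values for a term, and compare it against the top-level reader's docFreq. A minimal sketch of that bookkeeping (names are illustrative, and Lucene types are replaced with plain ints):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the docFreq bookkeeping: once the running sum of
// per-segment docFreqs reaches the top-level docFreq, no later segment can
// contain the term, so the current segment is the last one that does.
class LastSegmentTracker {
    private final Map<String, Integer> seenDocFreq = new HashMap<>();

    /**
     * @param term            the current term
     * @param segmentDocFreq  docFreq of the term in the current segment
     * @param topLevelDocFreq docFreq of the term in the top-level reader
     * @return true if this segment is the last one containing the term
     */
    boolean isLastSegmentFor(String term, int segmentDocFreq, int topLevelDocFreq) {
        int total = seenDocFreq.merge(term, segmentDocFreq, Integer::sum);
        return total >= topLevelDocFreq;
    }
}
```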

      Note that we leave it up to each concrete sub-class of the stateful filter to decide what to remember in its state and what not to. In other words, it can choose to remember as much or as little from prior segments as it deems necessary. In keeping with the TermsEnum interface, which the StatefulTermsEnum class extends, the filter must decide which terms to accept or not, based on the holistic state of the search.

      1. LUCENE-2506.patch
        28 kB
        Karthick Sankarachary

        Issue Links

          Activity

          Trejkaz added a comment -

          Putting my vote on this one.

          About 80% of our filters were broken by the change to making filters work only over a single segment, and our hack to make it work in the meantime is not particularly great as it makes our filters only work with a single reader. Some better Filter API is required, which lets us know of the context.

          Earwin Burrfoot added a comment -

          Some better Filter API is required, which lets us know of the context.

          There's an opposite opinion. The way Lucene traverses segments one-by-one during searching is by no means the only possible one.
          I'm working on parallel search, where segments are traversed in parallel by different threads. This robs you of any possible context.
          So, I think it would be nice for the core API to be context-less. If you require context - build it on top.

          Michael McCandless added a comment -

          Unfortunately, on first glance, this solution sounds too heavyweight in general (eg it seems to have an unbounded cache of seg -> term -> docFreq).

          What if Filter.getDocIDSet also received the top reader and the docBase of this sub reader within that top reader?

          But, note that the Filter should not assume any order in which the sub readers are visited. EG in the future we may have multiple threads visiting even the same sub at once, and different subs in different orders.

          Or.. is it possible to require that these stateful filters are always "caching" filters? Ie, up front they build bitsets for all subs. Then the Filter API can remain unchanged... and we'd probably make a CachingStatefulFilter that holds the cache so the particular filter can just focus on building the bit sets.
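The "build the bitsets up front" idea might look roughly like this hypothetical outline (plain java.util.BitSet stands in for Lucene's DocIdSet, and the class name is invented):

```java
import java.util.BitSet;

// Hypothetical outline of the "caching" variant: the filter visits every
// sub-reader once, up front, and fills one bit set per segment. Per-segment
// calls afterwards just hand back the precomputed set, so the Filter API
// itself needs no notion of ordering or shared state.
class CachingStatefulFilterSketch {
    private final BitSet[] perSegment;

    CachingStatefulFilterSketch(int[] segmentMaxDocs) {
        perSegment = new BitSet[segmentMaxDocs.length];
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            // A real implementation would run the stateful matching logic
            // here, across all segments, in a well-defined order.
            perSegment[i] = new BitSet(segmentMaxDocs[i]);
        }
    }

    /** Mark a local doc as accepted during the up-front pass. */
    void accept(int segment, int localDoc) {
        perSegment[segment].set(localDoc);
    }

    /** Per-segment lookup needs no state: the work was done up front. */
    BitSet docIdSet(int segment) {
        return perSegment[segment];
    }
}
```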

          Trejkaz added a comment -

          What if Filter.getDocIDSet also received the top reader and the docBase of this sub reader within that top reader?

          That would be enough for us, would still allow for the parallel case, and would even be efficient in the parallel case for the majority of our filters. The bulk of our context-sensitive filters are actually only sensitive to the docBase - we are doing an SQL query, get back the doc IDs relative to the root reader and only have to offset them to the local one.

          There are still filters where we would have to stop the world and go back to build up a filter over the whole reader (e.g. filtering out non-current copies of a document), but we only have one or two filters like that, it can be done easily using a Future, and it would impact only the speed of our own code. (Of course, if Lucene ever allowed modifying existing documents in-place, it would remove a lot of that sort of hack, since we could have a 'current-version' field and remove it from the non-current copies...)
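For the docBase-only case described above, the translation between top-level and per-segment doc IDs is simple offset arithmetic; a small illustrative sketch:

```java
// Illustrative: translating between top-level ("global") doc IDs and
// per-segment ("local") doc IDs, given the segment's docBase, i.e. the
// top-level ID of the segment's first document.
class DocBaseMath {
    /** Global -> local: subtract the segment's starting offset. */
    static int toLocal(int globalDoc, int docBase) {
        return globalDoc - docBase;
    }

    /** Local -> global: add the segment's starting offset back. */
    static int toGlobal(int localDoc, int docBase) {
        return localDoc + docBase;
    }

    /** Does this global doc ID fall inside the segment at all? */
    static boolean inSegment(int globalDoc, int docBase, int maxDoc) {
        return globalDoc >= docBase && globalDoc < docBase + maxDoc;
    }
}
```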

          Earwin Burrfoot added a comment -

          we are doing an SQL query, get back the doc IDs relative to the root reader and only have to offset them to the local one

          What do you do when segments merge and these very-very transient doc IDs change?

          Uwe Schindler added a comment -

          For the same reasons Earwin gave, I strongly discourage any state. In prior versions of Lucene that might have worked, but it was never documented to work on any special type of IndexReader. Like queries, filters must be stateless. Any state is useless when you reopen readers, e.g. with near-real-time search. And only IndexSearcher, but perhaps no other subclass of Searcher, might work with a docBase. Because ParallelMultiSearcher is broken, I currently plan a replacement that does a parallel search on the sequential sub readers in a similar way to how PMS does currently. Only it does not work on several Searchers, but instead takes one IndexReader and uses its subs in parallel threads.

          Trejkaz added a comment -

          What do you do when segments merge and these very-very transient doc IDs change?

          We are not deleting documents, and have not witnessed merges changing doc IDs in any other situation. We have already heard all these lines about not using the doc ID as an ID, but the reality was that building filters takes a lot longer if you have to additionally perform a search for every row you get back from the database, so using an alternative ID was not an option.

          Earwin Burrfoot added a comment -

          We are not deleting documents, and have not witnessed merges changing doc IDs in any other situation.

          That assumption's gonna break very soon. Very very soon, when IndexWriter learns how to merge non-sequential segments.

          You can do it the same way the Zoie authors did. They have a special per-segment cache that maps external ID <-> docID.
          The cache is pre-loaded when opening the segment from a field. It costs some memory, but nothing frightening.
          They found it optimal to add the same Term("ID", "ID") to all the docs and then store external ID in that term's payload -> no TermsIndex pressure + gets read fast.
          Then you can convert between IDs at a cost of simple array lookup.

          Now, this IS reality. Bulletproof, fast lookups by external ID, and no broken cheats.

          P.S. Not sure it was exactly Zoie, but definitely some of LinkedIn's open-sourced Lucene stuff.
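The per-segment mapping described above boils down to one array per segment for docID -> external ID, plus a reverse map built when the segment opens. This is a sketch of the idea, not the actual Zoie code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a per-segment ID mapping in the style described above:
// docID -> external ID is an O(1) array lookup, and the reverse direction
// is a hash lookup over a map built once when the segment is opened.
class SegmentIdMap {
    private final long[] docToExternal;                   // index = local docID
    private final Map<Long, Integer> externalToDoc = new HashMap<>();

    /** externalIds[doc] is the external ID stored for local docID 'doc'. */
    SegmentIdMap(long[] externalIds) {
        this.docToExternal = externalIds;
        for (int doc = 0; doc < externalIds.length; doc++) {
            externalToDoc.put(externalIds[doc], doc);
        }
    }

    long externalId(int doc) {
        return docToExternal[doc];        // simple array lookup
    }

    /** Returns -1 if the external ID is not in this segment. */
    int docId(long externalId) {
        Integer doc = externalToDoc.get(externalId);
        return doc == null ? -1 : doc;
    }
}
```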

          Trejkaz added a comment -

          That sounds like it would cost a fair bit of memory if you had hundreds of millions of documents. The worst thing is that people actually load this much in, all the time using a desktop computer with only a few gigs of RAM. Because it's a desktop app, "why don't you get another gig of RAM for the cache" probably won't fly if it suddenly happened in a new release of our software. If it were a server app, maybe that would fly... maybe.

          But yeah, some variant on this which only reads some of it from disk instead of all of it, might speed things up a bit. A giant IntBuffer over a memory mapped file would probably be cached by the OS anyway.
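The memory-mapped variant is straightforward with java.nio: map the ID file read-only and view it as an IntBuffer, letting the OS page cache decide what stays resident. A sketch, where the file is assumed to hold 4-byte big-endian ints, one per document:

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: view a file of 4-byte ints as an IntBuffer via mmap, so only the
// pages actually touched are pulled into memory (and cached by the OS).
class MappedIdFile {
    private final IntBuffer ids;

    MappedIdFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            ids = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
        }
    }

    /** External ID for local docID 'doc', read straight from the mapping. */
    int externalId(int doc) {
        return ids.get(doc);
    }

    /** Number of ints (documents) in the file. */
    int count() {
        return ids.limit();
    }
}
```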

          Michael McCandless added a comment -

          That assumption's gonna break very soon. Very very soon, when IndexWriter learns how to merge non-sequential segments.

          Even if we break this assumption on the ootb config we will still have to provide a way to get it back. EG in this case, a merge policy which only selects contiguous segments (like the LogMergePolicy today).


  People

    • Assignee: Unassigned
    • Reporter: Karthick Sankarachary
    • Votes: 1
    • Watchers: 3