Lucene - Core
  1. Lucene - Core
  2. LUCENE-3533

Nuke SpanFilters and CachingSpanFilter (maybe move to sandbox)

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      SpanFilters are inefficient and OOM easily (they don't scale at all: Create large Lists of Objects for every match, also filtering deleted docs is a pain). Some talks with Grant on Eurocon and also the fact that caching of them is still broken in 3.x (but fixed on trunk) - I assume nobody uses them, so let's nuke them. They are also in wrong package, so standard statement: "Die, SpanFilters, die!"

        Activity

        Hide
        Grant Ingersoll added a comment -

        +1

        Show
        Grant Ingersoll added a comment - +1
        Hide
        Robert Muir added a comment -

        I'm gonna try to work on this: I have a perf improvement to spans and this class prevents it.

        Show
        Robert Muir added a comment - I'm gonna try to work on this: I have a perf improvement to spans and this class prevents it.
        Hide
        Robert Muir added a comment -

        Patch, nuking these, but also removing stupid extra seeks from SpanQueries.

        The problem is currently:

        1. seek to every term in every segment to compute IDF/etc
        2. seek to every term again in every segment to get docsAndPositionsEnum
        3. (possibly) seek to every term AGAIN in each segment if it doesn't exist, this is when d&penum returns null, to throw an error message "positions were not indexed". if this third seek returns null it will return EMPTY_SPANS

        with the patch we just do 1 seek

        Show
        Robert Muir added a comment - Patch, nuking these, but also removing stupid extra seeks from SpanQueries. The problem is currently: seek to every term in every segment to compute IDF/etc seek to every term again in every segment to get docsAndPositionsEnum (possibly) seek to every term AGAIN in each segment if it doesn't exist, this is when d&penum returns null, to throw an error message "positions were not indexed". if this third seek returns null it will return EMPTY_SPANS with the patch we just do 1 seek
        Hide
        Uwe Schindler added a comment -

        Robert do you want to take this? Should we nuke it on 3.x, too?

        Show
        Uwe Schindler added a comment - Robert do you want to take this? Should we nuke it on 3.x, too?
        Hide
        Robert Muir added a comment - - edited

        It would be good to get a review on the patch: I think its ok in general.

        it removes a lot of stupidity from the spans, except for one case:

        the SpanMultiTermQueryWrapper is still not single pass (it simply throws all termcontexts away).

        I thought about how to solve that one too, and I'm convinced its unfixable
        because SpanQueries aren't really query trees, its just one query that
        calls extractTerms on everything underneath it.

        For this reason, even if i made this MTQ one single-pass by allowing TermContexts
        to be passed to e.g. SpanOrQuery, it would work, but if you had that query inside
        another SpanQuery then it would still do the extra seek like it does now.

        But still, with the patch spans are a little better.

        Show
        Robert Muir added a comment - - edited It would be good to get a review on the patch: I think its ok in general. it removes a lot of stupidity from the spans, except for one case: the SpanMultiTermQueryWrapper is still not single pass (it simply throws all termcontexts away). I thought about how to solve that one too, and I'm convinced its unfixable because SpanQueries aren't really query trees, its just one query that calls extractTerms on everything underneath it. For this reason, even if i made this MTQ one single-pass by allowing TermContexts to be passed to e.g. SpanOrQuery, it would work, but if you had that query inside another SpanQuery then it would still do the extra seek like it does now. But still, with the patch spans are a little better.
        Hide
        Michael McCandless added a comment -

        Patch looks good! One less query relying on TermsEnum caching...

        Show
        Michael McCandless added a comment - Patch looks good! One less query relying on TermsEnum caching...

          People

          • Assignee:
            Robert Muir
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development