Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-3533

Nuke SpanFilters and CachingSpanFilter (maybe move to sandbox)

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      SpanFilters are inefficient and OOM easily (they don't scale at all: Create large Lists of Objects for every match, also filtering deleted docs is a pain). Some talks with Grant on Eurocon and also the fact that caching of them is still broken in 3.x (but fixed on trunk) - I assume nobody uses them, so let's nuke them. They are also in wrong package, so standard statement: "Die, SpanFilters, die!"

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Patch looks good! One less query relying on TermsEnum caching...

        Show
        mikemccand Michael McCandless added a comment - Patch looks good! One less query relying on TermsEnum caching...
        Hide
        rcmuir Robert Muir added a comment - - edited

        It would be good to get a review on the patch: I think its ok in general.

        it removes a lot of stupidity from the spans, except for one case:

        the SpanMultiTermQueryWrapper is still not single pass (it simply throws all termcontexts away).

        I thought about how to solve that one too, and I'm convinced its unfixable
        because SpanQueries aren't really query trees, its just one query that
        calls extractTerms on everything underneath it.

        For this reason, even if i made this MTQ one single-pass by allowing TermContexts
        to be passed to e.g. SpanOrQuery, it would work, but if you had that query inside
        another SpanQuery then it would still do the extra seek like it does now.

        But still, with the patch spans are a little better.

        Show
        rcmuir Robert Muir added a comment - - edited It would be good to get a review on the patch: I think its ok in general. it removes a lot of stupidity from the spans, except for one case: the SpanMultiTermQueryWrapper is still not single pass (it simply throws all termcontexts away). I thought about how to solve that one too, and I'm convinced its unfixable because SpanQueries aren't really query trees, its just one query that calls extractTerms on everything underneath it. For this reason, even if i made this MTQ one single-pass by allowing TermContexts to be passed to e.g. SpanOrQuery, it would work, but if you had that query inside another SpanQuery then it would still do the extra seek like it does now. But still, with the patch spans are a little better.
        Hide
        thetaphi Uwe Schindler added a comment -

        Robert do you want to take this? Should we nuke it on 3.x, too?

        Show
        thetaphi Uwe Schindler added a comment - Robert do you want to take this? Should we nuke it on 3.x, too?
        Hide
        rcmuir Robert Muir added a comment -

        Patch, nuking these, but also removing stupid extra seeks from SpanQueries.

        The problem is currently:

        1. seek to every term in every segment to compute IDF/etc
        2. seek to every term again in every segment to get docsAndPositionsEnum
        3. (possibly) seek to every term AGAIN in each segment if it doesn't exist, this is when d&penum returns null, to throw an error message "positions were not indexed". if this third seek returns null it will return EMPTY_SPANS

        with the patch we just do 1 seek

        Show
        rcmuir Robert Muir added a comment - Patch, nuking these, but also removing stupid extra seeks from SpanQueries. The problem is currently: seek to every term in every segment to compute IDF/etc seek to every term again in every segment to get docsAndPositionsEnum (possibly) seek to every term AGAIN in each segment if it doesn't exist, this is when d&penum returns null, to throw an error message "positions were not indexed". if this third seek returns null it will return EMPTY_SPANS with the patch we just do 1 seek
        Hide
        rcmuir Robert Muir added a comment -

        I'm gonna try to work on this: I have a perf improvement to spans and this class prevents it.

        Show
        rcmuir Robert Muir added a comment - I'm gonna try to work on this: I have a perf improvement to spans and this class prevents it.
        Hide
        gsingers Grant Ingersoll added a comment -

        +1

        Show
        gsingers Grant Ingersoll added a comment - +1

          People

          • Assignee:
            rcmuir Robert Muir
            Reporter:
            thetaphi Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development