Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.3, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This is essentially a spin-off from LUCENE-4236.
      We currently have no way to make any decision about how costly a DISI is, neither when we apply filters nor when we build conjunctions in BQ. Yet we have most of the information already and can easily expose it via a cost API, so that BS and FilteredQuery can apply optimizations on a per-segment basis.
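
      For illustration only, here is a minimal sketch of what such a cost hook on DocIdSetIterator could look like; the method name and exact semantics are assumptions at this point and are debated in the comments below.

      // Hypothetical sketch only, not the committed API: a per-segment cost
      // estimate on DocIdSetIterator that BooleanScorer (BS) and FilteredQuery
      // could consult when choosing an execution strategy.
      import java.io.IOException;

      public abstract class DocIdSetIterator {
        public abstract int docID();
        public abstract int nextDoc() throws IOException;
        public abstract int advance(int target) throws IOException;

        // Estimate of how expensive it is to exhaust this iterator, e.g.
        // roughly the number of documents it will match. Implementations may
        // return an upper bound.
        public abstract long cost();
      }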

      Attachments

      1. LUCENE-4607.patch
        75 kB
        Robert Muir
      2. LUCENE-4607.patch
        90 kB
        Simon Willnauer

          Activity

          Simon Willnauer added a comment -

          Here is a patch that adds #estimateDocCount to DISI. It still has some nocommits, mainly related to BitSets and cardinality, but I think it's fine as a start. I removed the TermConjunction specialization and changed the heuristic in FilteredQuery to use the estimated cost.
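
          For illustration (a hypothetical sketch, not the attached patch): a per-segment heuristic of the kind described above could compare the filter's estimated cost with the query scorer's, for example:

          // Assumed sketch (not the attached patch): choose between applying a
          // filter via random access and leap-frogging two iterators, based on
          // cost estimates gathered per segment.
          final class FilterStrategySketch {
            static boolean useRandomAccess(long filterCost, long scorerCost) {
              // If the filter is expected to match far more documents than the
              // query scorer will visit, probing the filter bits per candidate
              // doc is likely cheaper than advancing both iterators in lock-step.
              return filterCost > 100L * scorerCost; // the factor 100 is an arbitrary assumption
            }
          }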

          Uwe Schindler added a comment -

          Nice idea! At least the FilteredQuery code looks fine to me. I agree, the Bits interface and FixedBitSet have a costly cardinality (Bits does not have it at all...), so we should think about that. As far as I can see, it returns maxDoc as the cost.

          Uwe Schindler added a comment -

          One thing: estimatedDocCount is a long, but doc IDs in Lucene are still ints. This makes no sense, because the current Scorer/DISI interface can never return anything > Integer.MAX_VALUE, so the estimatedDocCount can never need 64 bits.

          We should maybe rethink making doc IDs in Lucene longs in the future, but until that has happened we should not mix both datatypes in public APIs; that would cause confusion.

          Robert Muir added a comment -

          When I did the cost estimate patch on LUCENE-4236, I chose a long too, but there it was trying to estimate the number of documents visited,
          e.g. the number of postings.

          So the formula for a conjunction would be min(subscorer cost) * #subscorers, and for a disjunction it's just the sum of all the subscorer costs, and so on.

          I felt that for scoring purposes this is more useful than the number of documents, but that's just my opinion.
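
          As a worked illustration of the formulas above (a sketch under the stated assumptions, not the committed implementation):

          final class CostFormulaSketch {
            // Conjunction: the cheapest clause drives how many candidates are
            // produced, and each candidate may be checked against every sub-scorer.
            static long conjunctionCost(long[] subCosts) {
              long min = Long.MAX_VALUE;
              for (long c : subCosts) {
                min = Math.min(min, c);
              }
              return min * subCosts.length;
            }

            // Disjunction: every sub-scorer's postings may be visited.
            static long disjunctionCost(long[] subCosts) {
              long sum = 0;
              for (long c : subCosts) {
                sum += c;
              }
              return sum;
            }
          }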

          Simon Willnauer added a comment -

          I tend to agree with Robert that using longs makes things a lot easier here too: we don't need to deal with int overflows. Maybe we should rename it to estimateDocsVisited?

          Robert Muir added a comment -

          The other idea (just for discussion) would be "number of i/os".

          So, for example, PhraseQuery's formula would likely use totalTermFreq rather than docFreq.
          This would reflect the fact that it's much more expensive than a conjunction.
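
          To make the distinction concrete (an assumed sketch, not Lucene code): under an "I/Os" view a term scorer's cost would track documents read, while a phrase scorer's cost would track positions read, e.g.:

          import org.apache.lucene.search.TermStatistics;

          // Hypothetical helpers; TermStatistics is only used here as a carrier
          // of docFreq/totalTermFreq, this is not how Lucene computes cost.
          final class IoCostSketch {
            static long termQueryCost(TermStatistics stats) {
              return stats.docFreq();        // roughly one postings entry per matching doc
            }

            static long phraseQueryCost(TermStatistics[] terms) {
              long cost = 0;
              for (TermStatistics t : terms) {
                cost += t.totalTermFreq();   // every position of every term may be read
              }
              return cost;
            }
          }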

          Simon Willnauer added a comment -

          The other idea (just for discussion) would be "number of i/os".

          I like this, yet I think in that case we should go back to estimateCost rather than docId etc., since for a bitset this is way different than for a PhraseScorer. I agree it should be a unit of work that we estimate.
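
          For example (an assumed illustration, not actual Lucene code): a bit-set-backed iterator can only report a document-level bound cheaply, whereas a positions-based scorer has a natural per-position unit of work.

          import org.apache.lucene.util.FixedBitSet;

          // Hypothetical illustration: for a FixedBitSet-backed iterator an exact
          // cost (cardinality) requires scanning all underlying words, so a cheap
          // upper bound such as length() may be the more practical estimate.
          final class BitSetCostSketch {
            static long cheapUpperBound(FixedBitSet bits) {
              return bits.length();      // number of bits, i.e. maxDoc for the segment
            }

            static long exactButCostly(FixedBitSet bits) {
              return bits.cardinality(); // counts the set bits
            }
          }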

          Uwe Schindler added a comment -

          I am fine with any solution. From my perspective as "API policeman", the mix of int and long in the same interface is not good. If estimateDocCount() returns long, then advance() must also take and return long, docId() must return long, FixedBitSet must take a long size, and finally numDocs and maxDoc must be longs.

          Robert Muir added a comment -

          Uwe, the current discussion is about not measuring the count of documents, but instead I/O operations.

          So advance(), docId(), FixedBitSet and so on are totally unrelated to that.

          Uwe Schindler added a comment -

          Uwe the current discussion is about not measuring count of documents, but instead i/o operations.

          Man, I just wanted to make clear that the current patch and the current issue summary have this problem.

          Robert Muir added a comment -

          OK: I agree that if it's a count of documents, the type should be consistent.

          But as I suggested, I don't think a count of documents is that great for how we will use this (picking the conjunction leader, filtering heuristics, maybe minimum-should-match disjunction scoring, maybe cleaning up the exact phrase scorer / adding its optimizations to the sloppy phrase scorer, maybe more BS versus BS2 heuristics in BooleanWeight, etc.).

          Robert Muir added a comment -

          Patch updated to trunk (as Stefan needs this for LUCENE-4571).

          I carefully reviewed all these costs and also integrated the API into spans, though those queries (e.g. the near query) don't yet use it; they should.

          The only query I changed was to make the ConjunctionTermScorer algorithm our ConjunctionScorer, sorting subscorers on cost (which is docFreq in the all-terms case).
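
          A sketch of the ordering idea described above (an assumed illustration, not the literal committed code):

          import java.util.Arrays;
          import java.util.Comparator;
          import org.apache.lucene.search.DocIdSetIterator;

          // Sort sub-iterators by ascending cost() so the cheapest (typically
          // rarest) clause leads the conjunction and the others are only
          // advanced to its candidate documents.
          final class ConjunctionOrderSketch {
            static DocIdSetIterator[] sortByCost(DocIdSetIterator[] iterators) {
              DocIdSetIterator[] sorted = iterators.clone();
              Arrays.sort(sorted, Comparator.comparingLong(DocIdSetIterator::cost));
              return sorted; // sorted[0] becomes the lead iterator
            }
          }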

          Michael McCandless added a comment -

          +1

          Stefan Pohl added a comment -

          Thanks! I will come up with a new prototype impl for LUCENE-4571 soon which will include and build upon this patch.

          Robert Muir added a comment -

          The patch might be hard to apply due to the svn replace, Stefan. I plan on doing final checks on it and committing it today.

          Really, it's silly Lucene doesn't have this for our scorers already, since e.g. textbook IR formulas make use of this stuff.

          We can open follow-up issues to improve the other scorers' logic (FilteredQuery, SpanNearQuery, etc.). I only did the obvious conjunction one, so we get the logic of ConjunctionTermScorer across arbitrary scorers.

          Just gimme a few hours; sorry for the delay.

          Simon Willnauer added a comment -

          Thanks Robert, I was thinking about this the other day! Cool that you brought this back to life! +1 to the patch. I think you need to add a CHANGES.TXT entry still.

          simon

          Robert Muir added a comment -

          I'll put it in when committing... I always wait until then; otherwise it's just asking for conflicts if you ever merge up the patch.

          Commit Tag Bot added a comment -

          [trunk commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1456670

          LUCENE-4607: add DISI/Spans.cost

          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1456686

          LUCENE-4607: add DISI/Spans.cost

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee: Unassigned
            • Reporter: Simon Willnauer
            • Votes: 1
            • Watchers: 5
