Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This feature proposes to add an insertWithOverflow to PriorityQueue so that callers can reuse the objects that are being dropped off the queue. Also, it changes heap to protected for easier extensibility of PQ

      1. LUCENE-1089.take3.patch
        10 kB
        Michael McCandless
      2. PriorityQueue.patch
        5 kB
        Shai Erera
      3. PriorityQueue-2.patch
        9 kB
        Shai Erera

        Issue Links

          Activity

          Hide
          shaie Shai Erera added a comment -

          The fixes done are:
          TopDocCollector:
          1. Use insertWithOverflow instead of insert.
          2. Get rid of unnecessary member variables (numHits and minScore).

          TestPriorityQueue:
          1. Add testInsertWithOverflow.

          PriorityQueue:
          1. Add insertWithOverflow
          2. Change heap to protected.
          3. user heap[1] in insert and insertWithOverflow instead of calling top().

          Show
          shaie Shai Erera added a comment - The fixes done are: TopDocCollector: 1. Use insertWithOverflow instead of insert. 2. Get rid of unnecessary member variables (numHits and minScore). TestPriorityQueue: 1. Add testInsertWithOverflow. PriorityQueue: 1. Add insertWithOverflow 2. Change heap to protected. 3. user heap [1] in insert and insertWithOverflow instead of calling top().
          Hide
          shaie Shai Erera added a comment -

          This one modifies other classes that called insert() with calls to insertWithOverflow():
          TopFieldDocCollector
          QualityQueriesFinder
          FuzzyQuery

          Show
          shaie Shai Erera added a comment - This one modifies other classes that called insert() with calls to insertWithOverflow(): TopFieldDocCollector QualityQueriesFinder FuzzyQuery
          Hide
          mikemccand Michael McCandless added a comment -

          Patch looks good; I made a few changes (attached take3):

          • Put back the minScore optimization in TopDocCollector.collect &
            FuzzyQuery, but, using reusableSD.score as the bar. This saves
            changing reusableSD & calling insertWithOverflow for every
            candidate. Then I put back TopFieldDocCollector.collect since it
            can't make that same optimization.
          • I implemented PriorityQueue.insert by calling insertWithOverflow
            (so we keep the tricky insert logic as single source).
          • I changed reusableSD in TopDocCollector to private (it was
            protected – was there a reason?).
          • Made another small optimization to PriorityQueue that saves the if
            statement in top() (this was #2 in Nadav's additional suggestions).
          • Small javadoc fixes.
          Show
          mikemccand Michael McCandless added a comment - Patch looks good; I made a few changes (attached take3): Put back the minScore optimization in TopDocCollector.collect & FuzzyQuery, but, using reusableSD.score as the bar. This saves changing reusableSD & calling insertWithOverflow for every candidate. Then I put back TopFieldDocCollector.collect since it can't make that same optimization. I implemented PriorityQueue.insert by calling insertWithOverflow (so we keep the tricky insert logic as single source). I changed reusableSD in TopDocCollector to private (it was protected – was there a reason?). Made another small optimization to PriorityQueue that saves the if statement in top() (this was #2 in Nadav's additional suggestions). Small javadoc fixes.
          Hide
          shaie Shai Erera added a comment -

          1. Actually there was why I removed the minScore. There's no point in evaluating minScore with current score because it's been evaluated again in insertWithOverflow. You don't lose anything by just populating reusableSD and calling insertWithOverflow. For that reason I removed collect() from TopFieldDocCollector. To me it looks like a cleaner code, and gain, performance wise, you only make the same comparison twice.

          2. That's a good one.

          3. Initially I thought TopFieldDocCollector would use it as FieldDoc (which extends ScoreDoc) and using TopDocCollector's collect() method. Then I implemented a protected newScoreDoc(doc, score) which was overidden by TopFieldDocCollector. That way, if TFDC uses TDC's collect() method, all it needs to do is override newScoreDoc(doc, score) to return a FieldDoc.

          4. Initially I implemented top() the same as you did, but then there were tests that created a PQ of size 0. However, with the change to initialize, it is safe to do it.

          I think that TDC and TFDC as I put them in the patch are extended cleaner and that way we don't do any extra logic (that's the reason why you would have needed to write the comment in TFDC's collect() method about simply comparing the score. Please re-consider my changes.

          Show
          shaie Shai Erera added a comment - 1. Actually there was why I removed the minScore. There's no point in evaluating minScore with current score because it's been evaluated again in insertWithOverflow. You don't lose anything by just populating reusableSD and calling insertWithOverflow. For that reason I removed collect() from TopFieldDocCollector. To me it looks like a cleaner code, and gain, performance wise, you only make the same comparison twice. 2. That's a good one. 3. Initially I thought TopFieldDocCollector would use it as FieldDoc (which extends ScoreDoc) and using TopDocCollector's collect() method. Then I implemented a protected newScoreDoc(doc, score) which was overidden by TopFieldDocCollector. That way, if TFDC uses TDC's collect() method, all it needs to do is override newScoreDoc(doc, score) to return a FieldDoc. 4. Initially I implemented top() the same as you did, but then there were tests that created a PQ of size 0. However, with the change to initialize, it is safe to do it. I think that TDC and TFDC as I put them in the patch are extended cleaner and that way we don't do any extra logic (that's the reason why you would have needed to write the comment in TFDC's collect() method about simply comparing the score. Please re-consider my changes.
          Hide
          mikemccand Michael McCandless added a comment -

          1. Actually there was why I removed the minScore. There's no point in evaluating minScore with current score because it's been evaluated again in insertWithOverflow. You don't lose anything by just populating reusableSD and calling insertWithOverflow. For that reason I removed collect() from TopFieldDocCollector. To me it looks like a cleaner code, and gain, performance wise, you only make the same comparison twice.

          But, imagine a query that has 55K hits to be collected (the avg. from
          your tests).

          With the original patch, you saved a tiny number of allocations (~70
          in your tests) yet added 55K inits of recycledSD and 55K additional
          method calls, which is surely a net loss of performance for the
          "typical" query. We set out here to improve performance of searching,
          but I think the original patch does the reverse.

          The proposed (modified) patch, puts back the original optimization
          (well, close to it: the original "minScore" is actually the minScore
          in the PQ) so we don't have the extra 55K MM inits+method call.

          I suppose we could also just make the API addition to PriorityQueue,
          but not change how TDC and TFDC and FQ call PriorityQueue.

          Show
          mikemccand Michael McCandless added a comment - 1. Actually there was why I removed the minScore. There's no point in evaluating minScore with current score because it's been evaluated again in insertWithOverflow. You don't lose anything by just populating reusableSD and calling insertWithOverflow. For that reason I removed collect() from TopFieldDocCollector. To me it looks like a cleaner code, and gain, performance wise, you only make the same comparison twice. But, imagine a query that has 55K hits to be collected (the avg. from your tests). With the original patch, you saved a tiny number of allocations (~70 in your tests) yet added 55K inits of recycledSD and 55K additional method calls, which is surely a net loss of performance for the "typical" query. We set out here to improve performance of searching, but I think the original patch does the reverse. The proposed (modified) patch, puts back the original optimization (well, close to it: the original "minScore" is actually the minScore in the PQ) so we don't have the extra 55K MM inits+method call. I suppose we could also just make the API addition to PriorityQueue, but not change how TDC and TFDC and FQ call PriorityQueue.
          Hide
          shaie Shai Erera added a comment -

          You're right, I wasn't thinking in those terms. Your changes should remain.

          Show
          shaie Shai Erera added a comment - You're right, I wasn't thinking in those terms. Your changes should remain.
          Hide
          mikemccand Michael McCandless added a comment -

          OK, thanks. I plan to commit in a day or two.

          Show
          mikemccand Michael McCandless added a comment - OK, thanks. I plan to commit in a day or two.
          Hide
          mikemccand Michael McCandless added a comment -

          I just committed this. Thanks Shai!

          Show
          mikemccand Michael McCandless added a comment - I just committed this. Thanks Shai!

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              shaie Shai Erera
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development