Lucene - Core
  1. Lucene - Core
  2. LUCENE-5795

More Like This: ensures selection of best terms is indeed O(n)

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New
    1. LUCENE-5795
      10 kB
      Alex Ksikes
    2. LUCENE-5795
      10 kB
      Alex Ksikes
    3. LUCENE-5795
      7 kB
      Alex Ksikes
    4. LUCENE-5795
      7 kB
      Alex Ksikes
    5. LUCENE-5795
      2 kB
      Alex Ksikes

      Activity

      Hide
      Simon Willnauer added a comment - - edited

      I think making the priority queue bounded makes perfect sense. Yet, I think when we touch this queue we should also make sure we get rid of the crazy Object[] that is holds. We should really add a struct like object that holds the individual values rather than an Object array that requires Float rather than float (we should prefer the latter).
      Further I think if we really want to make this more efficient we (not sure about the O( n ) though but that's a different story) we should not use

            res.insertWithOverflow(new Object[]{word, // the word
                topField, // the top field
                score,// overall score
                idf, // idf
                docFreq, // freq in all docs
                tf
            });
      

      it will add the element no matter if we exceed the size of the queue or not.

      we should rather do something like:

      final int limit = Math.min(maxQueryTerms, words.size());
      //...
      
      if (queue.size() < limit) {
        // there is still space in the queue
        queue.add(new ScoreTerm(word, topField, score, ...))
      } else {
        ScoreTerm term = queue.top();
        if (term.score() < score) { // update the smallest in the queue in place and update the queue.
          term.update(word, topField, score, ...);
          queue.updateTop();
        }
      } 
      

      I hope that makes sense?

      Show
      Simon Willnauer added a comment - - edited I think making the priority queue bounded makes perfect sense. Yet, I think when we touch this queue we should also make sure we get rid of the crazy Object[] that is holds. We should really add a struct like object that holds the individual values rather than an Object array that requires Float rather than float (we should prefer the latter). Further I think if we really want to make this more efficient we (not sure about the O( n ) though but that's a different story) we should not use res.insertWithOverflow( new Object []{word, // the word topField, // the top field score, // overall score idf, // idf docFreq, // freq in all docs tf }); it will add the element no matter if we exceed the size of the queue or not. we should rather do something like: final int limit = Math .min(maxQueryTerms, words.size()); //... if (queue.size() < limit) { // there is still space in the queue queue.add( new ScoreTerm(word, topField, score, ...)) } else { ScoreTerm term = queue.top(); if (term.score() < score) { // update the smallest in the queue in place and update the queue. term.update(word, topField, score, ...); queue.updateTop(); } } I hope that makes sense?
      Hide
      Alex Ksikes added a comment -

      Thanks, I've updated the patch as per your suggestions.

      Show
      Alex Ksikes added a comment - Thanks, I've updated the patch as per your suggestions.
      Hide
      Simon Willnauer added a comment -

      look pretty good though. can you make all the members and methods in ScoreTerm package private or private? I also think we should make retrieveTerms() private or package private, it seem not used elsewhere no?

      Show
      Simon Willnauer added a comment - look pretty good though. can you make all the members and methods in ScoreTerm package private or private? I also think we should make retrieveTerms() private or package private, it seem not used elsewhere no?
      Hide
      Alex Ksikes added a comment -

      Added test to check for top N.

      Show
      Alex Ksikes added a comment - Added test to check for top N.
      Hide
      Simon Willnauer added a comment -

      thanks alex, your test uses internal sun APIs ie. import com.sun.deploy.util.StringUtils; I don't think there is a string utils in java can you fix that, the test doesn't compile for me.

      Show
      Simon Willnauer added a comment - thanks alex, your test uses internal sun APIs ie. import com.sun.deploy.util.StringUtils; I don't think there is a string utils in java can you fix that, the test doesn't compile for me.
      Hide
      Alex Ksikes added a comment -

      Thanks for the comment Simon. I've just updated the patch.

      Show
      Alex Ksikes added a comment - Thanks for the comment Simon. I've just updated the patch.
      Hide
      ASF subversion and git services added a comment -

      Commit 1609474 from Simon Willnauer in branch 'dev/trunk'
      [ https://svn.apache.org/r1609474 ]

      LUCENE-5795: MoreLikeThisQuery now only collects the top N terms

      Show
      ASF subversion and git services added a comment - Commit 1609474 from Simon Willnauer in branch 'dev/trunk' [ https://svn.apache.org/r1609474 ] LUCENE-5795 : MoreLikeThisQuery now only collects the top N terms
      Hide
      ASF subversion and git services added a comment -

      Commit 1609493 from Simon Willnauer in branch 'dev/branches/branch_4x'
      [ https://svn.apache.org/r1609493 ]

      LUCENE-5795: MoreLikeThisQuery now only collects the top N terms

      Show
      ASF subversion and git services added a comment - Commit 1609493 from Simon Willnauer in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1609493 ] LUCENE-5795 : MoreLikeThisQuery now only collects the top N terms
      Hide
      Simon Willnauer added a comment -

      committed thanks Alex

      Show
      Simon Willnauer added a comment - committed thanks Alex

        People

        • Assignee:
          Simon Willnauer
          Reporter:
          Alex Ksikes
        • Votes:
          0 Vote for this issue
          Watchers:
          3 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development