Lucene - Core
LUCENE-6754

Optimize IndexSearcher.count for simple queries

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      IndexSearcher.count currently always creates a collector to compute the number of hits, but it could optimize some queries like MatchAllDocsQuery or TermQuery.

        Activity

        Adrien Grand added a comment -

        Here is a patch. count(MatchAllDocsQuery) returns reader.numDocs() and count(TermQuery) returns the sum of the doc freqs if there are no deletions.
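The shortcut described in this comment can be illustrated with a small self-contained sketch. It does not use Lucene's classes: `Segment`, `MatchAll`, `Term`, and `fastCount` are hypothetical stand-ins, and returning -1 to mean "fall back to a collector" is an assumption made for the example. The sketch also assumes the check for deletions is made per segment, which the comment does not specify.

```java
import java.util.List;
import java.util.Map;

public class CountShortcutSketch {

    /** Hypothetical stand-in for one index segment; not a Lucene class. */
    static final class Segment {
        final int maxDoc;                   // all documents, including deleted ones
        final int numDocs;                  // live documents only
        final Map<String, Integer> docFreq; // term -> doc count, ignoring deletions

        Segment(int maxDoc, int numDocs, Map<String, Integer> docFreq) {
            this.maxDoc = maxDoc;
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }

        boolean hasDeletions() {
            return numDocs < maxDoc;
        }
    }

    interface Query {}

    static final class MatchAll implements Query {}

    static final class Term implements Query {
        final String text;
        Term(String text) { this.text = text; }
    }

    /** Returns the hit count from statistics alone, or -1 if a collector is needed. */
    static int fastCount(Query q, List<Segment> segments) {
        if (q instanceof MatchAll) {
            int total = 0;
            for (Segment s : segments) {
                total += s.numDocs; // live docs already exclude deletions
            }
            return total;
        }
        if (q instanceof Term) {
            int total = 0;
            for (Segment s : segments) {
                if (s.hasDeletions()) {
                    return -1; // docFreq may still count deleted documents
                }
                Integer df = s.docFreq.get(((Term) q).text);
                total += (df == null) ? 0 : df;
            }
            return total;
        }
        return -1; // any other query: no shortcut in this sketch
    }
}
```

The term case bails out on deletions because a document-frequency statistic is not updated when documents are deleted, so it can overcount live hits.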

        ASF subversion and git services added a comment -

        Commit 1700791 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1700791 ]

        LUCENE-6754: Optimized IndexSearcher.count for simple queries.

        ASF subversion and git services added a comment -

        Commit 1700793 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1700793 ]

        LUCENE-6754: Optimized IndexSearcher.count for simple queries.

        Uwe Schindler added a comment - edited

        Hi, thanks for committing this. I am just not really happy about the instanceof checks. They also make it impossible for custom queries to improve on this. How about adding a method to the Query class that can be "optionally" implemented to return a count? By default it could throw an exception or, alternatively, use the collector approach. Queries like TermQuery could implement this count method if they have an optimized way to do this.

        An alternative would be to let the Weight have the new method. TermWeight already has its statistics and could implement TermWeight#count() easily. If there are deletions, the weight would call super.count().

        Queries like ConstantScoreQuery could delegate to the inner queries.
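A minimal sketch of the Weight-based proposal, under stated assumptions: the classes below are toy models that only borrow the names `Weight`, `TermWeight`, and `ConstantScoreWeight`, and the `matches` method and the `Segment` type are hypothetical simplifications rather than Lucene's API. The point is the shape of the design: a default `count` that every weight inherits, a term-specific override that uses a stored statistic when there are no deletions, and a wrapper that delegates to the inner weight.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class WeightCountSketch {

    /** Toy segment: docs[i] holds the terms of document i. Not a Lucene class. */
    static final class Segment {
        final String[][] docs;
        final boolean[] live; // live[i] == false marks document i as deleted
        final Map<String, Integer> docFreq = new HashMap<>(); // fixed at index time

        Segment(String[][] docs, boolean[] live) {
            this.docs = docs;
            this.live = live;
            for (String[] doc : docs) {
                for (String term : new HashSet<>(Arrays.asList(doc))) {
                    docFreq.merge(term, 1, Integer::sum); // counts deleted docs too
                }
            }
        }

        boolean hasDeletions() {
            for (boolean b : live) {
                if (!b) return true;
            }
            return false;
        }
    }

    static abstract class Weight {
        abstract boolean matches(Segment s, int doc);

        /** Default: visit every live document, as a counting collector would. */
        int count(Segment s) {
            int n = 0;
            for (int d = 0; d < s.docs.length; d++) {
                if (s.live[d] && matches(s, d)) n++;
            }
            return n;
        }
    }

    static final class TermWeight extends Weight {
        final String term;
        TermWeight(String term) { this.term = term; }

        @Override boolean matches(Segment s, int doc) {
            return Arrays.asList(s.docs[doc]).contains(term);
        }

        @Override int count(Segment s) {
            // The statistic is exact only when nothing has been deleted.
            return s.hasDeletions() ? super.count(s) : s.docFreq.getOrDefault(term, 0);
        }
    }

    static final class ConstantScoreWeight extends Weight {
        final Weight inner;
        ConstantScoreWeight(Weight inner) { this.inner = inner; }

        @Override boolean matches(Segment s, int doc) { return inner.matches(s, doc); }

        // Scores do not affect the count, so delegate to the wrapped weight.
        @Override int count(Segment s) { return inner.count(s); }
    }
}
```

With this shape, a custom weight that knows nothing about counting still works through the inherited default, and no instanceof checks are needed in the caller.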

        Adrien Grand added a comment -

        > How about adding a method to the query class that can be "optionally" implemented to return a count?

        I would really like to avoid adding new methods for that.

        There is another change that I have been thinking about recently, which would add a `boolean needsScores` parameter to Query.rewrite. This could be useful, e.g., to flatten boolean queries when scores are not needed so that we make better use of the cost API. In the context of this issue, this means that queries could rewrite to a MatchAllDocsQuery or to a TermQuery if scores are not needed, so that this optimization would apply. Would it work for you? Unwrapping CSQ would not be necessary anymore, as a CSQ would return the inner query in Query.rewrite if scores are not needed.
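The rewrite idea can be sketched as follows. This is an illustration only: the `Query`, `TermQuery`, and `ConstantScoreQuery` types here are toy stand-ins that reuse familiar names, and the `rewrite(boolean needsScores)` signature is the proposal from this comment, not an existing method being documented.

```java
public class RewriteSketch {

    interface Query {
        /** Hypothetical signature from the proposal; by default a query rewrites to itself. */
        default Query rewrite(boolean needsScores) {
            return this;
        }
    }

    static final class TermQuery implements Query {
        final String term;
        TermQuery(String term) { this.term = term; }
    }

    static final class ConstantScoreQuery implements Query {
        final Query inner;
        ConstantScoreQuery(Query inner) { this.inner = inner; }

        @Override
        public Query rewrite(boolean needsScores) {
            Query rewritten = inner.rewrite(needsScores);
            // The wrapper only changes scores, so drop it when scores are not needed.
            return needsScores ? new ConstantScoreQuery(rewritten) : rewritten;
        }
    }

    /** A counting caller would rewrite without scores, then check for a simple query. */
    static boolean canCountFromStatistics(Query q) {
        return q.rewrite(false) instanceof TermQuery;
    }
}
```

Under this scheme the counting code still recognizes only a few simple query types, but wrappers and custom queries can opt in by rewriting to one of them.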

        Uwe Schindler added a comment -

        Let me think about the rewrite logic...!

        About the other problem: I hate those instanceof checks, and they always remind me of the Highlighter, which is a disaster! Thinking about my previous mail, I tend to think that Weight, not Query, should get this method. Weight already has an optimization for bulk scoring, so I see no issue in adding "bulk/smart/fast counting", as long as there is a default implementation that all queries that do not care about it inherit automatically. Queries that really would like to implement it are free to do so, but there is no requirement.

        I just wanted to start the discussion about this on the issue. Unfortunately I was a bit late, but that does not matter. I can make a proposal if others think the same.


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand