Details

    • Type: Task Task
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today we have extractTerms on Query, but it is supposed to only be called after the query has been specialized to a given IndexReader using Query.rewrite(IndexReader) to allow some complex queries to replace terms "matchers" with actual terms (eg. WildcardQuery).

      However, we already have an abstraction for indexreader-specialized queries: Weight. So I think it would make more sense to have extractTerms on Weight. This would also remove the trap of calling extractTerms on a query which is not rewritten yet.

      Since Weights know about whether scores are needed or not, I also hope this would help improve the extractTerms semantics. We currently have 2 use-cases for extractTerms: distributed IDF and highlighting. While the former only cares about terms which are used for scoring, it could make sense to highlight terms that were used for matching, even if they did not contribute to the score (eg. if wrapped in a ConstantScoreQuery or a BooleanQuery FILTER clause). So highlighters could do searcher.createNormalizedWeight(query, false).extractTerms(termSet) to get all terms that were used for matching the query while distributed IDF would instead do searcher.createNormalizedWeight(query, true).extractTerms(termSet) to get scoring terms only.

      1. LUCENE-6425.patch
        64 kB
        Adrien Grand
      2. LUCENE-6425.patch
        63 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is a patch: highlighters extract terms on a Weight that does not need scores (as mentionned in the description) while ShardSearchingTestBase and Solr's ExactStatsCache (used for distributed IDF) use a Weight with needsScores=true.

        Show
        Adrien Grand added a comment - Here is a patch: highlighters extract terms on a Weight that does not need scores (as mentionned in the description) while ShardSearchingTestBase and Solr's ExactStatsCache (used for distributed IDF) use a Weight with needsScores=true.
        Hide
        Yonik Seeley added a comment -

        However, we already have an abstraction for indexreader-specialized queries: Weight.

        Yeah, it makes sense to have a call to get the matching terms there.

        searcher.createNormalizedWeight(query, true).extractTerms(termSet) to get scoring terms only.

        +1

        It might still make sense to have an extractTerms on query to get explicit terms (those that do not depend any specific index?)

        Show
        Yonik Seeley added a comment - However, we already have an abstraction for indexreader-specialized queries: Weight. Yeah, it makes sense to have a call to get the matching terms there. searcher.createNormalizedWeight(query, true).extractTerms(termSet) to get scoring terms only. +1 It might still make sense to have an extractTerms on query to get explicit terms (those that do not depend any specific index?)
        Hide
        Adrien Grand added a comment -

        It might still make sense to have an extractTerms on query to get explicit terms (those that do not depend any specific index?)

        In that case, I think it would be as easy to rewrite against an empty index? This is what PostingsHighlighter does in order to avoid costly rewrites:

        IndexReader emptyReader = new MultiReader();
        Set<Term> termSet = new HashSet<>();
        new IndexSearcher(emptyReader).createNormalizedWeight(query, false).extractTerms(termSet);
        
        Show
        Adrien Grand added a comment - It might still make sense to have an extractTerms on query to get explicit terms (those that do not depend any specific index?) In that case, I think it would be as easy to rewrite against an empty index? This is what PostingsHighlighter does in order to avoid costly rewrites: IndexReader emptyReader = new MultiReader(); Set<Term> termSet = new HashSet<>(); new IndexSearcher(emptyReader).createNormalizedWeight(query, false ).extractTerms(termSet);
        Hide
        Robert Muir added a comment -

        -1 to maintain two extractTerms method, or make the API confusing in that way.

        Weight or Query, choose one.

        Show
        Robert Muir added a comment - -1 to maintain two extractTerms method, or make the API confusing in that way. Weight or Query, choose one.
        Hide
        Robert Muir added a comment -

        Can we please cleanup the postingshighlighter changes?

        • Don't leave the now-unused rewrite() method as dead code.
        • Don't have a constant EMPTY_INDEXREADER, instead EMPTY_SEARCHER, and have it explicitly disable caching.
        Show
        Robert Muir added a comment - Can we please cleanup the postingshighlighter changes? Don't leave the now-unused rewrite() method as dead code. Don't have a constant EMPTY_INDEXREADER, instead EMPTY_SEARCHER, and have it explicitly disable caching.
        Hide
        Adrien Grand added a comment -

        Here is a cleaned-up patch as requested by Robert.

        Show
        Adrien Grand added a comment - Here is a cleaned-up patch as requested by Robert.
        Hide
        Robert Muir added a comment -

        I am +1 but concerned about the change to TermContext. the plumbing involved there should only happen from FuzzyLikeThisQuery... And i agree, the way it lies about docFreq, totalTermFreq should be unsupported, but i don't understand how this change discovered/required it.

        Show
        Robert Muir added a comment - I am +1 but concerned about the change to TermContext. the plumbing involved there should only happen from FuzzyLikeThisQuery... And i agree, the way it lies about docFreq, totalTermFreq should be unsupported, but i don't understand how this change discovered/required it.
        Hide
        Adrien Grand added a comment -

        the plumbing involved there should only happen from FuzzyLikeThisQuery...

        I have tests failing that do not use FuzzyLikeThisQuery. If you apply this patch and comment out the reset of totalTermFreq then eg. HighlighterSearchTests.testGetFuzzyFragments will fail. If I understand the issue correctly what happens is that the FuzzyQuery is rewritten against the main index using SCORING_BOOLEAN_REWRITE which in-turn creates a TermQuery using the `TermQuery(Term t, TermContext states)` (which sets the docFreq explicitely to 3) and then for highlighting purposes, the query is executed against the memory index which has a ttf and a df of 1. So the following line of TermQuery.createWeight is called if (docFreq != -1) termState.setDocFreq(docFreq); and then it breaks an assertion in TermStatistics' constructor: assert totalTermFreq == -1 || totalTermFreq >= docFreq; // #positions must be >= #postings because we overrode the doc freq but did not care to update the total term freq to a consistent value.

        Show
        Adrien Grand added a comment - the plumbing involved there should only happen from FuzzyLikeThisQuery... I have tests failing that do not use FuzzyLikeThisQuery. If you apply this patch and comment out the reset of totalTermFreq then eg. HighlighterSearchTests.testGetFuzzyFragments will fail. If I understand the issue correctly what happens is that the FuzzyQuery is rewritten against the main index using SCORING_BOOLEAN_REWRITE which in-turn creates a TermQuery using the `TermQuery(Term t, TermContext states)` (which sets the docFreq explicitely to 3) and then for highlighting purposes, the query is executed against the memory index which has a ttf and a df of 1. So the following line of TermQuery.createWeight is called if (docFreq != -1) termState.setDocFreq(docFreq); and then it breaks an assertion in TermStatistics' constructor: assert totalTermFreq == -1 || totalTermFreq >= docFreq; // #positions must be >= #postings because we overrode the doc freq but did not care to update the total term freq to a consistent value.
        Hide
        Robert Muir added a comment -

        Then there are bugs here.

        Apparently there are 3 ways to use termquery, one avoids reseeking the terms dict, another lies about docfreq, and other is normal usage. The first two are mixing each other up and we have broken statistics... that should only be the case for FuzzyLikeThis. so something is really wrong.

        Show
        Robert Muir added a comment - Then there are bugs here. Apparently there are 3 ways to use termquery, one avoids reseeking the terms dict, another lies about docfreq, and other is normal usage. The first two are mixing each other up and we have broken statistics... that should only be the case for FuzzyLikeThis. so something is really wrong.
        Hide
        ASF subversion and git services added a comment -

        Commit 1674091 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1674091 ]

        LUCENE-6425: Replaced Query.extractTerms with Weight.extractTerms.

        Show
        ASF subversion and git services added a comment - Commit 1674091 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1674091 ] LUCENE-6425 : Replaced Query.extractTerms with Weight.extractTerms.
        Hide
        ASF subversion and git services added a comment -

        Commit 1674100 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1674100 ]

        LUCENE-6425: Replaced Query.extractTerms with Weight.extractTerms.

        Show
        ASF subversion and git services added a comment - Commit 1674100 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1674100 ] LUCENE-6425 : Replaced Query.extractTerms with Weight.extractTerms.
        Hide
        Anshum Gupta added a comment -

        Bulk close for 5.2.0.

        Show
        Anshum Gupta added a comment - Bulk close for 5.2.0.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development