Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 9.0
    • None
    • None

    Description

      "fq" filter queries that have cache=false and which aren't processed as a PostFilter (thus either aren't a PostFilter or have a cost < 100) are processed in SolrIndexSearcher using a custom Filter thingy which uses a cost-ordered series of DocIdSetIterators. This is not TwoPhaseIterator aware, and thus the match() method may be called on docs that ideally would have been filtered by lower-cost filter queries.

      Attachments

        Issue Links

          Activity

            dsmiley David Smiley added a comment -

            The PR has the code details but I want to mention some more bigger picture here.

            I have this as a sub-task of Remove/refactor Filter because this reduces the use of the old Filter abstraction. SolrIndexSearcher.ProcessedFilter.filter is now declared as a Query. SolrIndexSearcher no longer has FilterImpl. Now that pf.filter is a Query, this allowed for SolrIndexSearcher.getDocSet(List<Query> fqs) to be simpler and allowed me to remove the similar getDocSetScore.

            So how is TwoPhaseIterator used efficiently you may ask? BooleanQuery's FILTER clauses use this internally via ConjunctionDISI. I modified SolrIndexSearcher.getProcessedFilter to create a BooleanQuery with these FILTER clauses for the non-cached queries.

            Unfortunately we lose the ability for the "cost" param on these non-cached filter queries to have meaning. Instead, the Queries themselves and any TPIs they may have ought to have suitable costs, and they are not externally configurable. Maybe we could make a wrapping query that wraps the underlying TPI.matchCost... or just not bother, letting the queries themselves actually compute an internal cost that is perhaps better than whatever the user supplies. I lean this way; less complexity. Unfortunately, ValueSourceScorer's TPI matchCost is a constant 100 instead of varying based on the particular FunctionValues implementation. That should be its own issue to address.

            dsmiley David Smiley added a comment - The PR has the code details but I want to mention some more bigger picture here. I have this as a sub-task of Remove/refactor Filter because this reduces the use of the old Filter abstraction. SolrIndexSearcher.ProcessedFilter.filter is now declared as a Query. SolrIndexSearcher no longer has FilterImpl. Now that pf.filter is a Query, this allowed for SolrIndexSearcher.getDocSet(List<Query> fqs) to be simpler and allowed me to remove the similar getDocSetScore. So how is TwoPhaseIterator used efficiently you may ask? BooleanQuery's FILTER clauses use this internally via ConjunctionDISI. I modified SolrIndexSearcher.getProcessedFilter to create a BooleanQuery with these FILTER clauses for the non-cached queries. Unfortunately we lose the ability for the "cost" param on these non-cached filter queries to have meaning. Instead, the Queries themselves and any TPIs they may have ought to have suitable costs, and they are not externally configurable. Maybe we could make a wrapping query that wraps the underlying TPI.matchCost... or just not bother, letting the queries themselves actually compute an internal cost that is perhaps better than whatever the user supplies. I lean this way; less complexity. Unfortunately, ValueSourceScorer's TPI matchCost is a constant 100 instead of varying based on the particular FunctionValues implementation. That should be its own issue to address.
            dsmiley David Smiley added a comment -

            I filed LUCENE-9114 to get the matchCost on ValueSource/FunctionValues API. It's perhaps a "required" dependency to get reasonable performance.

            dsmiley David Smiley added a comment - I filed LUCENE-9114 to get the matchCost on ValueSource/FunctionValues API. It's perhaps a "required" dependency to get reasonable performance.
            dsmiley David Smiley added a comment -

            CC yonik jbernste hossman as possible reviewers for this attached PR which is rather technical into code which few people have touched but you all three in some shape/form. Please review the issue description, and take a look at the PR. In the PR, each commit is well isolated to the what the commit message says, so you may prefer to go commit-by-commit, or you could just look at the thing as a whole. In a comment above I pondered "Maybe we could make a wrapping query that wraps the underlying TPI.matchCost"; as you'll see in the PR, I did that. The test works in validating that match() isn't called more than it needs to be. It used to be called more which is verifiable by copying the test to the 8x line (if I recall, it was called two additional times). I suspect the test doesn't test that MatchCostQuery is having an effect... I may need to think a bit more on how to do that.

            I suspect someone will ask me if I did some performance tests. No I did not. My goal is removal of tech debt – Filter, and in the process expect some performance improvements that Filter was blocking. So in this issue, anyone with non-cached filter queries may see a benefit, especially when those queries have TwoPhaseIterators (phrase queries, frange, spatial, more). The benefit may be further pronounced if the main query also has TPIs because Lucene cleverly sees through the boolean queries to group the TPIs of required clauses in the tree.

            dsmiley David Smiley added a comment - CC yonik jbernste hossman as possible reviewers for this attached PR which is rather technical into code which few people have touched but you all three in some shape/form. Please review the issue description, and take a look at the PR. In the PR, each commit is well isolated to the what the commit message says, so you may prefer to go commit-by-commit, or you could just look at the thing as a whole. In a comment above I pondered "Maybe we could make a wrapping query that wraps the underlying TPI.matchCost"; as you'll see in the PR, I did that. The test works in validating that match() isn't called more than it needs to be. It used to be called more which is verifiable by copying the test to the 8x line (if I recall, it was called two additional times). I suspect the test doesn't test that MatchCostQuery is having an effect... I may need to think a bit more on how to do that. I suspect someone will ask me if I did some performance tests. No I did not. My goal is removal of tech debt – Filter, and in the process expect some performance improvements that Filter was blocking. So in this issue, anyone with non-cached filter queries may see a benefit, especially when those queries have TwoPhaseIterators (phrase queries, frange, spatial, more). The benefit may be further pronounced if the main query also has TPIs because Lucene cleverly sees through the boolean queries to group the TPIs of required clauses in the tree.
            dsmiley David Smiley added a comment -

            I did a bit of benchmarking. Depending on the scenario, things can be a bit better (possible a lot) or worse, but never much worse. I filed LUCENE-9938 which may help further. I'll commit this in a couple days.

            dsmiley David Smiley added a comment - I did a bit of benchmarking. Depending on the scenario, things can be a bit better (possible a lot) or worse, but never much worse. I filed LUCENE-9938 which may help further. I'll commit this in a couple days.

            Commit b27b58704c84b67375080a6e9f818fb7987c203b in solr's branch refs/heads/main from David Smiley
            [ https://gitbox.apache.org/repos/asf?p=solr.git;h=b27b587 ]

            SOLR-14166 fq cache=false should use TwoPhaseIterator (#57)

            instead of custom leapfrogging iterator thing, FilterImpl.

            • Simplify SolrIndexSearcher.getDocSet and makeBitDocSet
            • ExtendedQuery: remove getCacheSep & setCacheSep which never did anything
            • Javadocs: Clarify ExtendedQuery vs PostFilter
            • Add MatchCostQuery wrapper so that the cost local-param to "fq" is used for TPI.matchCost
            jira-bot ASF subversion and git services added a comment - Commit b27b58704c84b67375080a6e9f818fb7987c203b in solr's branch refs/heads/main from David Smiley [ https://gitbox.apache.org/repos/asf?p=solr.git;h=b27b587 ] SOLR-14166 fq cache=false should use TwoPhaseIterator (#57) instead of custom leapfrogging iterator thing, FilterImpl. Simplify SolrIndexSearcher.getDocSet and makeBitDocSet ExtendedQuery: remove getCacheSep & setCacheSep which never did anything Javadocs: Clarify ExtendedQuery vs PostFilter Add MatchCostQuery wrapper so that the cost local-param to "fq" is used for TPI.matchCost
            mdrob Mike Drob added a comment -

            Can you describe (and give some example queries) about what performed much better/somewhat better/somewhat worse? I'm not looking for full tables, but just a general idea of what we give up and what we gained.

            Do we need to update the ref guide, especially around https://solr.apache.org/guide/8_8/common-query-parameters.html#cache-parameter ?

            mdrob Mike Drob added a comment - Can you describe (and give some example queries) about what performed much better/somewhat better/somewhat worse? I'm not looking for full tables, but just a general idea of what we give up and what we gained. Do we need to update the ref guide, especially around https://solr.apache.org/guide/8_8/common-query-parameters.html#cache-parameter  ?
            dsmiley David Smiley added a comment -

            Right; I just noticed those docs are obsolete while I was working on sibling issues. I will propose an update to the docs in a new PR.
            I'll follow-up this week sometime with some more specific query examples. My notes from a couple weeks ago are terrible .

            dsmiley David Smiley added a comment - Right; I just noticed those docs are obsolete while I was working on sibling issues. I will propose an update to the docs in a new PR. I'll follow-up this week sometime with some more specific query examples. My notes from a couple weeks ago are terrible .

            Commit a264165db4bd2173d259275a12d6090634dac28e in solr's branch refs/heads/SOLR-14166-docs from David Smiley
            [ https://gitbox.apache.org/repos/asf?p=solr.git;h=a264165 ]

            SOLR-14166: ref guide docs

            jira-bot ASF subversion and git services added a comment - Commit a264165db4bd2173d259275a12d6090634dac28e in solr's branch refs/heads/ SOLR-14166 -docs from David Smiley [ https://gitbox.apache.org/repos/asf?p=solr.git;h=a264165 ] SOLR-14166 : ref guide docs

            Commit abdca25b86c32064689c749f53715e8eed04a1da in solr's branch refs/heads/main from David Smiley
            [ https://gitbox.apache.org/repos/asf?p=solr.git;h=abdca25 ]

            SOLR-14166: update and refactor docs on cache local-param (#113)

            The "cache" is a local-param, not top level. It's nearly always used with "fq" so move there.
            Update the fact that "cost" is just a hint now.
            spatial-search.adoc: remove PostFilter reference that is now inconsequential (a good thing).

            jira-bot ASF subversion and git services added a comment - Commit abdca25b86c32064689c749f53715e8eed04a1da in solr's branch refs/heads/main from David Smiley [ https://gitbox.apache.org/repos/asf?p=solr.git;h=abdca25 ] SOLR-14166 : update and refactor docs on cache local-param (#113) The "cache" is a local-param, not top level. It's nearly always used with "fq" so move there. Update the fact that "cost" is just a hint now. spatial-search.adoc: remove PostFilter reference that is now inconsequential (a good thing).
            janhoy Jan Høydahl added a comment -

            Closing after the 9.0.0 release

            janhoy Jan Høydahl added a comment - Closing after the 9.0.0 release

            People

              dsmiley David Smiley
              dsmiley David Smiley
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 50m
                  1h 50m