Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: search
    • Labels:
      None

      Description

      There is a need for negative filter queries to avoid long filter generation times and large caching requirements.

      Currently, if someone wants to filter out a small number of documents, they must specify the complete set of documents to express those negative conditions against.

      q=foo&fq=id:[* TO *] -id:101

      In this example, to filter out a single document, the complete set of documents (minus one) is generated, and a large bitset is cached. You could also add the restriction to the main query, but that doesn't work with the dismax handler which doesn't have a facility for this.

      1. negative_filters.patch
        20 kB
        Yonik Seeley
      2. negative_filters.patch
        11 kB
        Yonik Seeley

        Activity

        Hide
        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.2

        The Fix Version for all 39 issues found was set to 1.2, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman2

        Show
        Hoss Man added a comment - This bug was modified as part of a bulk update using the criteria... Marked ("Resolved" or "Closed") and "Fixed" Had no "Fix Version" versions Was listed in the CHANGES.txt for 1.2 The Fix Version for all 39 issues found was set to 1.2, email notification was suppressed to prevent excessive email. For a list of all the issues modified, search jira comments for this (hopefully) unique string: 20080415hossman2
        Hide
        Mike Klaas added a comment -

        Hoss: thanks for the explanation.

        I might throw this in our production code this week and see how it fares.

        Show
        Mike Klaas added a comment - Hoss: thanks for the explanation. I might throw this in our production code this week and see how it fares.
        Hide
        Mike Klaas added a comment -

        Surely Hoss' example doesn't use matchAllDocs--he has a positive query in both cases.

        Show
        Mike Klaas added a comment - Surely Hoss' example doesn't use matchAllDocs--he has a positive query in both cases.
        Hide
        Hoss Man added a comment -

        I was strating to think the same thing as Mike: but doing more testing i seee what yonik's refering to (note to self: test more then one query when doing cache testing) ... only the first use of a negative query results in the double insert .. afterthat every thing is golden.

        Mike: i think the key is that unless faceting is turned on, the StandardRequestHandler only calls getDocList, not getDocListAndSet ... so by the time the call makes it to getDocListC the falgs never contain GET_DOCSET, so the main query isn't included in the list passed to getDocSet.

        Show
        Hoss Man added a comment - I was strating to think the same thing as Mike: but doing more testing i seee what yonik's refering to (note to self: test more then one query when doing cache testing) ... only the first use of a negative query results in the double insert .. afterthat every thing is golden. Mike: i think the key is that unless faceting is turned on, the StandardRequestHandler only calls getDocList, not getDocListAndSet ... so by the time the call makes it to getDocListC the falgs never contain GET_DOCSET, so the main query isn't included in the list passed to getDocSet.
        Hide
        Mike Klaas added a comment -

        I think this is due to the last line of this fragment of the patch:

        protected DocSet getDocSet(List<Query> queries) throws IOException {
        + if (queries==null) return null;
        + if (queries.size()==1) return getDocSet(queries.get(0));
        DocSet answer=null;

        • if (queries==null) return null;
        • for (Query q : queries) {
        • if (answer==null) {
        • answer = getDocSet(q);
          +
          + boolean[] neg = new boolean[queries.size()];
          + DocSet[] sets = new DocSet[queries.size()];
          +
          + int smallestIndex = -1;
          + int smallestCount = Integer.MAX_VALUE;
          + for (int i=0; i<sets.length; i++) {
          + Query q = queries.get;
          + Query posQuery = QueryUtils.getAbs(q);
          + sets[i] = getPositiveDocSet(posQuery);

        getPositiveDocSet() caches all docsets returned, so both the query part and the filter part would be cached in the filterCache.

        Show
        Mike Klaas added a comment - I think this is due to the last line of this fragment of the patch: protected DocSet getDocSet(List<Query> queries) throws IOException { + if (queries==null) return null; + if (queries.size()==1) return getDocSet(queries.get(0)); DocSet answer=null; if (queries==null) return null; for (Query q : queries) { if (answer==null) { answer = getDocSet(q); + + boolean[] neg = new boolean [queries.size()] ; + DocSet[] sets = new DocSet [queries.size()] ; + + int smallestIndex = -1; + int smallestCount = Integer.MAX_VALUE; + for (int i=0; i<sets.length; i++) { + Query q = queries.get ; + Query posQuery = QueryUtils.getAbs(q); + sets [i] = getPositiveDocSet(posQuery); getPositiveDocSet() caches all docsets returned, so both the query part and the filter part would be cached in the filterCache.
        Hide
        Yonik Seeley added a comment -

        You are seeing a MatchAllDocsQuery filter.

        If getDocSet(List<Query>) is called with a single negative query, or
        or getDocSet(Query, Filter) is called with a null filter and a negative query, we call getDocSet(MatchAllDocsQuery)
        to use as a base to andNot the passed query.

        If you continue your example with fq=+memory and fq=-memory, you will see what you expect (only one new filter).

        Show
        Yonik Seeley added a comment - You are seeing a MatchAllDocsQuery filter. If getDocSet(List<Query>) is called with a single negative query, or or getDocSet(Query, Filter) is called with a null filter and a negative query, we call getDocSet(MatchAllDocsQuery) to use as a base to andNot the passed query. If you continue your example with fq=+memory and fq=-memory, you will see what you expect (only one new filter).
        Hide
        Hoss Man added a comment -

        i think there may be a caching bugwith this... if you load up the example schema, populate it, and then stop/start the port so all of the caches are empty, and hit this URL...

        http://localhost:8983/solr/select/?q=cat%3Aelectronics&version=2.2&start=0&fl=score,name&fq=samsung

        you see that the lookup/insert/size for both the queryResultsCache and filterCaches are 1 (as expected)

        if you then hit this URL...

        http://localhost:8983/solr/select/?q=cat%3Aelectronics&version=2.2&start=0&fl=score,name&fq=-samsung

        the queryResultCache changes predictably to a size of 2 (because the filters are part of the cache key) but the size of the filterCache has also grown to 2 ... which doesn't make sense since the patch is suppose to be caching the inverse of negative queries (fq=samsung and fq=-samsung should be a cahce hit right?)

        if you stop/start the port and only hit this URL...

        http://localhost:8983/solr/select/?q=cat%3Aelectronics&version=2.2&start=0&fl=score,name&fq=-samsung

        ...you'll see that the filterCache gets 2 lookups, and 2 insertions.

        I wouldn't think this is wrong, except that the patch explicitly says "// cache negative queries as positive"

        Show
        Hoss Man added a comment - i think there may be a caching bugwith this... if you load up the example schema, populate it, and then stop/start the port so all of the caches are empty, and hit this URL... http://localhost:8983/solr/select/?q=cat%3Aelectronics&version=2.2&start=0&fl=score,name&fq=samsung you see that the lookup/insert/size for both the queryResultsCache and filterCaches are 1 (as expected) if you then hit this URL... http://localhost:8983/solr/select/?q=cat%3Aelectronics&version=2.2&start=0&fl=score,name&fq=-samsung the queryResultCache changes predictably to a size of 2 (because the filters are part of the cache key) but the size of the filterCache has also grown to 2 ... which doesn't make sense since the patch is suppose to be caching the inverse of negative queries (fq=samsung and fq=-samsung should be a cahce hit right?) if you stop/start the port and only hit this URL... http://localhost:8983/solr/select/?q=cat%3Aelectronics&version=2.2&start=0&fl=score,name&fq=-samsung ...you'll see that the filterCache gets 2 lookups, and 2 insertions. I wouldn't think this is wrong, except that the patch explicitly says "// cache negative queries as positive"
        Hide
        Yonik Seeley added a comment -

        committed.
        Thanks for the review Mike!

        Show
        Yonik Seeley added a comment - committed. Thanks for the review Mike!
        Hide
        Mike Klaas added a comment -

        Took a close look at the patch and I think it all looks good!

        The query==absQuery test for query negativity reads just mysteriously enough that it might be worth putting in a small comment when it is used (or at least once in SOlrIndexSearch.java).

        Show
        Mike Klaas added a comment - Took a close look at the patch and I think it all looks good! The query==absQuery test for query negativity reads just mysteriously enough that it might be worth putting in a small comment when it is used (or at least once in SOlrIndexSearch.java).
        Hide
        Yonik Seeley added a comment -

        New working version attached with tests.
        Negative queries (like -id:10) should now work everywhere for filters and queries.

        Show
        Yonik Seeley added a comment - New working version attached with tests. Negative queries (like -id:10) should now work everywhere for filters and queries.
        Hide
        Yonik Seeley added a comment -

        attached draft (it doesn't work yet, and there isn't any test code).

        Show
        Yonik Seeley added a comment - attached draft (it doesn't work yet, and there isn't any test code).
        Hide
        Yonik Seeley added a comment -

        From an interface point of view, I'm heavily leaning toward getting rid of the restriction of lucene queries having to be all negative. This would allow using a negative-only query anywhere one currently can use a positive query.

        One could simply and naturally do fq=-id:10 to filter out a single document.

        Show
        Yonik Seeley added a comment - From an interface point of view, I'm heavily leaning toward getting rid of the restriction of lucene queries having to be all negative. This would allow using a negative-only query anywhere one currently can use a positive query. One could simply and naturally do fq=-id:10 to filter out a single document.
        Hide
        Hoss Man added a comment -

        With ConstantScoreQuery and QueryFilter it's basically possible to do any negating clauses as either a boolean clause or as a filter ... the big trade off decission really comes down to how big is the negated set, and is it worth caching independently of the main query.

        if you're trying to exclude only a handful of items by ID, then BooleanClauses make a lot of sense based on the way the BooleanQuery Scorers work ... but if you are trying to exclude 9/10ths of the documents in your index using a complicated query expression, it starts being more worth while to generate that set once and cache it for reuse as a Filter – but there are still trade off questions to be asked about wether or not caching the set -A is as efficient as caching the set A (from a memory standpoint)

        Show
        Hoss Man added a comment - With ConstantScoreQuery and QueryFilter it's basically possible to do any negating clauses as either a boolean clause or as a filter ... the big trade off decission really comes down to how big is the negated set, and is it worth caching independently of the main query. if you're trying to exclude only a handful of items by ID, then BooleanClauses make a lot of sense based on the way the BooleanQuery Scorers work ... but if you are trying to exclude 9/10ths of the documents in your index using a complicated query expression, it starts being more worth while to generate that set once and cache it for reuse as a Filter – but there are still trade off questions to be asked about wether or not caching the set -A is as efficient as caching the set A (from a memory standpoint)
        Hide
        Paul Elschot added a comment -

        With Lucene filters as Matchers:
        http://issues.apache.org/jira/browse/LUCENE-584
        and the possibility to add a Matcher to a Lucene BooleanQuery
        as excluded for a negative filter (or as required for a positive filter),
        this could be implemented efficiently in Lucene's BooleanQuery.

        Show
        Paul Elschot added a comment - With Lucene filters as Matchers: http://issues.apache.org/jira/browse/LUCENE-584 and the possibility to add a Matcher to a Lucene BooleanQuery as excluded for a negative filter (or as required for a positive filter), this could be implemented efficiently in Lucene's BooleanQuery.
        Hide
        Yonik Seeley added a comment -

        One potential issue to watch out for...
        DocSet(": -id:foo") is not the same as not(DocSet(id:foo))
        That's normally fine, depending on how the set is used.

        Show
        Yonik Seeley added a comment - One potential issue to watch out for... DocSet(" : -id:foo") is not the same as not(DocSet(id:foo)) That's normally fine, depending on how the set is used.

          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            1 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development