Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently we just require term filters to be used a lot in order to cache them. Maybe instead we should look into never caching them. This should not hurt performance since term filters are plenty fast, and would make other filters more likely to be cached since we would not "pollute" the history with filters that are not worth caching.

      1. LUCENE-7680.patch
        6 kB
        Adrien Grand
      2. LUCENE-7680.patch
        3 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch.

        Show
        jpountz Adrien Grand added a comment - Here is a patch.
        Hide
        jpountz Adrien Grand added a comment -

        For the record, this would still allow compound filters that wrap term filters to be cached (eg. a BooleanQuery or a ToParentBlockJoinQuery).

        Show
        jpountz Adrien Grand added a comment - For the record, this would still allow compound filters that wrap term filters to be cached (eg. a BooleanQuery or a ToParentBlockJoinQuery).
        Hide
        mikemccand Michael McCandless added a comment -

        +1

        Show
        mikemccand Michael McCandless added a comment - +1
        Hide
        dsmiley David Smiley added a comment -

        The hard coded rules here (and others) concern me. If I override shouldCache I should hopefully be able to have my own different rules. I see that this method invokes a frequency(...) method and minFrequencyToCache(...) pair of methods; the former is package level. It should be protected? Granted I could implement minFrequencyToCache and return Integer.MAX_VALUE.

        Curious; did you consider marking TermFilter as "cheap"?

        Show
        dsmiley David Smiley added a comment - The hard coded rules here (and others) concern me. If I override shouldCache I should hopefully be able to have my own different rules. I see that this method invokes a frequency(...) method and minFrequencyToCache(...) pair of methods; the former is package level. It should be protected? Granted I could implement minFrequencyToCache and return Integer.MAX_VALUE . Curious; did you consider marking TermFilter as "cheap"?
        Hide
        jpountz Adrien Grand added a comment -

        I see this class as a default set of heuristics that should work well for most use-cases. If someone wants something more specific, I think the way to go should be to write a new impl, the API should be pretty simple to implement? As it stands, the class is indeed not designed for inheritance: in addition to those pkg-private methods, it is final.

        Granted I could implement minFrequencyToCache and return Integer.MAX_VALUE.

        Requiring that a filter has been seen Integer.MAX_VALUE times would indeed make it never cached. However this change goes a bit further in the case of term filters: it also does not add them to the history, which makes other filters more likely of being cached than they are today. To take an extreme example, say you have a query with 100 term filters and 1 other filter (which is not a term). Even if that other filter was the same in every query, it would never get cached because term queries "pollute" the history (we only keep track of the last 256 used filters) and that other filter would only occur at most twice in the history. By not putting term filters in the history of recently used filters, then Lucene would be more likely to notice that that other filter gets reused all the time.

        Curious; did you consider marking TermFilter as "cheap"?

        What do you mean? Maybe it is the cause of the confusion, but when I say term filter, I mean a TermQuery that is consumed with needsScores=false.

        Show
        jpountz Adrien Grand added a comment - I see this class as a default set of heuristics that should work well for most use-cases. If someone wants something more specific, I think the way to go should be to write a new impl, the API should be pretty simple to implement? As it stands, the class is indeed not designed for inheritance: in addition to those pkg-private methods, it is final. Granted I could implement minFrequencyToCache and return Integer.MAX_VALUE. Requiring that a filter has been seen Integer.MAX_VALUE times would indeed make it never cached. However this change goes a bit further in the case of term filters: it also does not add them to the history, which makes other filters more likely of being cached than they are today. To take an extreme example, say you have a query with 100 term filters and 1 other filter (which is not a term). Even if that other filter was the same in every query, it would never get cached because term queries "pollute" the history (we only keep track of the last 256 used filters) and that other filter would only occur at most twice in the history. By not putting term filters in the history of recently used filters, then Lucene would be more likely to notice that that other filter gets reused all the time. Curious; did you consider marking TermFilter as "cheap"? What do you mean? Maybe it is the cause of the confusion, but when I say term filter, I mean a TermQuery that is consumed with needsScores=false.
        Hide
        dsmiley David Smiley added a comment -

        Thanks; for a moment there I confused this with the old TermsFilter now called TermInSetQuery. And I am aware of that using a high integer frequency to avoid caching will nonetheless pollute the cache, which is bad. I guess it'd be easy to write an implementation that delegates to a UsageTrackingQueryCachingPolicy. What do you think about making UsageTrackingQueryCachingPolicy non-final so that it could be easily overridden and minFrequencyToCache might be customized by users?

        Show
        dsmiley David Smiley added a comment - Thanks; for a moment there I confused this with the old TermsFilter now called TermInSetQuery . And I am aware of that using a high integer frequency to avoid caching will nonetheless pollute the cache, which is bad. I guess it'd be easy to write an implementation that delegates to a UsageTrackingQueryCachingPolicy . What do you think about making UsageTrackingQueryCachingPolicy non-final so that it could be easily overridden and minFrequencyToCache might be customized by users?
        Hide
        jpountz Adrien Grand added a comment -

        Here is an updated patch. David, does it work better for you?

        Show
        jpountz Adrien Grand added a comment - Here is an updated patch. David, does it work better for you?
        Hide
        dsmiley David Smiley added a comment -

        +1 Nice documentation too.

        Show
        dsmiley David Smiley added a comment - +1 Nice documentation too.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 063954ce79f1e4babd22cc177b619c48136766d6 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=063954c ]

        LUCENE-7680: Never cache term filters.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 063954ce79f1e4babd22cc177b619c48136766d6 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=063954c ] LUCENE-7680 : Never cache term filters.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5682c33b5c3444905e2638d735f935d198c4cea6 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5682c33 ]

        LUCENE-7680: Never cache term filters.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5682c33b5c3444905e2638d735f935d198c4cea6 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5682c33 ] LUCENE-7680 : Never cache term filters.

          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development