Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene already has filter caching abilities through CachingWrapperFilter, but CachingWrapperFilter requires you to know which filters you want to cache up-front.

      Caching filters is not trivial. If you cache too aggressively, then you slow things down since you need to iterate over all documents that match the filter in order to load it into an in-memory cacheable DocIdSet. On the other hand, if you don't cache at all, you are potentially missing interesting speed-ups on frequently-used filters.

      Something that would be nice would be to have a generic filter cache that would track usage for individual filters and make the decision to cache or not a filter on a given segments based on usage statistics and various heuristics, such as:

      • the overhead to cache the filter (for instance some filters produce DocIdSets that are already cacheable)
      • the cost to build the DocIdSet (the getDocIdSet method is very expensive on some filters such as MultiTermQueryWrapperFilter that potentially need to merge lots of postings lists)
      • the segment we are searching on (flush segments will likely be merged right away so it's probably not worth building a cache on such segments)
      1. LUCENE-6077.patch
        54 kB
        Adrien Grand
      2. LUCENE-6077.patch
        52 kB
        Adrien Grand
      3. LUCENE-6077.patch
        41 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is a patch. It divides the work into 2 pieces:

        • FilterCache whose responsibility is to act as a per-segment cache for filters but doesn't make any decision about which filters should be cached
        • FilterCachingPolicy, whose responsibility is to decide about whether a filter is worth caching given the filter itself, the current segment and the produced (uncached) DocIdSet.

        FilterCache has an implementation called LRUFilterCache that accepts a maximum size (number of cached filters) and ram usage and is going to evict least-recently-used filters first. It has some protected methods that allow to configure which impl should be used to cache DocIdSets (RoaringDocIdSet by default), and how to measure ram usage of filters (the default impl uses Accountable#ramBytesUsed if the filter implements Accountable, and falls back to an arbitrary constant (1024) otherwise).

        FilterCachingPolicy has an implementation called UsageTrackingFilterCachingPolicy that tries to provide sensible defaults:

        • it tracks the 256 most recently used filters (through their hash codes) globally (not per segment)
        • it only caches on segments whose source is a merge or addIndexes (not flushes)
        • it uses some heuristics to decide how many times a filter should appear in the history of 256 filters in order to be cached.

        The filter caching policy can be configured on a per-filter basis, so that even if there are filters that you want to cache more aggressively than others, it is possible to cache them all in a single FilterCache instance.

        Show
        Adrien Grand added a comment - Here is a patch. It divides the work into 2 pieces: FilterCache whose responsibility is to act as a per-segment cache for filters but doesn't make any decision about which filters should be cached FilterCachingPolicy, whose responsibility is to decide about whether a filter is worth caching given the filter itself, the current segment and the produced (uncached) DocIdSet. FilterCache has an implementation called LRUFilterCache that accepts a maximum size (number of cached filters) and ram usage and is going to evict least-recently-used filters first. It has some protected methods that allow to configure which impl should be used to cache DocIdSets (RoaringDocIdSet by default), and how to measure ram usage of filters (the default impl uses Accountable#ramBytesUsed if the filter implements Accountable, and falls back to an arbitrary constant (1024) otherwise). FilterCachingPolicy has an implementation called UsageTrackingFilterCachingPolicy that tries to provide sensible defaults: it tracks the 256 most recently used filters (through their hash codes) globally (not per segment) it only caches on segments whose source is a merge or addIndexes (not flushes) it uses some heuristics to decide how many times a filter should appear in the history of 256 filters in order to be cached. The filter caching policy can be configured on a per-filter basis, so that even if there are filters that you want to cache more aggressively than others, it is possible to cache them all in a single FilterCache instance.
        Hide
        Robert Muir added a comment -

        This looks great!

        Do we really need to default CachingWrapperFilter to a "stupid" policy?
        Is there a better name for FilterCache.cache() method? it can be a noun or a verb, so its kind of confusing. Maybe doCache would be better?
        CachingWrapperFilter's new ctor: can we fix the typo?
        FilterCachingPolicy.onCache, can we correct the param name?

        Show
        Robert Muir added a comment - This looks great! Do we really need to default CachingWrapperFilter to a "stupid" policy? Is there a better name for FilterCache.cache() method? it can be a noun or a verb, so its kind of confusing. Maybe doCache would be better? CachingWrapperFilter's new ctor: can we fix the typo? FilterCachingPolicy.onCache, can we correct the param name?
        Hide
        Adrien Grand added a comment -

        Updated patch:

        • CachingWrapperFilter now uses a policy that only caches on merged segments by default (instead of all segments)
        • applied other suggestions about typos/naming
        Show
        Adrien Grand added a comment - Updated patch: CachingWrapperFilter now uses a policy that only caches on merged segments by default (instead of all segments) applied other suggestions about typos/naming
        Hide
        Adrien Grand added a comment -

        Updated patch:

        • fixed LRUFilterCache.getChildResources to not make a copy of the cache (since Accountables.namedAccountables already takes care of taking a snapshot)
        • replaced the heuristics based on the segment source in the diagnostics by a heuristic on the segment size compared to the size of the top-level context. This should give better results since merged segments are not necessarily large and flushed segments can be large if you have large IW buffers.

        I think it's ready?

        Show
        Adrien Grand added a comment - Updated patch: fixed LRUFilterCache.getChildResources to not make a copy of the cache (since Accountables.namedAccountables already takes care of taking a snapshot) replaced the heuristics based on the segment source in the diagnostics by a heuristic on the segment size compared to the size of the top-level context. This should give better results since merged segments are not necessarily large and flushed segments can be large if you have large IW buffers. I think it's ready?
        Hide
        Robert Muir added a comment -

        +1

        Show
        Robert Muir added a comment - +1
        Hide
        ASF subversion and git services added a comment -

        Commit 1642183 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1642183 ]

        LUCENE-6077: Add a filter cache.

        Show
        ASF subversion and git services added a comment - Commit 1642183 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1642183 ] LUCENE-6077 : Add a filter cache.
        Hide
        ASF subversion and git services added a comment -

        Commit 1642185 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1642185 ]

        LUCENE-6077: Add a filter cache.

        Show
        ASF subversion and git services added a comment - Commit 1642185 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1642185 ] LUCENE-6077 : Add a filter cache.
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development