Solr / SOLR-3763

Make Solr use Lucene filters directly


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0, 4.1, 6.0
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels: None

    Description

      Presently Solr uses bitsets, queries, and collectors to implement the concept of filters. This has proven very powerful, but it comes at the cost of introducing a large body of code into Solr, making it harder to optimise and maintain.

      Another issue is that filters currently cache sub-optimally, given Lucene's move towards atomic readers.

      Rather than patch these issues, this is an attempt to rework the filters in Solr to leverage the Filter subsystem from Lucene as much as possible.

      In good time the aim is to get this to do the following:

      ∘ Handle setting up filter implementations that are able to cache correctly with reference to the AtomicReader they are caching for, rather than to the entire index at large
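      The per-reader caching idea can be sketched in plain Java. This is a toy illustration, not Solr code: the class name is hypothetical, a plain Object key stands in for an AtomicReader's core cache key, and java.util.BitSet stands in for Lucene's DocIdSet. The point is that on reopen, only new segments need their filter bits recomputed.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Toy sketch: cache one BitSet per segment key (standing in for an
// AtomicReader), so a reopened index only recomputes filters for the
// segments that actually changed.
class PerSegmentFilterCache {
    // WeakHashMap lets an entry disappear when its segment key is no longer
    // referenced, mirroring how per-reader caches avoid pinning closed segments.
    private final Map<Object, BitSet> cache = new WeakHashMap<>();

    BitSet getOrCompute(Object segmentKey, Function<Object, BitSet> compute) {
        return cache.computeIfAbsent(segmentKey, compute);
    }

    int size() {
        return cache.size();
    }
}
```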

      ∘ Get the post filters working. I am thinking this can be done via Lucene's chained filter, with the "expensive" filters placed towards the end of the chain. This has different internal semantics to the original implementation, but IMHO should produce the same results for end users
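      The chaining idea, cheap filters first so the expensive ones only see surviving documents, can be sketched as follows. Again a toy in plain Java with hypothetical names: an IntPredicate stands in for a filter's match test, an explicit cost integer stands in for Solr's user-supplied cost, and BitSet stands in for a DocIdSet.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Toy sketch: order filters by estimated cost and intersect the cheap ones
// first, so the "expensive" (post) filters are only consulted for documents
// that survived the earlier links in the chain.
class ChainedFilter {
    record CostedFilter(int cost, IntPredicate accepts) {}

    static BitSet apply(int maxDoc, List<CostedFilter> filters) {
        List<CostedFilter> ordered = new ArrayList<>(filters);
        ordered.sort(Comparator.comparingInt(CostedFilter::cost)); // cheapest first
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // start with all documents live
        for (CostedFilter f : ordered) {
            for (int doc = live.nextSetBit(0); doc >= 0; doc = live.nextSetBit(doc + 1)) {
                if (!f.accepts().test(doc)) {
                    live.clear(doc);
                }
            }
        }
        return live;
    }
}
```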

      ∘ Learn how to create filters that are potentially more efficient. At present Solr basically runs a simple query that gathers a DocSet of the documents we want filtered; it would be interesting to make use of filter implementations that are in theory faster than query filters (for instance, there are filters that are able to query the FieldCache)
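      To illustrate why such a filter can beat a query filter, here is a toy sketch in plain Java: a hypothetical FieldCacheFilter that scans an in-memory per-document value array (standing in for Lucene's FieldCache) directly, instead of executing a term query against the inverted index.

```java
import java.util.BitSet;

// Toy sketch: a "field cache" style filter reads an in-memory array of
// per-document field values directly, trading memory for a simple linear
// scan with no query machinery involved.
class FieldCacheFilter {
    static BitSet matching(int[] fieldValues, int wanted) {
        BitSet bits = new BitSet(fieldValues.length);
        for (int doc = 0; doc < fieldValues.length; doc++) {
            if (fieldValues[doc] == wanted) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```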

      ∘ Learn how to decompose filters so that a complex filter query can (potentially) be cached as its constituent parts. For example, the filter below currently needs love, care and feeding to ensure that the filter cache is not unduly stressed:

        'category:(100) OR category:(200) OR category:(300)'
      

      Really, there is no reason not to express this in a cached form as:

      BooleanFilter(
          FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
          FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
          FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
      )
      

      This would yield better cache usage, I think, as we could reuse DocSets across multiple queries, as well as avoid issues when the same filters are presented in differing orders
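      The reuse and order-independence properties above can be sketched with a toy decomposed OR filter in plain Java (hypothetical class name; a String term and a BitSet stand in for a Lucene Term and DocIdSet). Each single-term filter is cached on its own, so two requests listing the same categories in different orders hit exactly the same cached parts.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy sketch: cache each single-term filter separately, then OR the cached
// parts together. 'category:100 OR category:200' and
// 'category:200 OR category:100' reuse the same per-term cache entries.
class DecomposedOrFilter {
    private final Map<String, BitSet> termCache = new HashMap<>();

    BitSet filter(List<String> terms, Function<String, BitSet> matchTerm) {
        BitSet result = new BitSet();
        for (String term : terms) {
            result.or(termCache.computeIfAbsent(term, matchTerm));
        }
        return result;
    }

    int cachedEntries() {
        return termCache.size();
    }
}
```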

      ∘ Instead of end users providing costings, we might (and this is a big might, FWIW) be able to create a sort of execution plan for filters, leveraging a combination of what the index is able to tell us with sampling and "educated guesswork". In essence this is what some DBMS software does; PostgreSQL, for example, uses a genetic algorithm to attack its travelling-salesman-style planning problem, to great effect
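      A minimal sketch of the "educated guesswork" direction, with hypothetical names and plain Java: estimate each filter's selectivity by probing a small sample of document IDs, then run the most selective filter first. Real planning would of course use index statistics rather than this naive probe.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Toy sketch: estimate selectivity by sampling evenly spaced docIDs, then
// order filters so the one expected to match fewest documents runs first.
class FilterPlanner {
    static double estimateSelectivity(IntPredicate filter, int maxDoc, int samples) {
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            int doc = (int) ((long) i * maxDoc / samples); // evenly spaced probe
            if (filter.test(doc)) {
                hits++;
            }
        }
        return hits / (double) samples;
    }

    static List<IntPredicate> plan(List<IntPredicate> filters, int maxDoc) {
        List<IntPredicate> ordered = new ArrayList<>(filters);
        ordered.sort(Comparator.comparingDouble(f -> estimateSelectivity(f, maxDoc, 32)));
        return ordered; // most selective (fewest estimated matches) first
    }
}
```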

      ∘ I am sure I will come up with other ambitious ideas to plug in here ..... :S

      Patches are obviously forthcoming, but the bulk of the work can be followed here: https://github.com/GregBowyer/lucene-solr/commits/solr-uses-lucene-filters


          People

            Assignee: Greg Bowyer (gbowyer@fastmail.co.uk)
            Reporter: Greg Bowyer (gbowyer@fastmail.co.uk)
            Votes: 0
            Watchers: 8
