Lucene - Core
  1. Lucene - Core
  2. LUCENE-1911

When using QueryWrapperFilter with CachingWrapperFilter, QueryWrapperFilter returns a DocIdSet that creates a Scorer, which gets cached rather than a bit set

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      there is a large performance cost to this.

      The old impl for this type of thing, QueryFilter, recommends :

      @deprecated use a CachingWrapperFilter with QueryWrapperFilter

      The deprecated QueryFilter itself also suffers from the problem because its now implemented using a CachingWrapperFilter and QueryWrapperFilter.

      see http://search.lucidimagination.com/search/document/7f54715f14b8b7a/lucene_2_9_0rc4_slower_than_2_4_1

      1. LUCENE-1911.patch
        3 kB
        Uwe Schindler
      2. LUCENE-1911.patch
        10 kB
        Uwe Schindler
      3. LUCENE-1911.patch
        12 kB
        Uwe Schindler
      4. LUCENE-1911.patch
        18 kB
        Uwe Schindler
      5. LUCENE-1911.patch
        18 kB
        Uwe Schindler
      6. lucene_29_newapi_patched.png
        151 kB
        Thomas Becker

        Activity

        Hide
        Mark Miller added a comment -

        sounds like Uwe is working on a patch so that you can tell the CachingWrapperFilter to cache the results as a bitset (optionally), which will allow you to duplicated the deprecated QueryFilter.

        Show
        Mark Miller added a comment - sounds like Uwe is working on a patch so that you can tell the CachingWrapperFilter to cache the results as a bitset (optionally), which will allow you to duplicated the deprecated QueryFilter.
        Hide
        Mark Miller added a comment -

        whether we do an RC or not after this I don't know.

        Technically, if we don't want to - its not required. All thats required is that we get the 3 votes and no -1 votes with valid tech arguments.

        Show
        Mark Miller added a comment - whether we do an RC or not after this I don't know. Technically, if we don't want to - its not required. All thats required is that we get the 3 votes and no -1 votes with valid tech arguments.
        Hide
        Uwe Schindler added a comment -

        Here a patch, that implements my proposal.

        Thomas Becker: Could you apply this patch and test if itr resolves your speed problem?

        Show
        Uwe Schindler added a comment - Here a patch, that implements my proposal. Thomas Becker: Could you apply this patch and test if itr resolves your speed problem?
        Hide
        Mark Miller added a comment -

        Very nice Uwe! Patch looks very good. Nicely takes care of the QueryFilter issue as well.

        Show
        Mark Miller added a comment - Very nice Uwe! Patch looks very good. Nicely takes care of the QueryFilter issue as well.
        Hide
        Uwe Schindler added a comment -

        By the way, the added IOException to the protected method is no BW break, as the method is new in 2.9.

        Show
        Uwe Schindler added a comment - By the way, the added IOException to the protected method is no BW break, as the method is new in 2.9.
        Hide
        Mark Miller added a comment -

        the protected method

        Its funny, because literally this morning, I was looking at CachingWrapperFilter for other reasons and I saw that method and briefly wondered what it was for - it just return what was passed to it by default. I moved on without thinking too much of it - but it makes perfect sense now

        Show
        Mark Miller added a comment - the protected method Its funny, because literally this morning, I was looking at CachingWrapperFilter for other reasons and I saw that method and briefly wondered what it was for - it just return what was passed to it by default. I moved on without thinking too much of it - but it makes perfect sense now
        Hide
        Uwe Schindler added a comment -

        I will post a new patch later with some added tests. I will also add SortedVIntList to the cacheable implementations.

        Maybe in 3.1 we could add a DocIdSet.isCacheable() method (like the proposed isRandomAccess()), that defaults to false. The 4 utils impls of DocIdSet could then return true here. This would be cleaner than the instanceof check.

        Or do somebody want isCacheable to DocIdSet added for 2.9, too?

        Show
        Uwe Schindler added a comment - I will post a new patch later with some added tests. I will also add SortedVIntList to the cacheable implementations. Maybe in 3.1 we could add a DocIdSet.isCacheable() method (like the proposed isRandomAccess()), that defaults to false. The 4 utils impls of DocIdSet could then return true here. This would be cleaner than the instanceof check. Or do somebody want isCacheable to DocIdSet added for 2.9, too?
        Hide
        Mark Miller added a comment -

        Could we commit now Uwe? Its easy to tweak if we have to. Its likely to work and stick like this though - so if we commit, I can make the release artifcats and run tests and all that jazz - then if we are good, it will be a very fast turnaround. Otherwise I have to start all that much later in either case. I'd prefer to be optimistic and get everything ready by tonight.

        I'll leave it up to you though.

        Show
        Mark Miller added a comment - Could we commit now Uwe? Its easy to tweak if we have to. Its likely to work and stick like this though - so if we commit, I can make the release artifcats and run tests and all that jazz - then if we are good, it will be a very fast turnaround. Otherwise I have to start all that much later in either case. I'd prefer to be optimistic and get everything ready by tonight. I'll leave it up to you though.
        Hide
        Uwe Schindler added a comment -

        I understand you, but:

        After reviewing other DocIdSets, I found more of them, that could be easily cached, like e.g. one of the DocIdSets returned by FieldCacheRangeFilter (the only-FieldCache-non-TermDocs-backed one). You cannot check all of them with instanceof and may some of them twice, which is a bad idea. A simple check with DocIdSet.isCacheable() looks like a good solution. If we default it to false, we have no backwards-problem.

        If we do not want to do this, I will add at least SortedVIntList to the cacheable implementations.

        Show
        Uwe Schindler added a comment - I understand you, but: After reviewing other DocIdSets, I found more of them, that could be easily cached, like e.g. one of the DocIdSets returned by FieldCacheRangeFilter (the only-FieldCache-non-TermDocs-backed one). You cannot check all of them with instanceof and may some of them twice, which is a bad idea. A simple check with DocIdSet.isCacheable() looks like a good solution. If we default it to false, we have no backwards-problem. If we do not want to do this, I will add at least SortedVIntList to the cacheable implementations.
        Hide
        Mark Miller added a comment -

        Okay - no problem - if we already know theres more to do, no point yet. Lets not do it before it makes sense.

        Show
        Mark Miller added a comment - Okay - no problem - if we already know theres more to do, no point yet. Lets not do it before it makes sense.
        Hide
        Mark Miller added a comment -

        I guess I would lean toward isCacheable myself. Its much cleaner than trying to juggle a list and overrides.

        I guess the downside is that someone could be using a custom one that was cacheable, and so it would be copied. But they can
        update to the new API as a workaround.
        And you get that with the other solution in certain cases too (unless the user updates to use the new API and does the override).
        That also requires tracking which CachingWrapperFilters get passed what Filters and its just much uglier right?

        +1 on isCacheable.

        You could argue to keep the override too - if someone was using a jar that says its not cachaeable, but it is, and can't change the source. A
        Likely scenario I'm sure But worth the method it would seem. It still becomes much less necessary though.

        Show
        Mark Miller added a comment - I guess I would lean toward isCacheable myself. Its much cleaner than trying to juggle a list and overrides. I guess the downside is that someone could be using a custom one that was cacheable, and so it would be copied. But they can update to the new API as a workaround. And you get that with the other solution in certain cases too (unless the user updates to use the new API and does the override). That also requires tracking which CachingWrapperFilters get passed what Filters and its just much uglier right? +1 on isCacheable. You could argue to keep the override too - if someone was using a jar that says its not cachaeable, but it is, and can't change the source. A Likely scenario I'm sure But worth the method it would seem. It still becomes much less necessary though.
        Hide
        Michael McCandless added a comment -

        we could add a DocIdSet.isCacheable() method

        +1 to this, and to keeping docIdSetToCache() for overriding. This is much cleaner than trying to check for every known DocIdSet impl.

        Thanks for fixing this Uwe!

        Show
        Michael McCandless added a comment - we could add a DocIdSet.isCacheable() method +1 to this, and to keeping docIdSetToCache() for overriding. This is much cleaner than trying to check for every known DocIdSet impl. Thanks for fixing this Uwe!
        Hide
        Mark Miller added a comment -

        Now I regret not pushing for the isCachable route to fix the Highlighter caching issue

        Show
        Mark Miller added a comment - Now I regret not pushing for the isCachable route to fix the Highlighter caching issue
        Hide
        Uwe Schindler added a comment -

        What do you mean?

        Show
        Uwe Schindler added a comment - What do you mean?
        Hide
        Mark Miller added a comment -

        Remember the CachingTokenFilter issue that first popped with the Highlighter? And you suggested we add the isCachable? I was hoping the RC process was going to be smoother at the time, so I just pushed to have the simple override for now. But looking back with what I know now, I wish we had gone for it.

        Show
        Mark Miller added a comment - Remember the CachingTokenFilter issue that first popped with the Highlighter? And you suggested we add the isCachable? I was hoping the RC process was going to be smoother at the time, so I just pushed to have the simple override for now. But looking back with what I know now, I wish we had gone for it.
        Hide
        Uwe Schindler added a comment -

        Here the implementation with isCacheable(). It also has an extra test, checking cacheable on various filter with/without CachingWrapperFilter. The switch was removed from CachingWrapperFilter, as the DocIdSet now says, if it should be cached or not.

        I scanned through all DocIdSets and checked, if cacheable (in general: Do they do disk I/O during iteration or are else very slow). I looked in core and contrib looking for "extends DocIdSet" and "new DocIdSet()" anon inner classes.

        If somebody knows another DocIdSet, that is cacheable, please tell!

        All tests pass.

        Show
        Uwe Schindler added a comment - Here the implementation with isCacheable(). It also has an extra test, checking cacheable on various filter with/without CachingWrapperFilter. The switch was removed from CachingWrapperFilter, as the DocIdSet now says, if it should be cached or not. I scanned through all DocIdSets and checked, if cacheable (in general: Do they do disk I/O during iteration or are else very slow). I looked in core and contrib looking for "extends DocIdSet" and "new DocIdSet()" anon inner classes. If somebody knows another DocIdSet, that is cacheable, please tell! All tests pass.
        Hide
        Uwe Schindler added a comment -

        Here an updated patch, with a test that QueryWrapperFilter's DocIdSet is cached correctly and the copy to BitSet code works as exspected (TestCachingWrapperFilter does not check this).

        I think this is ready to commit now.

        Show
        Uwe Schindler added a comment - Here an updated patch, with a test that QueryWrapperFilter's DocIdSet is cached correctly and the copy to BitSet code works as exspected (TestCachingWrapperFilter does not check this). I think this is ready to commit now.
        Hide
        Uwe Schindler added a comment -

        Mark/Yonik: Do you know if Solr has possible DocIdSet impls that are cacheable and are used for filtering. It would be good to also overide isCacheable() to return true.

        Show
        Uwe Schindler added a comment - Mark/Yonik: Do you know if Solr has possible DocIdSet impls that are cacheable and are used for filtering. It would be good to also overide isCacheable() to return true.
        Hide
        Yonik Seeley added a comment -

        Mark/Yonik: Do you know if Solr has possible DocIdSet impls that are cacheable and are used for filtering.

        Doesn't really matter too much - Solr doesn't use CachingWrapperFilter at all. But I'll handle any instances on upgrade of the Lucene libs.

        Show
        Yonik Seeley added a comment - Mark/Yonik: Do you know if Solr has possible DocIdSet impls that are cacheable and are used for filtering. Doesn't really matter too much - Solr doesn't use CachingWrapperFilter at all. But I'll handle any instances on upgrade of the Lucene libs.
        Hide
        Mark Miller added a comment -

        There are a couple - including the deprecated Solr version of OpenBitSet itself (only used in tests in Solr right now) - then a couple that use an underlying OpenBitSet -

        but as Yonik says - it doesn't appear it would end up being an issue - CachingWrapperFilter isn't likely ever going to be used with them - nice to have them correct anyhow though.

        Show
        Mark Miller added a comment - There are a couple - including the deprecated Solr version of OpenBitSet itself (only used in tests in Solr right now) - then a couple that use an underlying OpenBitSet - but as Yonik says - it doesn't appear it would end up being an issue - CachingWrapperFilter isn't likely ever going to be used with them - nice to have them correct anyhow though.
        Hide
        Uwe Schindler added a comment - - edited
        • Added CHANGES.txt entry.
        • Added a test to FieldCacheRangeFilter, that the iterators work correct and isCacheable() is false, when the index contains deleted docs and the range contains 0 (in this case TermDocs must be visited to test if a doc is deleted or has value 0 in cache).

        I will go to bed now and commit tomorrow!

        Show
        Uwe Schindler added a comment - - edited Added CHANGES.txt entry. Added a test to FieldCacheRangeFilter, that the iterators work correct and isCacheable() is false, when the index contains deleted docs and the range contains 0 (in this case TermDocs must be visited to test if a doc is deleted or has value 0 in cache). I will go to bed now and commit tomorrow!
        Hide
        Mark Miller added a comment -

        Patch looks good to me - just looked for any you might have missed and didn't see a one.

        Show
        Mark Miller added a comment - Patch looks good to me - just looked for any you might have missed and didn't see a one.
        Hide
        Uwe Schindler added a comment -

        Doesn't really matter too much - Solr doesn't use CachingWrapperFilter at all. But I'll handle any instances on upgrade of the Lucene libs

        Thanks!

        then a couple that use an underlying OpenBitSet

        These are easy, because OpenBitSet has isCacheable()==true -> nothing to do

        but as Yonik says - it doesn't appear it would end up being an issue - CachingWrapperFilter isn't likely ever going to be used with them - nice to have them correct anyhow though.

        Correct. And nothing breaks if you have a cacheable DocIdSet not marked as such. It will simply be copied to an OpenBitSet (with some perf and memory cost on the first call to CachingWrapperFilter.getDocIdSet()).

        Show
        Uwe Schindler added a comment - Doesn't really matter too much - Solr doesn't use CachingWrapperFilter at all. But I'll handle any instances on upgrade of the Lucene libs Thanks! then a couple that use an underlying OpenBitSet These are easy, because OpenBitSet has isCacheable()==true -> nothing to do but as Yonik says - it doesn't appear it would end up being an issue - CachingWrapperFilter isn't likely ever going to be used with them - nice to have them correct anyhow though. Correct. And nothing breaks if you have a cacheable DocIdSet not marked as such. It will simply be copied to an OpenBitSet (with some perf and memory cost on the first call to CachingWrapperFilter.getDocIdSet()).
        Hide
        Mark Miller added a comment -

        These are easy, because OpenBitSet has isCacheable()==true -> nothing to do

        They (or was it just one?) does it with delegation though - so it will reg as just a DocIdSet (if it mattered).

        I've lost the code right now, but I think it was still cacheable.

        Show
        Mark Miller added a comment - These are easy, because OpenBitSet has isCacheable()==true -> nothing to do They (or was it just one?) does it with delegation though - so it will reg as just a DocIdSet (if it mattered). I've lost the code right now, but I think it was still cacheable.
        Hide
        Uwe Schindler added a comment -

        These are easy, because OpenBitSet has isCacheable()==true -> nothing to do

        They (or was it just one?) does it with delegation though - so it will reg as just a DocIdSet (if it mattered).

        OK, I misunderstood. In principle the same like FilteredDocIdSet (see my patch).

        Show
        Uwe Schindler added a comment - These are easy, because OpenBitSet has isCacheable()==true -> nothing to do They (or was it just one?) does it with delegation though - so it will reg as just a DocIdSet (if it mattered). OK, I misunderstood. In principle the same like FilteredDocIdSet (see my patch).
        Hide
        Thomas Becker added a comment -

        Thanks for the patch. Issue can be closed I guess.

        Attached you'll find a hotspot list for the same request with a patched 2.9-RC4. With the old api it's as fast as 2.4. Maybe a tad faster. With the new api it's a bit slower. Most probably due to not caching sorted resultsets anymore in our implementation.

        Great work guys. Nice patch @Uwe. And big thanks for your support.

        Was a pleasure. Now it's time to find a way to contribute something back to the community.

        Show
        Thomas Becker added a comment - Thanks for the patch. Issue can be closed I guess. Attached you'll find a hotspot list for the same request with a patched 2.9-RC4. With the old api it's as fast as 2.4. Maybe a tad faster. With the new api it's a bit slower. Most probably due to not caching sorted resultsets anymore in our implementation. Great work guys. Nice patch @Uwe. And big thanks for your support. Was a pleasure. Now it's time to find a way to contribute something back to the community.
        Hide
        Thomas Becker added a comment -

        HotSpots - lucene 2.9-RC4 patched

        Show
        Thomas Becker added a comment - HotSpots - lucene 2.9-RC4 patched
        Hide
        Uwe Schindler added a comment -

        Thanks Thomas!

        Attached is a improved patch, that uses OpenBitSetDISI, a subclass of OpenBitSet, which does the copying of the iterator to the bit set by itsself. Why duplicate the code?

        It also handles the null return value of iterator(), which is allowed. It caches an EMPTY_DOCIDSET in this case.

        I will commit soon!

        Show
        Uwe Schindler added a comment - Thanks Thomas! Attached is a improved patch, that uses OpenBitSetDISI, a subclass of OpenBitSet, which does the copying of the iterator to the bit set by itsself. Why duplicate the code? It also handles the null return value of iterator(), which is allowed. It caches an EMPTY_DOCIDSET in this case. I will commit soon!
        Hide
        Uwe Schindler added a comment -

        Committed revision: 816154. Thanks Thomas for reporting this!

        Show
        Uwe Schindler added a comment - Committed revision: 816154. Thanks Thomas for reporting this!

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Mark Miller
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development