Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1)
      we do wasted seeks.

      While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned whether we could ever backport that to 3.1, for example.

      This is a simpler solution here just to solve this one problem in TermQuery... we could just revert it in trunk when we resolve LUCENE-2694,
      but I don't think we should leave things as they are in 3.x.

      1. LUCENE-2829.patch
        16 kB
        Michael McCandless
      2. LUCENE-2829.patch
        14 kB
        Michael McCandless
      3. LUCENE-2829.patch
        2 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        quickly hacked up patch,

        for the IndexSearcher case, we sum up docFreq ourselves, along the way saving the hashcodes
        of the readers where the term exists into a set.

        if this list exists (IndexSearcher case), the scorer then checks the reader's hashcode against this list...
        if we get a collision, worst case we do a wasted seek. but we don't have to keep any hairy references
        to readers or anything.
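
        The approach above can be sketched roughly like this (hypothetical names, not the patch's actual code; `Object` stands in for `IndexReader`):

        ```java
        import java.util.HashSet;
        import java.util.Set;

        // Sketch of the idea: while summing docFreq across segments, remember the
        // hashcodes of the readers that actually contain the term; the scorer later
        // skips the seek for any reader whose hashcode is absent from the set.
        public class TermSeekFilter {
            private final Set<Integer> readerHashes = new HashSet<>();

            // called once per segment, while computing the total docFreq
            public void recordHit(Object reader) {
                readerHashes.add(reader.hashCode());
            }

            // called by the scorer; false means the seek can be skipped entirely.
            // A hash collision only produces a false positive, whose worst case is
            // one wasted seek -- the trade-off described above.
            public boolean mayContainTerm(Object reader) {
                return readerHashes.contains(reader.hashCode());
            }
        }
        ```

        Since only hashcodes are stored, the set keeps no hard references to the readers themselves.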

        Michael McCandless added a comment -

        I made a random PK lookup tester (committed to luceneutil), to look up by docid (unique key) from the luceneutil index.

        Pre-patch it's 53 usec per lookup and with this patch it's 31 usec – ~42% faster!

        Robert Muir added a comment -

        right, we just have to not do stupid things like hash hashcodes to make it faster for when the data is hot...
        but as a start this is safe, hopefully we could do something non-invasive (and backportable) to make it faster.

        Hoss Man added a comment -

        The patch is over my head, but providing a super optimized solution to the "primary key" type lookup problem definitely seems worthwhile – it has me wondering if a "PrimaryKeyQuery" that works like TermQuery but quits collecting as soon as it finds one matching document would be a good idea?

        Robert Muir added a comment -

        Hoss Man, well I think if you surely know it's a PK field you can definitely do something better, starting with a custom collector that does something like what you mentioned, with no PQ at all etc.

        But in this case, though i categorized it as PK, the general problem is this:

        • in lots of cases we do redundant seeks, like to get the docFreq, then to get the DocsEnum
        • in most cases the term dictionary cache helps here because the 2nd time (e.g. getting DocsEnum) is cached.

        Here's the problem with "PK" or "PK-ish" (low freq terms like what wildcards/fuzzies/range queries hit too):

        • our cache doesn't cache "negative" hits, the fact that a term doesn't exist in some segment.
        • For example in the PK case, if there are 15 segments we always get at most 1 cache hit and
          at least 14 misses when getting the DocsEnum, so we do at least 14 wasted seeks always.
        • For other low frequency terms that don't exist in all segments (very precise dates or what have you)
          the same idea applies, just to a lesser extent: the PK is the worst.
        Michael McCandless added a comment -

        I think a cleaner interface may be for the Weight.scorer method to receive the ord of the sub reader in the parent?

        This way TermWeight doesn't need a hash / ReaderView – it can use just an array.

        We could also make it a struct/class, that contain parent, sub, and ord. This way TermWeight (and others) could assert they are not invoked on a different parent reader.
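
        A minimal sketch of such a struct (hypothetical class and field names, assumed shapes only; `Object` stands in for `IndexReader`):

        ```java
        // Bundles the parent (top-level) reader, one sub-reader, and the sub-reader's
        // ordinal within the parent. With the ord available, TermWeight can index
        // into a plain array instead of hashing, and can assert it is not invoked
        // on a different parent reader.
        public class ReaderContext {
            public final Object parent; // the top-level reader
            public final Object sub;    // one segment (sub) reader
            public final int ord;       // index of sub within parent's sub-readers

            public ReaderContext(Object parent, Object sub, int ord) {
                this.parent = parent;
                this.sub = sub;
                this.ord = ord;
            }
        }
        ```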

        Robert Muir added a comment -

        I think a cleaner interface may be for the Weight.scorer method to receive the ord of the sub reader in the parent?

        Yes, ideally with the actual df in there. This would save the third seek in the bulkpostings branch.

        But at the same time, i'm worried/don't want this issue to evolve into TermState (LUCENE-2694). I wasn't thinking
        that this was any kind of end-solution but just an approach we could take that would work against 3.1

        Michael McCandless added a comment -

        but just an approach we could take that would work against 3.1

        Well it's already a 42% speedup so it seems very worthwhile already.

        But, then, passing a struct (parent/sub/ord) is a fairly small change, and, if it "matches" the change we will make on LUCENE-2694, then that's great.

        Robert Muir added a comment -

        But, then, passing a struct (parent/sub/ord) is a fairly small change, and, if it "matches" the change we will make on LUCENE-2694, then that's great.

        Ok, that might be a good approach, to fix it this way in LUCENE-2694 (or actually, preferably add the parent/sub/ord in its own issue!),
        but in 3.1 we could use the struct to avoid wasted seeks on PK terms...

        Seems like backporting the entire termstate thing could be a little tricky/risky for 3.1, with not much to gain there except
        PK lookups anyway, since the multitermqueries there tend to be slow (dominated by term comparison) and don't even work
        per-segment anyway.

        Yonik Seeley added a comment - edited

        Why not keep the TermState cache and use it for all queries except MTQ, while using a different mechanism for MTQ to avoid trashing the cache?

        The cache has a number of advantages that may never be duplicated in a different type of API, including

        • actually cache frequently used terms across different requests
        • cache terms reused in the same request. term proximity boosting is an example: +united +states "united states"^10

        edit: and as robert previously pointed out, if we cached misses as well, then we could avoid needless seeks on segments that don't contain the term.

        Robert Muir added a comment -

        edit: and as robert previously pointed out, if we cached misses as well, then we could avoid needless seeks on segments that don't contain the term.

        True, this is a good idea, just a little trickier:

        • In trunk, we have TermsEnum.seek(BytesRef text, boolean useCache), defaulting to true.
        • FilteredTermsEnum passes false here, so the multitermqueries don't populate the cache with
          garbage while enumerating (eg foo*), only explicitly at the end with cacheTerm() (per-segment)
          for the ones that were actually accepted. They sum up their docFreq themselves to prevent the
          first wasted seek in TermQuery.
        • So this solution would make MTQ worse, as it would cause them to trash the caches in the
          second wasted seek (the docsenum) where they do not today, with negative entries for the
          segments where the term doesn't exist. Today they do this wasted seek, but they don't
          trash the cache here. The only solution to prevent that is the PerReaderTermState
          (or something equally complicated).
        • We would have to look at other places where negative entries would hurt, for example
          rebuilding spellcheck indexes uses this 'termExists()' method implemented with docFreq.
          So we would have to likely change spellcheck's code to use a TermsEnum and
          seek(term, false)... using a termsenum in parallel with the spellcheck dictionary would
          obviously be more efficient for the index-based spellcheck case (forget about caching)
          versus docFreq()'ing every term... but we cannot assume the spellcheck "Dictionary"
          is actually in term order, (imagine the File-based dictionary case), so we can't
          implement this today.
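
        The seek/useCache behavior in the first two bullets can be modeled in miniature (simplified, non-Lucene types; a `Map` stands in for the term dictionary):

        ```java
        import java.util.HashMap;
        import java.util.Map;

        // Toy model of TermsEnum.seek(text, useCache): the cache is consulted and
        // populated only when useCache is true, and misses are never cached.
        public class CachingTermsEnum {
            private final Map<String, Integer> cache = new HashMap<>();
            private final Map<String, Integer> dict; // stands in for the term dictionary
            public int seeks = 0;                    // counts real dictionary lookups

            public CachingTermsEnum(Map<String, Integer> dict) {
                this.dict = dict;
            }

            // TermQuery would pass useCache=true; FilteredTermsEnum (MTQs like foo*)
            // passes false so enumeration doesn't fill the cache with garbage.
            public Integer seek(String term, boolean useCache) {
                if (useCache && cache.containsKey(term)) {
                    return cache.get(term);
                }
                seeks++;
                Integer docFreq = dict.get(term);
                if (useCache && docFreq != null) {
                    cache.put(term, docFreq); // note: no negative entries are cached
                }
                return docFreq;
            }

            // MTQs explicitly cache only the terms they actually accepted
            public void cacheTerm(String term, int docFreq) {
                cache.put(term, docFreq);
            }
        }
        ```

        Because misses are never cached, repeated lookups of a term that doesn't exist in a segment pay a dictionary seek every time, which is exactly the wasted work this issue targets.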

        On 3.x I think it's slightly less complicated as there is already a hack in the cache to
        prevent sequential termsenums from trashing it (e.g. foo*), and pretty much all the MTQs
        just enumerate sequentially anyway... (except NRQ which doesn't enum many terms
        anyway, likely not a problem).

        But we would have to at least fix the spellcheck case there too I think.

        Not saying I don't like your idea... just saying there's more work to do it.

        Robert Muir added a comment -

        On further thought Yonik, your idea is really completely unrelated.

        We shouldn't be seeking to terms/relying upon the terms dictionary cache
        internally when we don't need to...

        whether or not it's populated with negative entries for the more general case is unrelated,
        even if we go that route we shouldn't be lazy and rely upon that.

        Michael McCandless added a comment -

        The cache has a number of advantages that may never be duplicated in a different type of API

        +1 – I agree we should keep the TermState cache. It has benefits outside of re-use within a single query.

        But allowing term-lookup-intensive clients like MTQ to do their own caching (ie pulling the TermState from the enum) is also important. I think we need both.

        On caching misses... that makes me nervous. If there are apps out there that do a lot of checking for terms that don't exist, that can destroy the cache.

        The cache is a great safety net but I think our core queries should be good consumers, when possible, and hold their own TermState.

        Earwin Burrfoot added a comment -

        Term lookup misses can be alleviated by a simple Bloom Filter.
        No caching misses required, helps both PK and near-PK queries.
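
        A minimal sketch of that suggestion (hypothetical class with two toy hash functions, not anything in Lucene):

        ```java
        import java.util.BitSet;

        // A tiny Bloom filter consulted before seeking a segment's term dictionary.
        // A negative answer is definite, so segments that cannot contain the term
        // are skipped without caching any negative entries; a positive answer may
        // be a false positive, costing at most one wasted seek.
        public class TermBloomFilter {
            private final BitSet bits;
            private final int size;

            public TermBloomFilter(int size) {
                this.size = size;
                this.bits = new BitSet(size);
            }

            private int h1(String term) { return Math.floorMod(term.hashCode(), size); }
            private int h2(String term) { return Math.floorMod(term.hashCode() * 31 + 17, size); }

            // called once per term when building the per-segment filter
            public void add(String term) {
                bits.set(h1(term));
                bits.set(h2(term));
            }

            // consulted before the term-dictionary seek; false means skip the segment
            public boolean mayContain(String term) {
                return bits.get(h1(term)) && bits.get(h2(term));
            }
        }
        ```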

        Robert Muir added a comment -

        Bloom filters and negative caches are nice, but please open separate issues!
        I am starting to feel like it's mandatory to refactor the entirety of Lucene to make a single incremental improvement.

        So, I'd like to proceed with this issue as-is, to make TermWeight explicitly do fewer seeks.

        Earwin Burrfoot added a comment -

        Nobody halts your progress, we're merely discussing.

        I, on the other hand, have a feeling that Lucene is overflowing with "single incremental improvements" aka "hacks", as they are easier and faster to implement than trying to get a bigger picture, and, yes, rebuilding everything.
        For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Whether we implement bloom filters, or just guarantee to keep the whole term dict in memory with a reasonable lookup routine (e.g. as an FST).

        Having said that, I reiterate, I'm not here to stop you or turn this issue into something else.

        Robert Muir added a comment -

        For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant.

        Right, it is hackish, but what is a worse hack is wasted seeks in our next 3.1 release because we can't
        keep scope under control and fix small problems without rewriting everything, which means less
        gets backported to our stable branch.

        Anyway, I'm just gonna mark this won't fix so I don't have to deal with it anymore.

        Michael McCandless added a comment -

        I think we should commit this, and if/when LUCENE-2694 and/or LUCENE-2831 are committed to 3.x, we can revisit it.

        Michael McCandless added a comment -

        I ported this patch to 3.x, and fixed a few tests that were illegally passing top readers to Weight.scorer. There's still a couple of nocommits, but I think it's close.

        Michael McCandless added a comment -

        New patch. I added VirtualMethods to Sim to make sure Sim subclasses that don't override idfExplain that takes docFreq are still called.

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
