Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/join
    • Labels: None
    • Lucene Fields: New

      Description

      Solr has had a (pseudo) join query for a while now. I think this should also be available in Lucene.

      1. LUCENE-3602.patch
        30 kB
        Martijn van Groningen
      2. LUCENE-3602.patch
        30 kB
        Martijn van Groningen
      3. LUCENE-3602.patch
        30 kB
        Martijn van Groningen
      4. LUCENE-3602.patch
        30 kB
        Martijn van Groningen
      5. LUCENE-3602.patch
        30 kB
        Martijn van Groningen
      6. LUCENE-3602.patch
        30 kB
        Martijn van Groningen
      7. LUCENE-3602.patch
        20 kB
        Martijn van Groningen
      8. LUCENE-3602.patch
        4 kB
        Martijn van Groningen
      9. LUCENE-3602.patch
        25 kB
        Martijn van Groningen
      10. LUCENE-3602.patch
        17 kB
        Martijn van Groningen
      11. LUCENE-3602-3x.patch
        28 kB
        Martijn van Groningen
      12. LUCENE-3602-3x.patch
        24 kB
        Martijn van Groningen

        Activity

        Martijn van Groningen added a comment -

        Attached an initial version of the JoinQuery, which is based on the one in Solr.

        Michael McCandless added a comment -

        Awesome to finally bring JoinQuery to pure Lucene apps!

        Can we cut back to a normal ctor (not a builder API) to create the
        JoinQuery? One can always create a builder API layer on top if
        necessary.

        How does the preComputedFromDocs work? It's not per-segment? Like
        it's a bitset across entire toplevel doc space?

        Hmm we are also using MultiFields.getLiveDocs, which is quite slow to
        use (must do binary search on each doc lookup).

        I wonder if we can make this work per-segment... but that can be a 2nd
        phase.

        I think you can use seekExact instead of seekCeil? Better
        performance...

        What is the AdjustedDISI for (and when would Weight.scorer get a
        top-level context...)?

        Martijn van Groningen added a comment - - edited

        I'll remove the builder api. This was just my sugar api. I'll change that to a constructor where the toSearcher and preComputedFromDocs are optional.

        How does the preComputedFromDocs work? It's not per-segment? Like it's a bitset across entire toplevel doc space?

        Yes, it is a bitset across all segments, so top level. The whole query implementation is top level. People can use this to execute the real query outside the JoinQuery (if this isn't specified, the query is executed when the Weight is created for the JoinQuery). I think this would be nice if people want to, for example, cache the encapsulated query.

        I wonder if we can make this work per-segment... but that can be a 2nd phase.

        I thought about that. But this would mean we need to execute the join query in two phases. The first phase would collect all the "from" values from the documents matching the encapsulated query. The second phase would match the documents on the "to" side against the "from" values collected in the first phase. This will work per segment and I think it would also make the JoinQuery work in a distributed environment. We can do this in a second development phase.

        I think you can use seekExact instead of seekCeil? Better performance...

        I'll change that.

        What is the AdjustedDISI for (and when would Weight.scorer get a top-level context...)?

        Actually that is only used to map top-level docids to segment docids (basically doing: toplevel_docid - base). This is necessary because Weight.scorer works per segment and the query implementation does not.
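
        For illustration, the docID arithmetic described above boils down to something like this (a minimal sketch, not the patch's AdjustedDISI; the class name is made up):

        // Minimal sketch of mapping between top-level and segment-local docIDs.
        final class TopLevelDocMapper {
          private final int docBase; // top-level docID of the segment's first document
          private final int maxDoc;  // number of documents in the segment

          TopLevelDocMapper(int docBase, int maxDoc) {
            this.docBase = docBase;
            this.maxDoc = maxDoc;
          }

          // Top-level docID -> segment-local docID, or -1 if the doc is not in this segment.
          int toSegment(int topLevelDocId) {
            int local = topLevelDocId - docBase;
            return (local >= 0 && local < maxDoc) ? local : -1;
          }

          // Segment-local docID -> top-level docID.
          int toTopLevel(int segmentDocId) {
            return segmentDocId + docBase;
          }
        }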

        Jason Rutherglen added a comment -

        Great to see this moving out of Solr and getting new eyes on it (with added improvements)!

        Martijn van Groningen added a comment -

        Updated patch.

        • Removed builder api.
        • Updated to latest Lucene api changes.
        • Used seekExact instead of the seekCeil method.
        • Added a random test.

        Michael McCandless added a comment -

        Patch looks good!

        • I like the test...
        • Maybe rename actualQuery to fromQuery? (So it's clear that
          JoinQuery runs fromQuery using fromSearcher, joining on
          fromSearcher.fromField to toSearcher.toField).
        • Why preComputedFromDocs...? Like if you were to cache something,
          wouldn't you want to cache the toSearcher's bitset instead?
        • Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult, to
          contrast it w/ the per-segment scoring? And add a comment
          explaining that we compute it once (on first segment) and all
          later segments then reuse it?
        • I wonder if we could make this a Filter instead, somehow? Ie, at
          its core it converts a top-level bitset in the fromSearcher doc
          space into the joined bitset in the toSearcher doc space. It
          could even maybe just be a static method taking in fromBitset and
          returning toBitset, which could operate per-segment on the
          toSearcher side? (Separately: I wonder if JoinQuery should do
          something with the scores of the fromQuery....? Not right now but
          maybe later...).
        • Why does the JoinQuery javadoc say "The downside of this
          is that in a sharded environment not all documents might get
          joined / linked." as a downside to the top-level approach? Maybe
          reword that to state that all joined to/from docs must reside in
          the same shard? In theory we could (later) make a shard friendly
          approach? Eg, first pass builds up all unique Terms in the
          fromSearcher.fromField for docs matching the query (across all
          shards) and 2nd pass is basically a TermFilter on those...
        • Not sure it matters, but... including the preComputedFromDocs in
          hashCode/equals is quite costly (it goes bit by bit...). Maybe it
          shouldn't be included, since it contains details about the
          particular searcher that query had been run against? (In theory
          Query instances are searcher independent.)

        In general I think this approach is somewhat inefficient, because it
        always iterates over every possible term in fromSearcher.fromField,
        checking the docs for each to see if there is a match in the query.
        Ie, it's like FieldCache, in that it un-inverts, but it's uninverting
        on every query.

        I wonder if we could use DocTermOrds instead? (Or,
        FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know fromSearcher.fromField is
        single-valued). This way we uninvert once (on init), and then doing
        the join should be much faster since for each fromDocID we can lookup
        the term(s) to join on.

        Likewise on the toSearcher side, if we had doc <-> ord/term loaded we
        could do the forward (term -> ord) lookup quickly (in memory binary
        search).

        But then this will obviously use RAM... so we should have the choice
        (and start w/ the current patch!).
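
        To make that cost concrete, the per-query un-inverting join described above has roughly this shape (an illustrative sketch with stand-in types, not the actual patch code):

        // Sketch of the term-iterating join: visit every term in fromField, check whether
        // any of its docs matches the fromQuery bitset, and if so mark all toField docs
        // carrying the same term. The work is proportional to all terms in fromField,
        // no matter how few docs the fromQuery actually matches.
        final class TermIteratingJoinSketch {
          interface TermCursor {   // stand-in for a TermsEnum plus postings over a field
            boolean next();        // advance to the next term; false when exhausted
            String term();         // current term
            int[] docs();          // top-level docIDs carrying the current term
          }

          static java.util.BitSet join(TermCursor fromTerms, java.util.BitSet fromQueryDocs,
                                       TermCursor toTerms, int toMaxDoc) {
            java.util.Set<String> matchedTerms = new java.util.HashSet<String>();
            while (fromTerms.next()) {
              for (int doc : fromTerms.docs()) {
                if (fromQueryDocs.get(doc)) {
                  matchedTerms.add(fromTerms.term());
                  break;
                }
              }
            }
            java.util.BitSet joined = new java.util.BitSet(toMaxDoc);
            while (toTerms.next()) {
              if (matchedTerms.contains(toTerms.term())) {
                for (int doc : toTerms.docs()) {
                  joined.set(doc);
                }
              }
            }
            return joined;
          }
        }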

        Martijn van Groningen added a comment -

        Maybe rename actualQuery to fromQuery?

        Yes, fromQuery makes more sense than actualQuery.

        Why preComputedFromDocs...? Like if you were to cache something,
        wouldn't you want to cache the toSearcher's bitset instead?

        This is the case if your from query was cached and your toSearcher's
        bitset isn't, which is a likely scenario.
        But caching the toSearcher's bitset is better of course, when
        possible. But this should happen outside the JoinQuery, right?

        Maybe rename JoinQueryWeight.joinResult to topLevelJoinResult,

        I agree, a much more descriptive name.

        I wonder if we could make this a Filter instead, somehow? Ie, at
        its core it converts a top-level bitset in the fromSearcher doc
        space into the joined bitset in the toSearcher doc space. It
        could even maybe just be a static method taking in fromBitset and
        returning toBitset, which could operate per-segment on the
        toSearcher side? (Separately: I wonder if JoinQuery should do
        something with the scores of the fromQuery....? Not right now but
        maybe later...).

        It just matches docs from the "from" side to the "to" side. That is all... So a static method / filter should be able to do the job.
        I'm not sure, but if it is a query it might one day be able to encapsulate the joining in the Lucene query language?

        Maybe reword that to state that all joined to/from docs must reside in the same shard?

        +1

        I wonder if we could use DocTermOrds instead? (Or,
        FieldCache.DocTermsIndex or DocValues.BYTES_*, if we know
        fromSearcher.fromField is
        single-valued). This way we uninvert once (on init), and then doing
        the join should be much faster since for each fromDocID we can lookup
        the term(s) to join on.

        I really like that idea! This already crossed my mind a few days ago
        as an improvement to speed up the joining. It would be nice if the user could
        choose between a more-RAM-but-faster variant and a less-RAM-but-slower variant.
        I think we can just make two concrete JoinQuery impls that each have a different
        joinResult(...) impl.

        Michael McCandless added a comment -

        This is the case if your from query was cached and your toSearcher's
        bitset isn't, which is a likely scenario.

        Hmm can you describe this? You mean the app sometimes actually uses the fromSearcher.fromQuery's results, directly, without joining?

        It just matches docs from the "from" side to the "to" side. That is all... So a static method / filter should be able to do the job.
        I'm not sure, but if it is a query it might one day be able to encapsulate the joining in the Lucene query language?

        Yeah... the core API is really the join method, to translate top-level docIDs in fromSearcher over to toSearcher's top-level docIDs.

        The AdjustedDISI (maybe rename SliceDISI? SubReaderDISI? ie, something to indicate it "slices" a sub-reader's portion of the top-level docID space) can then be used to translate back into a per-segment Filter.

        I think it would be cleaner as a Filter...? This is actually similar to DuplicateFilter, which also must operate on top-level docIDs (since dups can happen in different segments).

        Would be nice if the user can choose between a more ram but faster variant and a less ram but slower variant.

        Yeah I agree... but what worries me is just how slow this non-RAM version will be. Ie, it must do the full join and uninvert every time; so even if your fromQuery only matches a tiny number of docs... you pay massive cost of the full join. Even better than using FC/DV/DTO to map docID -> term(s) per query, we could hold in RAM the join result itself (docID -> docID(s)) in some form, then the query just directly maps the docIDs w/o having to lookup terms again.

        Stepping back a bit... do we know how this impl compares to how ElasticSearch does joins? And to how Solr does...?

        Martijn van Groningen added a comment -

        You mean the app sometimes actually uses the fromSearcher.fromQuery's results, directly, without joining?

        Yes. In the case of Solr it is checking the filter cache.

        but what worries me is just how slow this non-RAM version will be.

        I have been running the JoinQuery on my test data set (10.1 M docs) and it isn't as bad as I expected it would be. This data set contains 100,000 products, each product having 100 offers. The JoinQuery with a *:* query as fromQuery takes about 900 ms and a fromQuery selecting all products with a specific keyword takes about 350 ms. I think this specific query implementation is suitable for environments where RAM might be scarce. The RAM version should be the default.

        Stepping back a bit... do we know how this impl compares to how ElasticSearch does joins? And to how Solr does...?

        ES only has index time joining, right? Solr basically uses the same mechanism as the JoinQuery in this patch, but a bit smarter. It tries to cache the from-term to to-term lookup (see JoinQParserPlugin.java line 367). I couldn't port this part to the join module since this caching relies heavily on the SolrIndexSearcher.

        Jason Rutherglen added a comment -

        Maybe we can (in another issue) move bit set filter caching into SearchManager, for use by Lucene Join (here), and others. At the same time making bitset filtering per-segment, a fundamental improvement from the existing (old) Solr code.

        Martijn van Groningen added a comment -

        Attached a new patch with a completely different joining implementation. Basically the joining happens inside a static method, works per segment and uses FieldCache.

        This is just a small try-out and is lacking any tests.

        Michael McCandless added a comment -

        Wow new patch is tiny – just 2 static methods!

        Right now you do 3 passes – 1st pass records fromSearcher's docIDs
        matching fromQuery; 2nd pass translates those matching docIDs into
        the joinable terms in fromSearcher.fromField; 3rd pass then records
        toSearcher docIDs matching those terms in toField.

        But I think the first 2 passes could be combined? Ie, as you collect
        each hit from fromQuery, instead of recording the docID, go and look up
        the term in fromField for that doc and save it away? Then you don't
        need to save the fromSearcher docIDs? (3rd pass would then be the
        same).

        Also, instead of making a toplevel bit set as the return
        result... could it be an ordinary filter? Then the 3rd pass would be
        implemented in Filter.getDocIDSet, and the Filter instance would hold
        onto these terms computed by the combined 1st/2nd pass?

        I think this is a great step forward over previous patch... so tiny
        too

        The 1st/2nd pass would have "expected" cost, ie on the order of how
        many hits matched in fromQuery. But the 3rd pass has a high cost even
        for tiny queries since it visits every doc, checking whether its terms
        are in the set. We might be able to improve on that somehow... eg if
        the number of terms is small, really you want to invert that process
        (ie visit the postings and gather all matching docs), either with an
        OR query or with TermsFilter (in modules/queries)? Hmm, in fact,
        isn't this just a MultiTermQuery? We can then use auto rewrite mode
        to rewrite as filter or small BooleanQuery?
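
        A minimal sketch of what combining the first two passes could look like (the per-doc term lookup is a stand-in for FieldCache/DocTermOrds; the names are illustrative, not the eventual API):

        // Combined pass 1+2: as each fromQuery hit is collected, immediately look up its
        // join term(s) instead of remembering the docID. The resulting set of terms is
        // all the "to" side needs for the final pass.
        final class FromTermCollectorSketch {
          interface DocTerms {                       // stand-in for an un-inverted fromField
            java.util.List<String> terms(int docId); // join term(s) of a doc (may be empty)
          }

          private final DocTerms docTerms;
          private final java.util.Set<String> collected = new java.util.HashSet<String>();

          FromTermCollectorSketch(DocTerms docTerms) {
            this.docTerms = docTerms;
          }

          void collect(int docId) {                  // called once per fromQuery hit
            collected.addAll(docTerms.terms(docId));
          }

          java.util.Set<String> collectedTerms() {
            return collected;
          }
        }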

        Martijn van Groningen added a comment - - edited

        But I think the first 2 passes could be combined?

        That idea also crossed my mind. We should definitely do this.

        Also, instead of making a toplevel bit set as the return result... could it be an ordinary filter?

        I like that approach. It can just return a TermsFilter instance, right?

        Maybe we should have two variants: one returns an MTQ and one a TermsFilter? So users can choose whether they want a filter or a query.

        Robert Muir added a comment -

        Maybe we should have two variants: one returns an MTQ and one a TermsFilter? So users can choose whether they want a filter or a query.

        But they can always choose the rewriteMode to not be AUTO and explicitly choose filter if they want this.

        Martijn van Groningen added a comment -

        Attached a new patch. Changes:

        • Everything happens now in two passes.
        • The second pass is in the custom MTQ impl.
        • JoinUtil contains one method that returns an MTQ impl.
        • Reused the tests from the previous patch for the JoinUtil.

        But they can always choose the rewriteMode to not be AUTO and explicitly choose filter if they want this.

        Yes, makes sense. This is the first time I have used the MTQ directly, so feedback on my MTQ impl is welcome.

        I think that this patch can easily be backported to 3x.

        Michael McCandless added a comment -

        Looking great! I think we are getting close

        Maybe use BytesRefHash (oal.util) to gather the terms and sort
        them in the end? Then hold onto the BytesRef[] (sorted). It's
        designed for exactly this usage...

        Also, I think you should always seek in your FilteredTermsEnum, and
        simply iterate yourself through the BytesRef[]? Probably perf wasn't
        great before because you were calling TreeSet.higher on each term?

        I think you should call setRewriteMethod in your TermSetQuery ctor, to
        CONSTANT_SCORE_AUTO_REWRITE_DEFAULT, since we have no scores here.
        This rewrite method uses BQ if number of terms/docs is "smallish" else
        creates a filter, so it should give good perf.

        Martijn van Groningen added a comment -

        Cool. I will use the BytesRefHash in the patch and see how this pans out.

        Also, I think you should always seek in your FilteredTermsEnum, and
        simply iterate yourself through the BytesRef[]? Probably perf wasn't
        great before because you were calling TreeSet.higher on each term?

        I've been testing this out with treeSet.higher(). In the case where there are gaps between the terms, the performance is much better with seeking than without seeking (I picked random terms in a small test). If the terms are next to / near each other, it seems better to use a non-seeking implementation. I think we should have a seeking and a non-seeking impl available. Maybe there should be an option named useSeekingTermsEnum which decides which impl to use?

        I think you should call setRewriteMethod in your TermSetQuery ctor, to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT

        CONSTANT_SCORE_AUTO_REWRITE_DEFAULT is already the default in MTQ, right?

        Michael McCandless added a comment -

        Oh yeah MTQ already defaults to that rewrite method

        Really, since you know the exact number of terms up front... you could specialize it to constant BQ or filter yourself... but that can be later.

        I don't think you should be calling TreeSet.higher on every term check – that's a relatively costly binary search through the TreeSet each time. Really you should be able to just walk through (with your own upto++) the pre-sorted BytesRef[] you pulled from the BytesRefHash. Then we should retest seek vs next...
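
        For illustration, the "own upto++" walk over a pre-sorted array looks roughly like this (plain String[] here; the patch works on the sorted BytesRef[] pulled from the BytesRefHash):

        // Decide, for each term coming from the terms dict in sorted order, whether it is
        // one of the collected join terms, advancing a single cursor over the sorted array
        // instead of doing a binary search (TreeSet.higher) per term.
        final class SortedTermsMatcher {
          private final String[] sortedJoinTerms; // pre-sorted collected terms
          private int upto;                       // cursor into sortedJoinTerms

          SortedTermsMatcher(String[] sortedJoinTerms) {
            this.sortedJoinTerms = sortedJoinTerms;
          }

          // Terms must be fed in sorted order; returns true if dictTerm is a join term.
          boolean accept(String dictTerm) {
            while (upto < sortedJoinTerms.length) {
              int cmp = sortedJoinTerms[upto].compareTo(dictTerm);
              if (cmp == 0) return true;  // exact match: accept this term
              if (cmp > 0) return false;  // collected term is ahead: dictTerm not wanted
              upto++;                     // collected term is behind: advance the cursor
            }
            return false;                 // ran out of collected terms
          }
        }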

        Martijn van Groningen added a comment - - edited

        Attached new patch.

        • Extracted the collector from JoinUtil and named it TermsCollector.
        • TermsCollector uses BytesRefHash.
        • Renamed TermSetQuery to TermsQuery.
        • TermsQuery uses a sorted BytesRef[].
        • TermsCollector has two concrete impls: one for a single value per field and one for multiple values per field.
        • Moved the classes for query time joining under the package o.a.l.search.join.query.
        • Added documentation.

        The JoinUtil is now just a convenience class.

        Should we add the existing block joining classes under o.a.l.search.join.index or o.a.l.search.join.block?

        I think this patch is ready to get committed.

        Martijn van Groningen added a comment -

        Oops... I found out that the assertion in FilteredTermsEnum line 207 was failing. Before I attached the previous patch I ran the tests from my IDE and assertions weren't enabled...

        In some cases I was seeking backwards, which isn't good. The attached patch fixes this issue. All the tests in the join module pass now.

        Michael McCandless added a comment -

        Looking great!

        • Can't TermsCollector be package private? Like it's only used
          privately in JoinUtil.createJoinQuery?
        • Can JoinUtil.createJoinQuery return Query not MultiTermQuery...?
          Just gives us freedom in the future to impl a different Query if
          we want/need to...
        • Typo: collecter -> collector
        • Can TermsQuery be package private? Also, we can save a pass over
          the sorted terms by having TermsQuery hold the int[] ord array and
          then just do the lookup (against BytesRefHash) as it
          goes... really TermsQuery could just take the BytesRefHash. Then
          we wouldn't have to materialize a new BytesRef for each matched
          term... just reuse a single scratch BytesRef inside TermsQuery.
        • In TermsQuery.accept... should the return of AcceptStatus.YES
          in the if (cmp == 0) case be a YES_AND_SEEK
          (after setting the next term as the seekTerm)?
        • Hmm we sort by unicode order in TermsCollector.getCollectedTerms,
          but then by the term dict's comparator in TermsQuery; maybe just
          use the UTF8AsUnicode comparator in TermsQuery too? And note in
          jdocs that this is required?

        Michael McCandless added a comment -

        Also: I wonder if we should use DocValues in addition to (instead of?) FieldCache.getTerms...

        Martijn van Groningen added a comment -

        I think that joining by docvalues would be a great addition! I do think that joining by indexed terms should be possible too.
        To support joining by docvalues we can make a different TermsCollector implementation, but I'm unsure how the TermsQuery can be used. Can an MTQ also iterate over docvalues?

        I think we should commit what we currently have first and then address joining by docvalues in a new issue.

        Michael McCandless added a comment -

        To support joining by docvalues we can make a different TermsCollector implementation, but I'm unsure how the TermsQuery can be used.

        Actually I think TermsQuery (once we change it to just take the BytesRefHash and iterate over the int[] ord instead of BytesRef[]) can be re-used.

        Ie, when collecting terms from DocValues, you'd just stuff them into the BytesRefHash like you do now... so supporting terms from DocValues should simply be another private impl inside TermsCollector (just like you can now pull from DocTermOrds or FieldCache).

        But I agree: let's do this (enable JoinQuery to use DocValues) in a follow-on issue!

        Robert Muir added a comment -

        Hi Martijn: the patch is looking really nice!

        Just a trivial thing, can we swap ctor arguments of

        public TermsQuery(BytesRef[] terms, String field) {
        

        to

        public TermsQuery(String field, BytesRef[] terms) {
        

        Most of the apis on IndexReader etc. take field, term, so I think it looks more consistent.

        Martijn van Groningen added a comment -

        Attached a new patch!

        • The static method now has Query as return type.
        • TermsCollector and TermsQuery are package protected.
        • TermsQuery now holds an int[] ords instead of an array of BytesRefs. This is a really nice improvement.
        • Swapped the ctor args of TermsQuery. Consistency is important!
        • After some discussion, moved the classes back to the o.a.l.search.join package.
        • The query time joining uses the dict comparator in all places now (instead of BytesRef#UTF8AsUnicode).

        In the case that cmp==0 in TermsQuery at line 121, YES_AND_SEEK can't be returned. The FilteredTermsEnum prohibits seeking to a term smaller than or equal to the current term.

        Martijn van Groningen added a comment -

        Small update. Removed unused method.

        Martijn van Groningen added a comment -

        Fixed the YES_AND_SEEK issue. In the previously described case it uses YES_AND_SEEK now without seeking to the current term (that triggered the assertion error).

        Michael McCandless added a comment -

        +1 it looks great! Good work

        Martijn van Groningen added a comment -

        Attached a new patch.

        • Fixed some small comment typos.
        • Removed the reuse field in the TermsQuery, since it doesn't make sense to reuse because it is assigned null! Also I cannot know if the caller is done with the TermsEnum.

        It is ready. I'll commit this asap.

        Martijn van Groningen added a comment -

        Committed to trunk. I'll leave this issue open for the backport to 3.6

        Jason Rutherglen added a comment -

        Sweet! How would join work in distributed mode? That would be very useful for BigData projects.

        Robert Muir added a comment -

        Just a few ideas about 3.x backport:

        • MultiTermQuery doesn't have a real 'seek' there... you have to open up and swap in a new 'actualEnum' (does this clone the indexinput etc?) starting at the new seek position... I think it's tricky and ugly. Maybe we can/should add seek to the 3.x TermEnum/FilteredTermEnum so this will work nicer...? I haven't looked at how hairy this would be though.
        • The rewrites are not always per-segment like they are in trunk, and it could be bad to "seek" a lot when MTQ is in 'auto mode' because, if I recall, it's expensive (creating lots of multitermsenums). Maybe since Join knows up-front how many terms are in the hash it should determine which method to use itself (setRewriteMethod)... maybe on 3.x just set filter rewrite since you know it will be "seeking"? Since you return the query the user could still always override the rewrite method if they really care, it would just be a default.

        Martijn van Groningen added a comment -

        Sweet! How would join work in distributed mode? That would be very useful for BigData projects.

        The join is actually executed as a two pass search. During the first pass search all the terms are gathered for the matching documents based on the fromQuery. In the second pass search all documents are collected that match the gathered terms for a specific field. The only way I currently see how this can work in a distributed environment is that all machines in the cluster execute the first pass search and then copy the collected terms between machines. After this is done each machine can execute the second pass. If your data allows it, you can partition the data in your cluster; this allows you to skip the copying of terms.

        Currently the api is just one static method and assumes that the joining happens locally. I think we need to have two more methods. One method that returns the first pass terms and one method that constructs a query based on terms from the first pass.
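
        Those two extra methods are hypothetical at this point; the second one could look roughly like the sketch below, which builds a plain BooleanQuery over the copied terms (illustration only; unlike the committed TermsQuery it is subject to BooleanQuery.maxClauseCount):

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanClause;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermQuery;

        final class DistributedJoinSketch {
          // Hypothetical second-pass helper: turn the terms gathered (and copied between
          // shards) in the first pass into a query against toField on the local shard.
          static Query queryFromTerms(String toField, java.util.Collection<String> fromTerms) {
            BooleanQuery query = new BooleanQuery();
            for (String term : fromTerms) {
              query.add(new TermQuery(new Term(toField, term)), BooleanClause.Occur.SHOULD);
            }
            return query;
          }
        }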

        Robert: Yes, I see that the 3.x MTQ isn't as great as the MTQ in trunk. Maybe we need a different approach (not use MTQ)? The api is clean for users, and allows us to do the joining differently in 3x. I'll start backporting and see how well it goes.

        Jason Rutherglen added a comment -

        I was reviewing this issue for use where Solr's join implementation may not be the right choice.

        In this Lucene Join implementation, a new BytesRefHash is built per query (and cannot be reused). This could generate a fair amount of garbage quickly.

        Also the sort compare using BRH is per byte (not as cheap as an ord compare). We can probably use DocTermsIndex to replace the use of BytesRefHash by comparing DTI's ords. Then we save copying the bytes into the BRH per query, and the comparison would be faster.

        Additionally, for a join with many terms, the number of postings could become a factor in performance. Because we are not caching bitsets like Solr does, it seems like an excellent occasion for a faster less-compressed codec.

        Further, to save on term seeking, if the term state was cached (eg, the file pointers into the posting list), the iteration on the terms dict would be removed.

        Granted all this requires more RAM, however in many cases (eg, mine), that would be acceptable.

        Martijn van Groningen added a comment -

        I'm not sure how you plan to sort by DTI ords. The terms collected in the first phase are from many segments. The ords from DTI are only valid inside a segment. You can create a toplevel DTI but that is expensive...

        Currently caching is minimal and can be improved at the cost of more RAM. The TermsCollector caches the from terms via DocTerms in the FC (per segment). Caching can be improved in the second phase as you described, by saving a bitset per fromTerm. But I think we first need to tackle how bitsets are cached in general. Solr caches (which the Solr JoinQuery uses) are top level (one commit and you lose it all). I'm unsure how to cache the posting list file pointers with the current Lucene api... I think this (joining) caching should be addressed in a new issue.

        Performance of the JoinUtil looks actually quite good from what I have measured. I have a test data set containing 100000 products and 100 offers per product (10100000 docs in total). The JoinUtil is between 2 and 3 times faster than Solr's JoinQuery with this data set on my dev machine.

        Martijn van Groningen added a comment -

        Attached initial patch for 3x branch.

        • I replaced the BytesRefHash with a Set for now. FC in 3x is String based.
        • TermsQuery doesn't seek, but does a binary search to find the index in the String[] (created from the Set).... Not ideal. So we need some seeking.

        I ran some quick tests and performance is 3 to 4 times worse than the JoinUtil committed to trunk.

        Jason Rutherglen added a comment -

        The terms collected in the first phase are from many segments

        Why is that necessary?

        Caching can be improved in the second phase as you described, by saving a bitset per fromTerm?

        Possibly, only for terms with a high number of documents. Or we can use a faster to decode (less compressed) posting codec.

        The JoinUtil is between 2 till 3 times faster than Solr's JoinQuery with this data set on my dev machine

        Interesting, thanks for sharing.

        Jason Rutherglen added a comment -

        Just following up on the per-segment terms collection. Join is going to be used as a filter in most cases. Filters can be applied per-segment (unlike scoring queries). So it seems possible to avoid the BRH creation by using the DTI?

        Martijn van Groningen added a comment -

        Jason: Better late than never... BRH is used to collect the matching from terms. The DTI just contains all terms / ords for a field. Comparing DTI ords isn't going to work when a term is in more than one segment or appears in a different field (fromField / toField). So I think the BRH can't be replaced by the DTI. The BRH could be cached per query.

        Martijn van Groningen added a comment -

        Attached an updated version of query time joining for the 3x branch. Instead of doing a binary search for each term comparison it seeks / iterates forward. It can't do seeking like we do in trunk, so it isn't as fast as in trunk. However I do think this can be committed to at least have query time join support in 3x. Backporting per-segment filtering and the MTQ that is in trunk is quite some work...

        Martijn van Groningen added a comment -

        Committed latest 3x patch to branch3x.

        Martijn van Groningen added a comment -

        The joining in 3x is ~3 times slower than the joining committed to trunk. This is mainly caused by the fact that the MTQ in trunk can do seeking while the MTQ in 3x can only increment to the next term.


          People

          • Assignee: Unassigned
          • Reporter: Martijn van Groningen
          • Votes: 0
          • Watchers: 2
