Lucene - Core
  1. Lucene - Core
  2. LUCENE-1494

masking field of span for cross searching across multiple fields (many-to-one style)

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This issue is to cover the changes required to do a search across multiple fields with the same name in a fashion similar to a many-to-one database. Below is my post on java-dev on the topic, which details the changes we need:

      We have an interesting situation where we are effectively indexing two 'entities' in our system, which share a one-to-many relationship (imagine 'User' and 'Delivery Address' for demonstration purposes). At the moment, we index one Lucene Document per 'many' end, duplicating the 'one' end data, like so:

      userid: 1
      userfirstname: fred
      addresscountry: au
      addressphone: 1234

      userid: 1
      userfirstname: fred
      addresscountry: nz
      addressphone: 5678

      userid: 2
      userfirstname: mary
      addresscountry: au
      addressphone: 5678

      (note: 2 Documents indexed for user 1). This is somewhat annoying for us, because when we search in Lucene the results we want back (conceptually) are at the 'user' level, so we have to collapse the results by distinct user id, etc. etc (let alone that it blows out the size of our index enormously). So why do we do it? It would make more sense to use multiple fields:
      userid: 1
      userfirstname: fred
      addresscountry: au
      addressphone: 1234
      addresscountry: nz
      addressphone: 5678

      userid: 2
      userfirstname: mary
      addresscountry: au
      addressphone: 5678

      But imagine the search "+addresscountry:au +addressphone:5678". We'd like this to match ONLY Mary, but of course it matches Fred also because he matches both those terms (just for different addresses).

      There are two aspects to the approach we've (more or less) got working but I'd like to run them past the group and see if they're worth trying to get them into Lucene proper (if so, I'll create a JIRA issue for them)

      1) Use a modified SpanNearQuery. If we assume that country + phone will always be one token, we can rely on the fact that the positions of 'au' and '5678' in Fred's document will be different.

      SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
      SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
      SpanQuery snq = new SpanNearQuery(new SpanQuery[]

      {q1, q2}

      , 0, false);

      the slop of 0 means that we'll only return those where the two terms are in the same position in their respective fields. This works brilliantly, BUT requires a change to SpanNearQuery's constructor (which checks that all the clauses are against the same field). Are people amenable to perhaps adding another constructor to SNQ which doesn't do the check, or subclassing it to do the same (give it a protected non-checking constructor for the subclass to call)?

      2) (snipped ... see LUCENE-1626 for second idea)

      1. LUCENE-1494-masking.patch
        17 kB
        Hoss Man
      2. LUCENE-1494-masking.patch
        5 kB
        Paul Cowan
      3. LUCENE-1494-multifield.patch
        9 kB
        Paul Cowan
      4. LUCENE-1494-positionincrement.patch
        6 kB
        Paul Cowan

        Issue Links

          Activity

          Hide
          Paul Cowan added a comment -

          Patch for part 1. Introduces a new sublcass of SpanNearQuery, which might be overkill, but the semantics here are quite different (and it interacts oddly with some other classes – getField(), for instance, can't be meaningfully implemented – so having a separate class here made sense to me. Includes test cases.

          Show
          Paul Cowan added a comment - Patch for part 1. Introduces a new sublcass of SpanNearQuery, which might be overkill, but the semantics here are quite different (and it interacts oddly with some other classes – getField(), for instance, can't be meaningfully implemented – so having a separate class here made sense to me. Includes test cases.
          Hide
          Paul Cowan added a comment -

          Patch for part 2). This follows the my original idea suggested above; Chris Hostetter suggested another approach:

          "couldn't this be solved by an Analyzer that counts the token per fieldname and implements getPositionIncrementGap as..
          int result - SOME_BIG_NUM - tokensSeenMap.get(fieldname);
          tokensSeenMap.put(fieldname, 0);
          return result;"

          but I think this seems much lower overhead and, while it affects the Analyzer base class (and so is potentially high-impact) the way it's implemented won't affect existing implementations and gently deprecates the old way, while still letting implementing subclasses work as they did before, so I think this is low impact. Interested to see what people think.

          Show
          Paul Cowan added a comment - Patch for part 2). This follows the my original idea suggested above; Chris Hostetter suggested another approach: "couldn't this be solved by an Analyzer that counts the token per fieldname and implements getPositionIncrementGap as.. int result - SOME_BIG_NUM - tokensSeenMap.get(fieldname); tokensSeenMap.put(fieldname, 0); return result;" but I think this seems much lower overhead and, while it affects the Analyzer base class (and so is potentially high-impact) the way it's implemented won't affect existing implementations and gently deprecates the old way, while still letting implementing subclasses work as they did before, so I think this is low impact. Interested to see what people think.
          Hide
          Paul Cowan added a comment -

          2 things about those patches:

          1) I did them against 2.4, not trunk. This is because I didn't because Luke isn't working against trunk and I wanted to double-check the index state during tests
          2) I don't know WHY I put the changes in 'CHANGES.txt' where I did (under the 2.4.0 release, which is already out). I'm not drunk, I swear. If this lands, those will obviously have to be put in the right spot.

          Show
          Paul Cowan added a comment - 2 things about those patches: 1) I did them against 2.4, not trunk. This is because I didn't because Luke isn't working against trunk and I wanted to double-check the index state during tests 2) I don't know WHY I put the changes in 'CHANGES.txt' where I did (under the 2.4.0 release, which is already out). I'm not drunk, I swear. If this lands, those will obviously have to be put in the right spot.
          Hide
          Andrzej Bialecki added a comment -

          Luke should work with trunk, possibly with only minor patches. Just grab the luke-0.9.jar and add jars from Lucene trunk on the classpath.

          Show
          Andrzej Bialecki added a comment - Luke should work with trunk, possibly with only minor patches. Just grab the luke-0.9.jar and add jars from Lucene trunk on the classpath.
          Hide
          Hoss Man added a comment -

          i've only looked at LUCENE-1494-multifield.patch ... one problem i see is that SpanNearQuery stores and utilizes the field name in ways that don't make sense for the new MultiFieldSpanNearQuery subclass (ie: getField, .

          I would suggest that instead you invert the inheritence: move the guts of SpanNearQuery into MultiFieldSpanNearQuery and make it a superclass of SpanNearQuery. This also eliminates the need for the mustBeSameField param...

          public class SpanNearQuery extends MultiFieldSpanNearQuery {
            private final String field;
            public String getField() { return field; }
            public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
              super(clauses, slop, inOrder, true);
              for (int i = 0; i < clauses.length; i++) {
                SpanQuery clause = clauses[i];
                if (i == 0) { field = clause[i].getField(); }
                } else if (!clause[i[.getField().equals(field)) {
                  throw new IllegalArgumentException("Clauses must have same field.");
                }
            }
            // :TODO: need to override equals from super ... maybe hashCode too
          }
          
          Show
          Hoss Man added a comment - i've only looked at LUCENE-1494 -multifield.patch ... one problem i see is that SpanNearQuery stores and utilizes the field name in ways that don't make sense for the new MultiFieldSpanNearQuery subclass (ie: getField, . I would suggest that instead you invert the inheritence: move the guts of SpanNearQuery into MultiFieldSpanNearQuery and make it a superclass of SpanNearQuery. This also eliminates the need for the mustBeSameField param... public class SpanNearQuery extends MultiFieldSpanNearQuery { private final String field; public String getField() { return field; } public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) { super (clauses, slop, inOrder, true ); for ( int i = 0; i < clauses.length; i++) { SpanQuery clause = clauses[i]; if (i == 0) { field = clause[i].getField(); } } else if (!clause[i[.getField().equals(field)) { throw new IllegalArgumentException( "Clauses must have same field." ); } } // :TODO: need to override equals from super ... maybe hashCode too }
          Hide
          Paul Cowan added a comment -

          Hi Hoss,

          I don't disagree that an inverted inheritance hierarchy would make more sense, but the problem with that is that getField (which I think is the only thing on SpanNearQuery that doesn't really make sense for a MultiField one) is mandated by the abstract method declaration of same in SpanQuery, which the inverted parent class would still extend. Looking at where getField() is used (primarily in SpanWeight.explain() and SpanWeight and BoostingTermWeight's .scorer() methods) I'm not sure how I can meaningfully deal with those in the case of a multifield span query.

          If you (or anyone else) have any suggestions for that then I'm all ears, this would be really useful for us (and a lot of other people I think, it's not an uncommon query on the lists etc).

          Personally I'd be equally happy with just eliminating the same-field requirement (as you mentioned, I think, that Doug suggested) but those explain()s and scorer() methods would still need to be changed. Any ideas?

          Paul

          Show
          Paul Cowan added a comment - Hi Hoss, I don't disagree that an inverted inheritance hierarchy would make more sense, but the problem with that is that getField (which I think is the only thing on SpanNearQuery that doesn't really make sense for a MultiField one) is mandated by the abstract method declaration of same in SpanQuery, which the inverted parent class would still extend. Looking at where getField() is used (primarily in SpanWeight.explain() and SpanWeight and BoostingTermWeight's .scorer() methods) I'm not sure how I can meaningfully deal with those in the case of a multifield span query. If you (or anyone else) have any suggestions for that then I'm all ears, this would be really useful for us (and a lot of other people I think, it's not an uncommon query on the lists etc). Personally I'd be equally happy with just eliminating the same-field requirement (as you mentioned, I think, that Doug suggested) but those explain()s and scorer() methods would still need to be changed. Any ideas? Paul
          Hide
          Hoss Man added a comment -

          I don't disagree that an inverted inheritance hierarchy would make more sense, but the problem with that is that getField (which I think is the only thing on SpanNearQuery that doesn't really make sense for a MultiField one) is mandated by the abstract method declaration of same in SpanQuery.

          Ah, right right ... of course. I was thinking getField was just a SpanNearQuery concept, but it's actually central to the whole concept of SpanQuery.

          This actually raises some interesting questions about the behavior of all of this...

          Beyond just the explain methods, SpanQuery.getField plays two important roles:

          1. it determines what norms get used by SpanScorer
          2. it dictates what other SpanQueries this query can be nested in – so far we've really only discussed directly executing a MultiFieldSpanNearQuery, but we also have to consider what it means to combine a MultiFieldSpanNearQuery in another SpanQuery

          At the moment, your patch treats the first element of the SpanQuery[] used to construct the MultiFieldSpanNearQuery as "special" – it specifies the field which determines the norms used and what oher SpanQueries it can be combined with. At a minimum that special case behavior needs to be documented, but we may also want to consider tweaking the API to make it more explicit (ie: perhaps when constructing a MultiFieldSpanNearQuery you should be required to explicitly state the field name you want to use). It may also be worth considering whether or not MultiFieldSpanNearQuery should use a custom Scorer that takes into account the norms of all the fields (averaging them maybe?)

          (FWIW: this highlights one of the reasons why a multi-field PhraseQuery would be much simpler to implement then a multi-field SpanNearQuery ... the super class of PhraseQuery (Query) has no inherent concept of a field, so it would be easy to inject a new superclass in the middle there)

          The more i think about this, the more i wonder if a simpler solution would be a SpanQuery that wrapped another SpanQuery and proxied all of hte method except for the getField() method, ie...

          public class MaskFieldSpanQuery extends SpanQuery {
            SpanQuery inner;
            String field;
            public MaskFieldSpanQuery(String field, SpanQuery inner) { ... }
            public String getField() { return field; }
            public Spans getSpans(IndexReader r) { return inner.getSpans(reader); }
            public PayloadSpans getPayloadSpans(IndexReader reader) { ...
            ...
          }
          

          I haven't tested this out, but it seems that wrapping a bunch of SpanQueries in something like this and then building up a SpanNearQuery should be functionally equivalent to the existing MultiFieldSpanNearQuery in the patch, but would also allow for other interesting things (like a SpanNotQuery where you want to find all docs that match on rating_year:2004 but not if rating_score:POOR matches in the same position.

          what do people think?

          Show
          Hoss Man added a comment - I don't disagree that an inverted inheritance hierarchy would make more sense, but the problem with that is that getField (which I think is the only thing on SpanNearQuery that doesn't really make sense for a MultiField one) is mandated by the abstract method declaration of same in SpanQuery. Ah, right right ... of course. I was thinking getField was just a SpanNearQuery concept, but it's actually central to the whole concept of SpanQuery. This actually raises some interesting questions about the behavior of all of this... Beyond just the explain methods, SpanQuery.getField plays two important roles: it determines what norms get used by SpanScorer it dictates what other SpanQueries this query can be nested in – so far we've really only discussed directly executing a MultiFieldSpanNearQuery, but we also have to consider what it means to combine a MultiFieldSpanNearQuery in another SpanQuery At the moment, your patch treats the first element of the SpanQuery[] used to construct the MultiFieldSpanNearQuery as "special" – it specifies the field which determines the norms used and what oher SpanQueries it can be combined with. At a minimum that special case behavior needs to be documented, but we may also want to consider tweaking the API to make it more explicit (ie: perhaps when constructing a MultiFieldSpanNearQuery you should be required to explicitly state the field name you want to use). It may also be worth considering whether or not MultiFieldSpanNearQuery should use a custom Scorer that takes into account the norms of all the fields (averaging them maybe?) (FWIW: this highlights one of the reasons why a multi-field PhraseQuery would be much simpler to implement then a multi-field SpanNearQuery ... the super class of PhraseQuery (Query) has no inherent concept of a field, so it would be easy to inject a new superclass in the middle there) The more i think about this, the more i wonder if a simpler solution would be a SpanQuery that wrapped another SpanQuery and proxied all of hte method except for the getField() method, ie... public class MaskFieldSpanQuery extends SpanQuery { SpanQuery inner ; String field; public MaskFieldSpanQuery( String field, SpanQuery inner ) { ... } public String getField() { return field; } public Spans getSpans(IndexReader r) { return inner .getSpans(reader); } public PayloadSpans getPayloadSpans(IndexReader reader) { ... ... } I haven't tested this out, but it seems that wrapping a bunch of SpanQueries in something like this and then building up a SpanNearQuery should be functionally equivalent to the existing MultiFieldSpanNearQuery in the patch, but would also allow for other interesting things (like a SpanNotQuery where you want to find all docs that match on rating_year:2004 but not if rating_score:POOR matches in the same position. what do people think?
          Hide
          Paul Cowan added a comment - - edited

          Hoss, I think you're right. I think the MaskFieldSpanQuery is a much nicer and cleaner implementation – and a lot easier to explain in Javadoc. 'Note that if you use this, the norms of the 'impersonated' field will be used for scoring'. That makes it completely clear what's happening and that if you use norms (we don't use them at all, which is why it doesn't bother us and we didn't pick up on that) and means that we don't have to have MultiFieldSpanOrQuery etc... it will all just work, and I think it's conceptually a lot clearer.

          Your other idea is a good one... all the SpanQueries could take in a strategy for dealing with multiple fields

          enum MultiFieldStrategy {
             SAME_FIELD_ONLY, // Current behaviour
             FIRST_FIELD_ADDED,
             AVERAGE_NORMS,
             DO_SOMETHING_ELSE_WITH_NORMS // ... or whatever
          }
          

          but I think that's overkill for now – at least for us – and the 'masking' span query is much much nicer. I'll look at coming up with a patch that incorporates this instead (but still includes the getPositionIncrementGap() change, as that's still needed per the above)

          Show
          Paul Cowan added a comment - - edited Hoss, I think you're right. I think the MaskFieldSpanQuery is a much nicer and cleaner implementation – and a lot easier to explain in Javadoc. 'Note that if you use this, the norms of the 'impersonated' field will be used for scoring'. That makes it completely clear what's happening and that if you use norms (we don't use them at all, which is why it doesn't bother us and we didn't pick up on that) and means that we don't have to have MultiFieldSpanOrQuery etc... it will all just work, and I think it's conceptually a lot clearer. Your other idea is a good one... all the SpanQueries could take in a strategy for dealing with multiple fields enum MultiFieldStrategy { SAME_FIELD_ONLY, // Current behaviour FIRST_FIELD_ADDED, AVERAGE_NORMS, DO_SOMETHING_ELSE_WITH_NORMS // ... or whatever } but I think that's overkill for now – at least for us – and the 'masking' span query is much much nicer. I'll look at coming up with a patch that incorporates this instead (but still includes the getPositionIncrementGap() change, as that's still needed per the above)
          Hide
          Paul Cowan added a comment -

          A patch to use Hoss' "masking" idea to support this query. I think this is relatively clean and much easier to explain in the Javadoc.

          Show
          Paul Cowan added a comment - A patch to use Hoss' "masking" idea to support this query. I think this is relatively clean and much easier to explain in the Javadoc.
          Hide
          Hoss Man added a comment -

          some things looked like they wouldn't work with the masking patch, so i wrote some test cases to convince myself they were broken (and because new code should always have test cases). In particular i was worried about the lack of equals/hashCode methods, and the broken rewrite method

          one interesting thing I discovered was that the code worked in many cases even though rewrite was constantly just returning the masked inner query – digging into it i realized the reason was because none of the other SpanQuery classes pay any attention to what their nested clauses return when they recursively rewrite, so a SpanNearQuery whose constructor freaks out if the fields of all the clauses don't match, happily generates spans if one of those clauses returns a complteley different SpanQuery on rewrite.

          I also removed the proxying of getBoost and setBoost ... it was causing problems with some stock testing framework code that expects a q1.equals(q1.clone().setBoost(newBoost)) to be false (this was evaluating to true because it's a shallow clone and setBoost was proxying and modifying the original inner query's boost value) ... this means that FieldMaskingSpanQuery is consistent with how other SpanQueries deal with boost (they ignore the boosts of their nested clauses)

          new patch (with tests) attached ... i'd like to have some more tests before committing (spans is deep voodoo, we're doing funky stuff with spans, all the more reason to test thoroughly)

          Show
          Hoss Man added a comment - some things looked like they wouldn't work with the masking patch, so i wrote some test cases to convince myself they were broken (and because new code should always have test cases). In particular i was worried about the lack of equals/hashCode methods, and the broken rewrite method one interesting thing I discovered was that the code worked in many cases even though rewrite was constantly just returning the masked inner query – digging into it i realized the reason was because none of the other SpanQuery classes pay any attention to what their nested clauses return when they recursively rewrite, so a SpanNearQuery whose constructor freaks out if the fields of all the clauses don't match, happily generates spans if one of those clauses returns a complteley different SpanQuery on rewrite. I also removed the proxying of getBoost and setBoost ... it was causing problems with some stock testing framework code that expects a q1.equals(q1.clone().setBoost(newBoost)) to be false (this was evaluating to true because it's a shallow clone and setBoost was proxying and modifying the original inner query's boost value) ... this means that FieldMaskingSpanQuery is consistent with how other SpanQueries deal with boost (they ignore the boosts of their nested clauses) new patch (with tests) attached ... i'd like to have some more tests before committing (spans is deep voodoo, we're doing funky stuff with spans, all the more reason to test thoroughly)
          Hide
          Hoss Man added a comment -

          trimming description due to forking of second idea to LUCENE-1626

          Show
          Hoss Man added a comment - trimming description due to forking of second idea to LUCENE-1626
          Hide
          Hoss Man added a comment -

          Committed revision 770794.

          Thanks for your patch Paul.

          The committed version is near-identical to my last revised patch, but with more tests (100% coverage ... woot!)

          Note: I cloned this issue so the positionIncrementGap patch changes could be addressed separately in LUCENE-1626 since it hasn't had any discussion in this issue so far, and constitute a fundamentally different type of change (even if the two ideas ultimately aid in a single larger use case)

          Show
          Hoss Man added a comment - Committed revision 770794. Thanks for your patch Paul. The committed version is near-identical to my last revised patch, but with more tests (100% coverage ... woot!) Note: I cloned this issue so the positionIncrementGap patch changes could be addressed separately in LUCENE-1626 since it hasn't had any discussion in this issue so far, and constitute a fundamentally different type of change (even if the two ideas ultimately aid in a single larger use case)

            People

            • Assignee:
              Hoss Man
              Reporter:
              Paul Cowan
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development