Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7575

UnifiedHighlighter: add requireFieldMatch=false support

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The UnifiedHighlighter (like the PostingsHighlighter) only supports highlighting queries for the same fields that are being highlighted. The original Highlighter and FVH support loosening this, AKA requireFieldMatch=false.

      1. LUCENE-7575.patch
        26 kB
        Jim Ferenczi
      2. LUCENE-7575.patch
        40 kB
        Jim Ferenczi
      3. LUCENE-7575.patch
        43 kB
        Jim Ferenczi

        Activity

        Hide
        dsmiley David Smiley added a comment -

        Hoss notified me of my screw-up on the dev list where I replied to him and told him I fixed it (on the 6th):
        http://mail-archives.apache.org/mod_mbox/lucene-dev/201612.mbox/%3cCABEwPvF-Qjiuy7a4FT-EUmG4J2ptvufOLDjUKkdvoUxQaS+7NQ@mail.gmail.com%3e
        Here's the commit to CHANGES.txt https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blobdiff;f=lucene/CHANGES.txt;h=21a80e00619d6723fc6d163b5068eae6f25b097f;hp=853f17125b938559c3fab94a8d1aca91a620e29e;hb=8c943b6084ffec3e22b16f4997df1f9fc55f20f5;hpb=135604f6327032d0258227aaa524369203d40822
        Perhaps I should have referenced the issue but the change/correction itself was CHANGES.txt housekeeping that wasn't actually about any particular issue.

        Show
        dsmiley David Smiley added a comment - Hoss notified me of my screw-up on the dev list where I replied to him and told him I fixed it (on the 6th): http://mail-archives.apache.org/mod_mbox/lucene-dev/201612.mbox/%3cCABEwPvF-Qjiuy7a4FT-EUmG4J2ptvufOLDjUKkdvoUxQaS+7NQ@mail.gmail.com%3e Here's the commit to CHANGES.txt https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blobdiff;f=lucene/CHANGES.txt;h=21a80e00619d6723fc6d163b5068eae6f25b097f;hp=853f17125b938559c3fab94a8d1aca91a620e29e;hb=8c943b6084ffec3e22b16f4997df1f9fc55f20f5;hpb=135604f6327032d0258227aaa524369203d40822 Perhaps I should have referenced the issue but the change/correction itself was CHANGES.txt housekeeping that wasn't actually about any particular issue.
        Hide
        mikemccand Michael McCandless added a comment -

        David Smiley Please see Hoss Man's comment above?

        Show
        mikemccand Michael McCandless added a comment - David Smiley Please see Hoss Man 's comment above?
        Hide
        hossman Hoss Man added a comment -

        6x backport (commit 4e7a7dbf) wasn't clean – added a 7.0.0 section to CHANGES.txt ... not sure if anything else came along for the ride that wasn't suppose to.

        Show
        hossman Hoss Man added a comment - 6x backport (commit 4e7a7dbf) wasn't clean – added a 7.0.0 section to CHANGES.txt ... not sure if anything else came along for the ride that wasn't suppose to.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 2e948fea300f883b7dfb586e303d5720d09b3210 in lucene-solr's branch refs/heads/apiv2 from David Smiley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2e948fe ]

        LUCENE-7575: Add UnifiedHighlighter field matcher predicate (AKA requireFieldMatch=false)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 2e948fea300f883b7dfb586e303d5720d09b3210 in lucene-solr's branch refs/heads/apiv2 from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2e948fe ] LUCENE-7575 : Add UnifiedHighlighter field matcher predicate (AKA requireFieldMatch=false)
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        I was thinking a bit more about the wastefulness of re-creating SpanQueries with different field that are otherwise identical. Some day we could refactor out from WSTE a Query -> SpanQuery conversion utility that furthermore allows you to re-target the field. With that in place, we could avoid the waste for PhraseQuery and MultiPhraseQuery – the most typical position-sensitive queries.

        I agree, I'll work on this shortly. Thanks for the hint

        Show
        jim.ferenczi Jim Ferenczi added a comment - I was thinking a bit more about the wastefulness of re-creating SpanQueries with different field that are otherwise identical. Some day we could refactor out from WSTE a Query -> SpanQuery conversion utility that furthermore allows you to re-target the field. With that in place, we could avoid the waste for PhraseQuery and MultiPhraseQuery – the most typical position-sensitive queries. I agree, I'll work on this shortly. Thanks for the hint
        Hide
        dsmiley David Smiley added a comment -

        Thank you.

        I was thinking a bit more about the wastefulness of re-creating SpanQueries with different field that are otherwise identical. Some day we could refactor out from WSTE a Query -> SpanQuery conversion utility that furthermore allows you to re-target the field. With that in place, we could avoid the waste for PhraseQuery and MultiPhraseQuery – the most typical position-sensitive queries.

        Show
        dsmiley David Smiley added a comment - Thank you . I was thinking a bit more about the wastefulness of re-creating SpanQueries with different field that are otherwise identical. Some day we could refactor out from WSTE a Query -> SpanQuery conversion utility that furthermore allows you to re-target the field. With that in place, we could avoid the waste for PhraseQuery and MultiPhraseQuery – the most typical position-sensitive queries.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 4e7a7dbf9a56468f41e89f5289833081b27f1b14 in lucene-solr's branch refs/heads/branch_6x from David Smiley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4e7a7db ]

        LUCENE-7575: Add UnifiedHighlighter field matcher predicate (AKA requireFieldMatch=false)

        (cherry picked from commit 2e948fe)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 4e7a7dbf9a56468f41e89f5289833081b27f1b14 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4e7a7db ] LUCENE-7575 : Add UnifiedHighlighter field matcher predicate (AKA requireFieldMatch=false) (cherry picked from commit 2e948fe)
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 2e948fea300f883b7dfb586e303d5720d09b3210 in lucene-solr's branch refs/heads/master from David Smiley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2e948fe ]

        LUCENE-7575: Add UnifiedHighlighter field matcher predicate (AKA requireFieldMatch=false)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 2e948fea300f883b7dfb586e303d5720d09b3210 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2e948fe ] LUCENE-7575 : Add UnifiedHighlighter field matcher predicate (AKA requireFieldMatch=false)
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks David !

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks David !
        Hide
        dsmiley David Smiley added a comment -

        Oh I see; right.

        I'll commit your patch later this evening.

        Show
        dsmiley David Smiley added a comment - Oh I see; right. I'll commit your patch later this evening.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks David !
        Here is a new patch to address your last comments. Now we have a FieldFilteringTermSet and extractTerms uses a simple HashSet.

        couldn't defaultFieldMatcher be initialized to non-null to match the same field? Then getFieldMatcher() would simply return it.

        Not as a Predicate<String> since the predicate is only on the candidate field name. We could use a BiPredicate<String, String> to always provide the current field name to the predicate but I find it simpler this way.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks David ! Here is a new patch to address your last comments. Now we have a FieldFilteringTermSet and extractTerms uses a simple HashSet. couldn't defaultFieldMatcher be initialized to non-null to match the same field? Then getFieldMatcher() would simply return it. Not as a Predicate<String> since the predicate is only on the candidate field name. We could use a BiPredicate<String, String> to always provide the current field name to the predicate but I find it simpler this way.
        Hide
        dsmiley David Smiley added a comment -

        Tim, by the way, the style of checking booleans with == false is common in Lucene deliberately... some folks (like Rob and perhaps others) feel this is actually more clear than a leading exclamation point. I sorta agree but don't have a strong opinion. So I tend to follow this now too.

        Show
        dsmiley David Smiley added a comment - Tim, by the way, the style of checking booleans with == false is common in Lucene deliberately... some folks (like Rob and perhaps others) feel this is actually more clear than a leading exclamation point. I sorta agree but don't have a strong opinion. So I tend to follow this now too.
        Hide
        dsmiley David Smiley added a comment -

        This is looking really good now Jim! I like this Predicate<String> approach.

        UH:

        • Maybe change UH.extractTerms to simply be a Set (HashSet) since we needn't pay any sorting expense up front any longer.
        • couldn't defaultFieldMatcher be initialized to non-null to match the same field? Then getFieldMatcher() would simply return it.

        PhraseHelper:

        • the comment on the fieldName field about being non-null isn't true anymore; in fact it's required. Perhaps add Objects.requireNonNull(...) in c'tor if you want.
        • I can see why you changed FieldFilteringTermHashSet to extend TreeSet. But you now need to modify the javadocs & class name accordingly; perhaps removing the implementation detail like this

        Nice tests. That's it.

        Show
        dsmiley David Smiley added a comment - This is looking really good now Jim! I like this Predicate<String> approach. UH: Maybe change UH.extractTerms to simply be a Set (HashSet) since we needn't pay any sorting expense up front any longer. couldn't defaultFieldMatcher be initialized to non-null to match the same field? Then getFieldMatcher() would simply return it. PhraseHelper: the comment on the fieldName field about being non-null isn't true anymore; in fact it's required. Perhaps add Objects.requireNonNull(...) in c'tor if you want. I can see why you changed FieldFilteringTermHashSet to extend TreeSet. But you now need to modify the javadocs & class name accordingly; perhaps removing the implementation detail like this Nice tests. That's it.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks David Smiley and Timothy M. Rodriguez !

        I pushed a new patch to address your comments.

        it'd be interesting if instead of a simple boolean toggle, if it were a Predicate<String> fieldMatchPredicate so that only some fields could be collected in the query but not all. Just an idea.

        I agree and this is why I changed the patch to include your idea. By default nothing changes, queries are extracted based on the field name to highlight. Though with this change the user can now define which query (based on the field name) should be highlighted. I think it's better like this but I can revert if you think this should not implemented in the first iteration.

        I fixed the bugs that David spotted (terms from different fields not sorted after filteredExtractTerms and redundant initialization of the filter leaf reader for the span queries) and split the tests based on the type of query that is tested.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks David Smiley and Timothy M. Rodriguez ! I pushed a new patch to address your comments. it'd be interesting if instead of a simple boolean toggle, if it were a Predicate<String> fieldMatchPredicate so that only some fields could be collected in the query but not all. Just an idea. I agree and this is why I changed the patch to include your idea. By default nothing changes, queries are extracted based on the field name to highlight. Though with this change the user can now define which query (based on the field name) should be highlighted. I think it's better like this but I can revert if you think this should not implemented in the first iteration. I fixed the bugs that David spotted (terms from different fields not sorted after filteredExtractTerms and redundant initialization of the filter leaf reader for the span queries) and split the tests based on the type of query that is tested.
        Hide
        Timothy055 Timothy M. Rodriguez added a comment -

        Looks good to me too. Some additional suggestions:

        UnifiedHighlighter:

        • +1 on the suggestion to use HighlightFlags instead.

        PhraseHelper:

        • It's clearer in my opinion to change the boolean branch to something like
           if (!requireFieldMatch) {} else {} 

          instead of checking

           requireFieldMatch == false 

          . Even better would be swapping the branches so it's

          if (requireFieldBranch) {} else {}
        • Similar point for line 287
           if (requireFieldMatch && fieldName.equals(queryTerm.field()) == false) {} 

        TestUnifiedHiglighter:

        • I think it'd be clearer to separate the the cases for term/phrase/multi-term queries into separate tests. This makes it easier to chase bugs down the line if only 1 fails. (And provides more information if all 3 fail)
        Show
        Timothy055 Timothy M. Rodriguez added a comment - Looks good to me too. Some additional suggestions: UnifiedHighlighter: +1 on the suggestion to use HighlightFlags instead. PhraseHelper: It's clearer in my opinion to change the boolean branch to something like if (!requireFieldMatch) {} else {} instead of checking requireFieldMatch == false . Even better would be swapping the branches so it's if (requireFieldBranch) {} else {} Similar point for line 287 if (requireFieldMatch && fieldName.equals(queryTerm.field()) == false ) {} TestUnifiedHiglighter: I think it'd be clearer to separate the the cases for term/phrase/multi-term queries into separate tests. This makes it easier to chase bugs down the line if only 1 fails. (And provides more information if all 3 fail)
        Hide
        dsmiley David Smiley added a comment -

        Thanks for this contribution Jim! You're the first one to improve the UH outside of those who first worked on it.

        Overall patch looks pretty good.

        UnifiedHighlighter:

        • What do you think of adding this to the HighlightFlags instead? It's intended to be a single point to capture various boolean options. As an aside, I'm kinda wondering if the default* and DEFAULT* boolean fields shouldn't exist and instead simply have a highlightFlags enumSet field.
        • I think the results of filterExtractedTerms might now contain duplicated terms (BytesRefs)? (see my note later about testing the same term and varying the field). We could simply collect those bytes into a HashSet, then extract to an array and then sort.

        PhraseHelper:

        • You applied SingleFieldFilterLeafReader at the top of getTermToSpans but I think this should be done by the caller so it happens just once, not per SpanQuery.
        • FieldRewritingTermHashSet is so close to the other other one... hmm.... what if we had just one, remove "static" from class (thus has access to fieldName & requireFieldMatch), and then implement add() appropriately.

        Tests:

        • you used the same test input string for both the "field" and "field_require_field_match" fields. To make this more clear; can you vary them if even a little?
        • in no test queries do I see the same term BytesRef across more than one field. For example, maybe add a test incorporating something like field:test OR field_require_field_match:test – granted the results might not be interesting but lets hope it doesn't puke. Do for phrase as well.

        I agree this requireFieldMatch=false should not be the default. It'll add some overhead – especially for phrase and other position sensitive queries since we aren't de-duplicating them. Besides, it's more accurate as-is.

        As an aside... it'd be interesting if instead of a simple boolean toggle, if it were a Predicate<String> fieldMatchPredicate so that only some fields could be collected in the query but not all. Just an idea.

        Show
        dsmiley David Smiley added a comment - Thanks for this contribution Jim! You're the first one to improve the UH outside of those who first worked on it. Overall patch looks pretty good. UnifiedHighlighter: What do you think of adding this to the HighlightFlags instead? It's intended to be a single point to capture various boolean options. As an aside, I'm kinda wondering if the default* and DEFAULT* boolean fields shouldn't exist and instead simply have a highlightFlags enumSet field. I think the results of filterExtractedTerms might now contain duplicated terms (BytesRefs)? (see my note later about testing the same term and varying the field). We could simply collect those bytes into a HashSet, then extract to an array and then sort. PhraseHelper: You applied SingleFieldFilterLeafReader at the top of getTermToSpans but I think this should be done by the caller so it happens just once, not per SpanQuery. FieldRewritingTermHashSet is so close to the other other one... hmm.... what if we had just one, remove "static" from class (thus has access to fieldName & requireFieldMatch), and then implement add() appropriately. Tests: you used the same test input string for both the "field" and "field_require_field_match" fields. To make this more clear; can you vary them if even a little? in no test queries do I see the same term BytesRef across more than one field. For example, maybe add a test incorporating something like field:test OR field_require_field_match:test – granted the results might not be interesting but lets hope it doesn't puke. Do for phrase as well. I agree this requireFieldMatch=false should not be the default. It'll add some overhead – especially for phrase and other position sensitive queries since we aren't de-duplicating them. Besides, it's more accurate as-is. As an aside... it'd be interesting if instead of a simple boolean toggle, if it were a Predicate<String> fieldMatchPredicate so that only some fields could be collected in the query but not all. Just an idea.
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Patch for requireFieldMatch

        Show
        jim.ferenczi Jim Ferenczi added a comment - Patch for requireFieldMatch
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Hi David Smiley,
        I've attached a patch based on the comment above. I did not find a clean way to detect duplicates in the span queries extracted by the PhraseHelper when requireFieldMatch=false. I agree that it's not essential so I pushed the patch as is. Could you please take a look ?

        Show
        jim.ferenczi Jim Ferenczi added a comment - Hi David Smiley , I've attached a patch based on the comment above. I did not find a clean way to detect duplicates in the span queries extracted by the PhraseHelper when requireFieldMatch=false. I agree that it's not essential so I pushed the patch as is. Could you please take a look ?
        Hide
        dsmiley David Smiley added a comment -

        This should be straight-forward to add to the term & automata extraction routines.

        PhraseHelper is a bit more work when it comes time to actually executing the SpanQuery... it probably needs to wrap the LeafReader in order to always return the field being highlighted no matter which field is being asked for. If the query contains the same span query except differentiated by field (e.g. same SpanNear tree but only different field), it would be nice if it could be identified so that we don't do the work additional times. But that's not essential.

        Show
        dsmiley David Smiley added a comment - This should be straight-forward to add to the term & automata extraction routines. PhraseHelper is a bit more work when it comes time to actually executing the SpanQuery... it probably needs to wrap the LeafReader in order to always return the field being highlighted no matter which field is being asked for. If the query contains the same span query except differentiated by field (e.g. same SpanNear tree but only different field), it would be nice if it could be identified so that we don't do the work additional times. But that's not essential.

          People

          • Assignee:
            dsmiley David Smiley
            Reporter:
            dsmiley David Smiley
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development