Lucene - Core
  1. Lucene - Core
  2. LUCENE-2287

Unexpected terms are highlighted within nested SpanQuery instances

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.9.1
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      Linux, Solaris, Windows

    • Lucene Fields:
      New

      Description

      I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.

      This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

      /*

      String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
      //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay

      String fieldName = "SOME_FIELD_NAME";

      SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[]

      { new SpanTermQuery(new Term(fieldName, "lucene")), new SpanTermQuery(new Term(fieldName, "doug")) }

      , 5, true);

      Query query = new SpanNearQuery(new SpanQuery[]

      { spanNear, new SpanTermQuery(new Term(fieldName, "hadoop")) }

      , 4, true);

      String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
      //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";

      String observed = highlightField(query, fieldName, theText);
      System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);

      assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
      }

      Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

      1. LUCENE-2287.patch
        190 kB
        Michael Goddard
      2. LUCENE-2287.patch
        189 kB
        Michael Goddard
      3. LUCENE-2287.patch
        190 kB
        Michael Goddard
      4. LUCENE-2287.patch
        191 kB
        Michael Goddard
      5. LUCENE-2287.patch
        194 kB
        Michael Goddard
      6. LUCENE-2287.patch
        175 kB
        Michael Goddard

        Issue Links

          Activity

          Hide
          Michael Goddard added a comment -

          This is a very rough initial patch, has way too much cruft, but I wanted to park it somewhere nonetheless. ** not meant to be applied yet **

          Show
          Michael Goddard added a comment - This is a very rough initial patch, has way too much cruft, but I wanted to park it somewhere nonetheless. ** not meant to be applied yet **
          Hide
          Michael Goddard added a comment -

          JUnit error solved, but there are still four tests failing. Caveats still apply to this version of the patch.

          Show
          Michael Goddard added a comment - JUnit error solved, but there are still four tests failing. Caveats still apply to this version of the patch.
          Hide
          Michael Goddard added a comment -

          This patch is closer: 2 errors and 1 failure on the JUnit test. Still needs a ton of cleanup.

          Show
          Michael Goddard added a comment - This patch is closer: 2 errors and 1 failure on the JUnit test. Still needs a ton of cleanup.
          Hide
          Michael Goddard added a comment -

          0 errors, 2 failures.

          Show
          Michael Goddard added a comment - 0 errors, 2 failures.
          Hide
          Michael Goddard added a comment -
          • Removed the Java 5 code which had crept in
          • Two tests still fail; need to solve
          • Breaks backward compatibility, so need to find a way around that
          Show
          Michael Goddard added a comment - Removed the Java 5 code which had crept in Two tests still fail; need to solve Breaks backward compatibility, so need to find a way around that
          Hide
          Mark Miller added a comment -

          Breaks backward compatibility, so need to find a way around that

          Wouldn't be the end of the world depending on the break.

          Show
          Mark Miller added a comment - Breaks backward compatibility, so need to find a way around that Wouldn't be the end of the world depending on the break.
          Hide
          Michael Goddard added a comment -

          The backward compatibility break was adding

          public abstract Spans[] getSubSpans();

          to the Spans class. I had to do this to enable the recursion on Spans and figured it was the way to go since NearSpansUnordered and NearSpansOrdered had this method.

          Show
          Michael Goddard added a comment - The backward compatibility break was adding public abstract Spans[] getSubSpans(); to the Spans class. I had to do this to enable the recursion on Spans and figured it was the way to go since NearSpansUnordered and NearSpansOrdered had this method.
          Hide
          Michael Goddard added a comment -

          Two tests still fail, but more of my own tests are passing. The issue with testSpanHighlighting in HighlighterTest may be due to the way overlapping spans are handled by NearSpansUnordered. I don't think this issue arose before since the span was just the outermost span.

          Show
          Michael Goddard added a comment - Two tests still fail, but more of my own tests are passing. The issue with testSpanHighlighting in HighlighterTest may be due to the way overlapping spans are handled by NearSpansUnordered. I don't think this issue arose before since the span was just the outermost span.
          Hide
          Michael Goddard added a comment -

          It seems that the discussion here on overlapping spans bears on the current issue:

          http://issues.apache.org/jira/browse/LUCENE-569?focusedCommentId=12411904&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12411904

          I'll investigate that, experiment with allowing spans to overlap, and see how that goes.

          Show
          Michael Goddard added a comment - It seems that the discussion here on overlapping spans bears on the current issue: http://issues.apache.org/jira/browse/LUCENE-569?focusedCommentId=12411904&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12411904 I'll investigate that, experiment with allowing spans to overlap, and see how that goes.
          Hide
          Mark Miller added a comment -

          Hey Michael - there is a lot of reformatting it looks like in this patch - if its not that much of a hassle, is it possible to get a patch without the formats?

          Show
          Mark Miller added a comment - Hey Michael - there is a lot of reformatting it looks like in this patch - if its not that much of a hassle, is it possible to get a patch without the formats?
          Hide
          Michael Goddard added a comment -

          Mark,

          I apologize for that. Eclipse nicely reformatted that code for me, and I will see if I can fix that over the coming weekend.

          Show
          Michael Goddard added a comment - Mark, I apologize for that. Eclipse nicely reformatted that code for me, and I will see if I can fix that over the coming weekend.
          Hide
          Salman Akram added a comment -

          Hi,

          It seems the last patch was committed with still couple of failures. Any update on this? Do you think this is still better than the default highlighter?

          Thanks!

          Show
          Salman Akram added a comment - Hi, It seems the last patch was committed with still couple of failures. Any update on this? Do you think this is still better than the default highlighter? Thanks!
          Hide
          Scott Stults added a comment - - edited

          LUCENE-5455 has a few tests that should be added here once this patch is cleaned up.

          There are a few hurdles in cleaning this up though. The first is that this patch was based on a really old version and I can't seem to find anything in SVN or git older than 3.1. The second is that Spans are quite a bit different.

          By the way, I've tried the unit tests in both issues and they still fail in 5.3+. The issue seems to be in WeightedSpanTermExtractor.extractWeightedSpanTerms(). It first builds a list of all position spans, and then it adds all of those position spans to a map of the term irrespective of whether that term was used in that position span. Mike's patch addresses this by keeping a separate list of position spans per term.

          The thing that's not fixed by the patch is the notion of when to stop recursing into the spans. I tried several methods of inspecting and classifying the spans but I either end up with too many positions (resulting in too many term highlights) or too few.

          Alan Woodward, why is this so hard? Can't we just use the same methods the searcher uses? Maybe create a new collector and re-search the incoming doc?

          Show
          Scott Stults added a comment - - edited LUCENE-5455 has a few tests that should be added here once this patch is cleaned up. There are a few hurdles in cleaning this up though. The first is that this patch was based on a really old version and I can't seem to find anything in SVN or git older than 3.1. The second is that Spans are quite a bit different. By the way, I've tried the unit tests in both issues and they still fail in 5.3+. The issue seems to be in WeightedSpanTermExtractor.extractWeightedSpanTerms(). It first builds a list of all position spans, and then it adds all of those position spans to a map of the term irrespective of whether that term was used in that position span. Mike's patch addresses this by keeping a separate list of position spans per term. The thing that's not fixed by the patch is the notion of when to stop recursing into the spans. I tried several methods of inspecting and classifying the spans but I either end up with too many positions (resulting in too many term highlights) or too few. Alan Woodward , why is this so hard? Can't we just use the same methods the searcher uses? Maybe create a new collector and re-search the incoming doc?
          Hide
          Shawn Heisey added a comment -

          I can't seem to find anything in SVN or git older than 3.1

          Lucene and Solr were merged in the 3.1 version. I knew that Solr was in a different place in SVN before that, but I did not know that Lucene was originally in a different location. I found these locations in SVN for older versions:

          http://svn.apache.org/viewvc/lucene/java/
          http://svn.apache.org/viewvc/lucene/solr/

          Show
          Shawn Heisey added a comment - I can't seem to find anything in SVN or git older than 3.1 Lucene and Solr were merged in the 3.1 version. I knew that Solr was in a different place in SVN before that, but I did not know that Lucene was originally in a different location. I found these locations in SVN for older versions: http://svn.apache.org/viewvc/lucene/java/ http://svn.apache.org/viewvc/lucene/solr/
          Hide
          David Smiley added a comment -

          why is this so hard?

          It's mainly because of NearSpansOrdered and caching it has to do on what underlying spans it saw at what positions. By the time NearSpansOrdered says it has a match, the child spans may have moved on.

          Any way, recent changes to the Span APIs in 5x should make this solve-able. See SpanCollector in Lucene 5.3. I believe Alan Woodward may have some plans to update the highlighters to use this. If not I do likewise; stay tuned

          Show
          David Smiley added a comment - why is this so hard? It's mainly because of NearSpansOrdered and caching it has to do on what underlying spans it saw at what positions. By the time NearSpansOrdered says it has a match, the child spans may have moved on. Any way, recent changes to the Span APIs in 5x should make this solve-able. See SpanCollector in Lucene 5.3. I believe Alan Woodward may have some plans to update the highlighters to use this. If not I do likewise; stay tuned
          Hide
          Scott Stults added a comment -

          Thanks Shawn and David. FYI the tests in LUCENE-5455 are actually a superset of the tests attached here and they fail appropriately on 5x. I'll stay tuned!

          Show
          Scott Stults added a comment - Thanks Shawn and David. FYI the tests in LUCENE-5455 are actually a superset of the tests attached here and they fail appropriately on 5x. I'll stay tuned!

            People

            • Assignee:
              Unassigned
              Reporter:
              Michael Goddard
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                Original Estimate - 336h
                336h
                Remaining:
                Remaining Estimate - 336h
                336h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development