Lucene - Core
  1. Lucene - Core
  2. LUCENE-2287

Unexpected terms are highlighted within nested SpanQuery instances

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.9.1
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      Linux, Solaris, Windows

    • Lucene Fields:
      New

      Description

      I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.

      This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

      /*

      String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
      //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay

      String fieldName = "SOME_FIELD_NAME";

      SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[]

      { new SpanTermQuery(new Term(fieldName, "lucene")), new SpanTermQuery(new Term(fieldName, "doug")) }

      , 5, true);

      Query query = new SpanNearQuery(new SpanQuery[]

      { spanNear, new SpanTermQuery(new Term(fieldName, "hadoop")) }

      , 4, true);

      String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
      //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";

      String observed = highlightField(query, fieldName, theText);
      System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);

      assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
      }

      Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

      1. LUCENE-2287.patch
        175 kB
        Michael Goddard
      2. LUCENE-2287.patch
        194 kB
        Michael Goddard
      3. LUCENE-2287.patch
        191 kB
        Michael Goddard
      4. LUCENE-2287.patch
        190 kB
        Michael Goddard
      5. LUCENE-2287.patch
        189 kB
        Michael Goddard
      6. LUCENE-2287.patch
        190 kB
        Michael Goddard

        Activity

        Hide
        Salman Akram added a comment -

        Hi,

        It seems the last patch was committed with still couple of failures. Any update on this? Do you think this is still better than the default highlighter?

        Thanks!

        Show
        Salman Akram added a comment - Hi, It seems the last patch was committed with still couple of failures. Any update on this? Do you think this is still better than the default highlighter? Thanks!
        Hide
        Michael Goddard added a comment -

        Mark,

        I apologize for that. Eclipse nicely reformatted that code for me, and I will see if I can fix that over the coming weekend.

        Show
        Michael Goddard added a comment - Mark, I apologize for that. Eclipse nicely reformatted that code for me, and I will see if I can fix that over the coming weekend.
        Hide
        Mark Miller added a comment -

        Hey Michael - there is a lot of reformatting it looks like in this patch - if its not that much of a hassle, is it possible to get a patch without the formats?

        Show
        Mark Miller added a comment - Hey Michael - there is a lot of reformatting it looks like in this patch - if its not that much of a hassle, is it possible to get a patch without the formats?
        Hide
        Michael Goddard added a comment -

        It seems that the discussion here on overlapping spans bears on the current issue:

        http://issues.apache.org/jira/browse/LUCENE-569?focusedCommentId=12411904&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12411904

        I'll investigate that, experiment with allowing spans to overlap, and see how that goes.

        Show
        Michael Goddard added a comment - It seems that the discussion here on overlapping spans bears on the current issue: http://issues.apache.org/jira/browse/LUCENE-569?focusedCommentId=12411904&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12411904 I'll investigate that, experiment with allowing spans to overlap, and see how that goes.
        Hide
        Michael Goddard added a comment -

        Two tests still fail, but more of my own tests are passing. The issue with testSpanHighlighting in HighlighterTest may be due to the way overlapping spans are handled by NearSpansUnordered. I don't think this issue arose before since the span was just the outermost span.

        Show
        Michael Goddard added a comment - Two tests still fail, but more of my own tests are passing. The issue with testSpanHighlighting in HighlighterTest may be due to the way overlapping spans are handled by NearSpansUnordered. I don't think this issue arose before since the span was just the outermost span.
        Hide
        Michael Goddard added a comment -

        The backward compatibility break was adding

        public abstract Spans[] getSubSpans();

        to the Spans class. I had to do this to enable the recursion on Spans and figured it was the way to go since NearSpansUnordered and NearSpansOrdered had this method.

        Show
        Michael Goddard added a comment - The backward compatibility break was adding public abstract Spans[] getSubSpans(); to the Spans class. I had to do this to enable the recursion on Spans and figured it was the way to go since NearSpansUnordered and NearSpansOrdered had this method.
        Hide
        Mark Miller added a comment -

        Breaks backward compatibility, so need to find a way around that

        Wouldn't be the end of the world depending on the break.

        Show
        Mark Miller added a comment - Breaks backward compatibility, so need to find a way around that Wouldn't be the end of the world depending on the break.
        Hide
        Michael Goddard added a comment -
        • Removed the Java 5 code which had crept in
        • Two tests still fail; need to solve
        • Breaks backward compatibility, so need to find a way around that
        Show
        Michael Goddard added a comment - Removed the Java 5 code which had crept in Two tests still fail; need to solve Breaks backward compatibility, so need to find a way around that
        Hide
        Michael Goddard added a comment -

        0 errors, 2 failures.

        Show
        Michael Goddard added a comment - 0 errors, 2 failures.
        Hide
        Michael Goddard added a comment -

        This patch is closer: 2 errors and 1 failure on the JUnit test. Still needs a ton of cleanup.

        Show
        Michael Goddard added a comment - This patch is closer: 2 errors and 1 failure on the JUnit test. Still needs a ton of cleanup.
        Hide
        Michael Goddard added a comment -

        JUnit error solved, but there are still four tests failing. Caveats still apply to this version of the patch.

        Show
        Michael Goddard added a comment - JUnit error solved, but there are still four tests failing. Caveats still apply to this version of the patch.
        Hide
        Michael Goddard added a comment -

        This is a very rough initial patch, has way too much cruft, but I wanted to park it somewhere nonetheless. ** not meant to be applied yet **

        Show
        Michael Goddard added a comment - This is a very rough initial patch, has way too much cruft, but I wanted to park it somewhere nonetheless. ** not meant to be applied yet **

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael Goddard
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 336h
              336h
              Remaining:
              Remaining Estimate - 336h
              336h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development