  1. Lucene - Core
  2. LUCENE-2287

Unexpected terms are highlighted within nested SpanQuery instances

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.9.1
    • Fix Version/s: 7.3
    • Component/s: modules/highlighter
    • Labels: None
    • Environment: Linux, Solaris, Windows
    • Lucene Fields: New

    Description

      I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.

      This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

      // Added to HighlighterTest. The method signature was lost in formatting;
      // the method name used here is a placeholder.
      public void testNestedSpanQueryHighlighting() throws Exception {
        String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
        //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay

        String fieldName = "SOME_FIELD_NAME";

        // Inner span: "lucene" followed by "doug", in order, with slop 5
        SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term(fieldName, "lucene")),
            new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);

        // Outer span: the inner span followed by "hadoop", in order, with slop 4
        Query query = new SpanNearQuery(new SpanQuery[] {
            spanNear,
            new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

        String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
        //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";

        String observed = highlightField(query, fieldName, theText);
        // Close the quotes around both values so the two lines are easy to compare
        System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");

        assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
      }

      Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.
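
To make the position arithmetic behind the test concrete, here is a minimal sketch in plain Java that does not use Lucene at all. It assumes whitespace tokenization with lowercasing, half-open spans [start, end), and a simplified ordered-near rule (gap between the end of one span and the start of the next must be between 0 and the slop). The class and method names (NestedSpanSketch, termSpans, orderedNear, inside) are hypothetical and are not part of the highlighter; the real NearSpansOrdered logic is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class NestedSpanSketch {

    /** Lowercases and splits on whitespace; a stand-in for a real analyzer. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("\\s+");
    }

    /** One half-open span [pos, pos + 1) per occurrence of the term. */
    static List<int[]> termSpans(String[] tokens, String term) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                spans.add(new int[] {i, i + 1});
            }
        }
        return spans;
    }

    /** Simplified ordered-near: b must start at or after a ends, with gap <= slop. */
    static List<int[]> orderedNear(List<int[]> first, List<int[]> second, int slop) {
        List<int[]> out = new ArrayList<>();
        for (int[] a : first) {
            for (int[] b : second) {
                int gap = b[0] - a[1];
                if (gap >= 0 && gap <= slop) {
                    out.add(new int[] {a[0], b[1]});
                }
            }
        }
        return out;
    }

    /** True if the position falls inside any of the given half-open spans. */
    static boolean inside(List<int[]> spans, int pos) {
        for (int[] s : spans) {
            if (pos >= s[0] && pos < s[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("The Lucene was made by Doug Cutting and Lucene great Hadoop was");

        // Inner: "lucene" then "doug", slop 5. Outer: inner span then "hadoop", slop 4.
        List<int[]> inner = orderedNear(termSpans(tokens, "lucene"), termSpans(tokens, "doug"), 5);
        List<int[]> outer = orderedNear(inner, termSpans(tokens, "hadoop"), 4);

        for (int[] s : inner) System.out.println("inner span: " + Arrays.toString(s));
        for (int[] s : outer) System.out.println("outer span: " + Arrays.toString(s));

        for (int[] t : termSpans(tokens, "lucene")) {
            System.out.println("lucene@" + t[0]
                + " insideInner=" + inside(inner, t[0])
                + " insideOuter=" + inside(outer, t[0]));
        }
    }
}
```

Under these assumptions the only inner span is [1, 6) and the only outer span is [1, 11). The second "lucene", at position 8, lies outside the inner span but inside the outer one. If the highlighter only checks whether a query term's position falls within the outermost span, that would produce the observed highlighting; whether WeightedSpanTermExtractor actually behaves this way is a guess at this point, not something confirmed in this report.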

      Attachments

        1. LUCENE-2287.patch
          190 kB
          Michael Goddard
        2. LUCENE-2287.patch
          189 kB
          Michael Goddard
        3. LUCENE-2287.patch
          190 kB
          Michael Goddard
        4. LUCENE-2287.patch
          191 kB
          Michael Goddard
        5. LUCENE-2287.patch
          194 kB
          Michael Goddard
        6. LUCENE-2287.patch
          175 kB
          Michael Goddard

            People

              Assignee: David Smiley (dsmiley)
              Reporter: Michael Goddard (michael.goddard)
              Votes: 1
              Watchers: 6

                Time Tracking

                  Estimated: 336h
                  Remaining: 336h
                  Logged: Not Specified