  1. Lucene - Core
  2. LUCENE-7231

Problem with NGramAnalyzer, PhraseQuery and Highlighter

    Details

    • Lucene Fields:
      New

      Description

      Using the Highlighter with an n-gram analyzer (NGramAnalyzer below) and a PhraseQuery, searching for a substring of length N yields the following exception:

      java.lang.IllegalArgumentException: Less than 2 subSpans.size():1
      at org.apache.lucene.search.spans.ConjunctionSpans.<init>(ConjunctionSpans.java:40)
      at org.apache.lucene.search.spans.NearSpansOrdered.<init>(NearSpansOrdered.java:56)
      at org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight.getSpans(SpanNearQuery.java:232)
      at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:292)
      at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:137)
      at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:506)
      at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:219)
      at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:187)
      at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:196)
      

      Below is a JUnit test reproducing this behavior. The problem does not occur when searching for a string with more than N characters or when using NGramPhraseQuery.
      Why is more than one subSpan required?

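      import static org.junit.Assert.assertEquals;

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.ngram.NGramTokenizer;
      import org.apache.lucene.search.PhraseQuery;
      import org.apache.lucene.search.highlight.Highlighter;
      import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
      import org.apache.lucene.search.highlight.QueryScorer;
      import org.apache.lucene.search.highlight.SimpleFragmenter;
      import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
      import org.apache.lucene.util.BytesRef;
      import org.junit.Rule;
      import org.junit.Test;
      import org.junit.rules.ExpectedException;

      public class HighlighterTest {

          @Rule
          public final ExpectedException exception = ExpectedException.none();

          @Test
          public void testHighlighterWithPhraseQueryThrowsException() throws IOException, InvalidTokenOffsetsException {

              // Fixed-size n-grams with N = 4.
              final Analyzer analyzer = new NGramAnalyzer(4);
              final String fieldName = "substring";

              // A phrase with exactly one term of length N triggers the problem.
              final List<BytesRef> list = new ArrayList<>();
              list.add(new BytesRef("uchu"));
              final PhraseQuery query = new PhraseQuery(fieldName, list.toArray(new BytesRef[list.size()]));

              final QueryScorer fragmentScorer = new QueryScorer(query, fieldName);
              final SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b>", "</b>");

              exception.expect(IllegalArgumentException.class);
              exception.expectMessage("Less than 2 subSpans.size():1");

              // TextEncoder is not shown in this report and is not a Lucene class;
              // presumably a helper from the reporter's project that returns a highlight Encoder.
              final Highlighter highlighter = new Highlighter(formatter, TextEncoder.NONE.getEncoder(), fragmentScorer);
              highlighter.setTextFragmenter(new SimpleFragmenter(100));
              final String fragment = highlighter.getBestFragment(analyzer, fieldName, "Buchung");

              // Not reached while the bug is present: getBestFragment throws first.
              assertEquals("B<b>uchu</b>ng", fragment);

          }

          // Analyzer that emits only n-grams of exactly minNGram characters.
          public final class NGramAnalyzer extends Analyzer {

              private final int minNGram;

              public NGramAnalyzer(final int minNGram) {
                  super();
                  this.minNGram = minNGram;
              }

              @Override
              protected TokenStreamComponents createComponents(final String fieldName) {
                  // Minimum and maximum gram size are both minNGram.
                  final Tokenizer source = new NGramTokenizer(minNGram, minNGram);
                  return new TokenStreamComponents(source);
              }

          }

      }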
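
To see why a search string of exactly N characters ends up as a one-term phrase, here is a minimal, Lucene-free sketch of fixed-size character n-gram splitting. The class and method names (`NGramSketch`, `ngrams`) are hypothetical and only approximate what a tokenizer configured with identical minimum and maximum gram sizes produces; offsets and positions are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a fixed-size character n-gram tokenizer.
// It does not use Lucene and ignores offsets and positions.
final class NGramSketch {

    // Returns every substring of length n, in order of appearance.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start + n <= text.length(); start++) {
            grams.add(text.substring(start, start + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Buchung", 4)); // [Buch, uchu, chun, hung]
        System.out.println(ngrams("uchu", 4));    // [uchu]
        System.out.println(ngrams("uchun", 4));   // [uchu, chun]
    }
}
```

With N = 4, the search string "uchu" yields a single gram, so a phrase built from it has one term, whereas "uchun" yields two grams. This matches the observation in the description that longer search strings do not trigger the exception.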
      
      
      1. LUCENE-7231.patch
        6 kB
        Alan Woodward

        Activity

        romseygeek Alan Woodward added a comment -

        Here's a patch including your test case and a fix: WeightedSpanTermExtractor needed to check whether a PhraseQuery had only one term and, if so, rewrite it to a SpanTermQuery rather than a SpanNearQuery.
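
The check described in this comment can be sketched roughly as follows. This is a simplified, Lucene-free model and not the code from LUCENE-7231.patch: `PhraseToSpanSketch`, `nearSpans`, `termSpan` and `rewrite` are hypothetical names, and spans are represented as plain strings purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the decision described in the comment above.
// The names are illustrative and are not Lucene classes.
final class PhraseToSpanSketch {

    // Mimics the check behind the reported exception:
    // an ordered "near" conjunction needs at least two sub-spans.
    static String nearSpans(List<String> terms) {
        if (terms.size() < 2) {
            throw new IllegalArgumentException("Less than 2 subSpans.size():" + terms.size());
        }
        return "near" + terms;
    }

    // A single term needs no conjunction at all.
    static String termSpan(String term) {
        return "term[" + term + "]";
    }

    // Guarded rewrite: a one-term phrase becomes a term span,
    // anything else goes through the near-span path.
    static String rewrite(List<String> phraseTerms) {
        if (phraseTerms.size() == 1) {
            return termSpan(phraseTerms.get(0));
        }
        return nearSpans(phraseTerms);
    }

    public static void main(String[] args) {
        System.out.println(rewrite(Arrays.asList("uchu")));         // term[uchu]
        System.out.println(rewrite(Arrays.asList("uchu", "chun"))); // near[uchu, chun]
    }
}
```

Without the single-term branch in `rewrite`, the one-term phrase from the test case would reach `nearSpans` and fail with the same message as in the stack trace.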

        jira-bot ASF subversion and git services added a comment -

        Commit 7793c06a30eb25ee08ee11a57ca696d3da4744b5 in lucene-solr's branch refs/heads/master from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7793c06 ]

        LUCENE-7231: WeightedSpanTermExtractor correctly deals with single-term PhraseQuery

        jira-bot ASF subversion and git services added a comment -

        Commit 35024d3edc52cb20c7f36860da750cda694a3311 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=35024d3 ]

        LUCENE-7231: WeightedSpanTermExtractor correctly deals with single-term PhraseQuery

        romseygeek Alan Woodward added a comment -

        Thanks Eva!

        Eva Popenda added a comment -

        Thank you, Alan, for fixing this!

        steve_rowe Steve Rowe added a comment -

        Reopening to backport to 6.0.1.

        jira-bot ASF subversion and git services added a comment -

        Commit 2457d96e37ba711b9d1ac0e74ed405a9917a0f7d in lucene-solr's branch refs/heads/branch_6_0 from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2457d96 ]

        LUCENE-7231: WeightedSpanTermExtractor correctly deals with single-term PhraseQuery

        steve_rowe Steve Rowe added a comment -

        Bulk close issues included in the 6.0.1 release.

        steve_rowe Steve Rowe added a comment -

        Reopening to backport to 5.6 and 5.5.2.

        jira-bot ASF subversion and git services added a comment -

        Commit 90e823ed37edcce3984296ba6f16654d47f65d64 in lucene-solr's branch refs/heads/branch_5_5 from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=90e823e ]

        LUCENE-7231: WeightedSpanTermExtractor correctly deals with single-term PhraseQuery

        jira-bot ASF subversion and git services added a comment -

        Commit c92703d3875bf8a47ff828d5910f78772e3841af in lucene-solr's branch refs/heads/branch_5x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c92703d ]

        LUCENE-7231: WeightedSpanTermExtractor correctly deals with single-term PhraseQuery

        Thomas Kappler added a comment -

        The above commits fixed the issue for PhraseQuery, but it still exists for MultiPhraseQuery.

        I have a test case and a patch, although they are for 5.5.0, which is what we use. I can update the patch, though. Should I submit a patch against the 5x branch?

        dsmiley David Smiley added a comment -

        Since this issue is already in a release (multiple, in fact), please file a new issue.

        Thomas Kappler added a comment -

        Thanks David. Filed LUCENE-7417.


          People

          • Assignee:
            romseygeek Alan Woodward
            Reporter:
            Eva Popenda