Solr / SOLR-1826

highlighting breaks when using WordDelimiterFilter and setting termOffsets=true

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: highlighter
    • Labels:
      None

      Description

      When using the WordDelimiterFilter with termOffsets set to true, highlighting breaks in certain cases. This did not happen in the 1.3 release.
      For example, if I index the term "PowerShot.com" and search for pow*, the highlighting snippet contains <em>Power</em><em>PowerShot.com</em>.
      I will attach a patch that adds tests to the highlighter unit test to demonstrate the issue.

      1. SOLR-1826.patch
        15 kB
        Stefan Oestreicher
      2. SOLR-1826.patch
        75 kB
        Sanjoy Ghosh
      3. SOLR-1826.txt
        2 kB
        Stefan Oestreicher
      4. SOLR-1826.txt
        2 kB
        Stefan Oestreicher
      5. SOLR-1826.txt
        4 kB
        Stefan Oestreicher

          Activity

          Stefan Oestreicher added a comment -

          attached patch demonstrates the problem

          Stefan Oestreicher added a comment -

          I just realised that the field type definition in my patch is unnecessary. I removed it and set the termOffsets attribute directly for the field.

          Stefan Oestreicher added a comment -

          updated the patch because I borked the indentation

          Sanjoy Ghosh added a comment -

          Hi,

          I investigated this some more. The problem seems to be in org.apache.lucene.search.highlight.TokenSources, in the method

          public static TokenStream getTokenStream(TermPositionVector tpv, boolean tokenPositionsGuaranteedContiguous)

          which ends with the following code to sort the tokens into original document order:

              Arrays.sort(tokensInOriginalOrder, new Comparator() {
                public int compare(Object o1, Object o2) {
                  Token t1 = (Token) o1;
                  Token t2 = (Token) o2;
                  if (t1.startOffset() > t2.endOffset())
                    return 1;
                  if (t1.startOffset() < t2.startOffset())
                    return -1;
                  return 0;
                }
              });

          This does not sort the tokens into the right original order. For highlighting to work correctly, the order should be

          lorem, power, powershotcom, shot, com, ipsum

          Instead we are getting

          lorem, power, com, powershotcom, shot, ipsum

          which confuses TokenGroup.isDistinct().

          I would be happy to fix this bug.

          Should we fix this as a Lucene bug, or fix it in Solr by creating a new TokenStream that handles overlapping tokens correctly?
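
The inconsistency described in this comment can be reproduced without Lucene. The sketch below is illustrative only: `Tok`, `OffsetOrderDemo`, `sample()` and the offsets (which assume the indexed text is "lorem PowerShot.com ipsum") are hypothetical and do not come from the attached patches. It shows that the quoted comparator gives asymmetric answers for overlapping tokens, and that one possible total order (start offset, then end offset) yields the expected sequence. This is not necessarily the ordering committed for LUCENE-2874.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OffsetOrderDemo {

    /** Hypothetical stand-in for a Lucene Token: a term plus its character offsets. */
    public static final class Tok {
        public final String term;
        public final int start;
        public final int end;

        public Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Same branch logic as the comparator quoted in the comment above. */
    public static final Comparator<Tok> QUOTED = (t1, t2) -> {
        if (t1.start > t2.end) return 1;
        if (t1.start < t2.start) return -1;
        return 0;
    };

    /** One possible total order: start offset ascending, ties broken by end offset ascending. */
    public static final Comparator<Tok> BY_START_THEN_END = (t1, t2) -> {
        if (t1.start != t2.start) return Integer.compare(t1.start, t2.start);
        return Integer.compare(t1.end, t2.end);
    };

    /** Assumed offsets for "lorem PowerShot.com ipsum", listed in the wrong order reported above. */
    public static Tok[] sample() {
        return new Tok[] {
            new Tok("lorem", 0, 5),
            new Tok("power", 6, 11),
            new Tok("com", 16, 19),
            new Tok("powershotcom", 6, 19),
            new Tok("shot", 11, 15),
            new Tok("ipsum", 20, 25)
        };
    }

    /** Sorts a copy of the tokens with the given comparator and returns the terms in order. */
    public static List<String> sortedTerms(Tok[] tokens, Comparator<Tok> order) {
        Tok[] copy = tokens.clone();
        Arrays.sort(copy, order);
        List<String> terms = new ArrayList<>();
        for (Tok t : copy) {
            terms.add(t.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        Tok com = new Tok("com", 16, 19);
        Tok all = new Tok("powershotcom", 6, 19);
        // Asymmetric result: "equal" one way round, "less than" the other.
        System.out.println(QUOTED.compare(com, all)); // prints 0
        System.out.println(QUOTED.compare(all, com)); // prints -1
        // prints [lorem, power, powershotcom, shot, com, ipsum]
        System.out.println(sortedTerms(sample(), BY_START_THEN_END));
    }
}
```

Because the quoted comparator tests a start offset against an end offset in one branch and against a start offset in the other, `compare(a, b)` and `compare(b, a)` can disagree for overlapping tokens, so the result of `Arrays.sort` is not well defined for them.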

          Sanjoy Ghosh added a comment -

          This is a fix that ensures that overlapping tokens are sorted correctly.

          Sanjoy Ghosh added a comment -

          Just uploaded a patch that should fix this bug. Please let me know if this is not the right fix.

          Stefan Oestreicher added a comment -

          I'm sorry, I completely lost track of this issue. I'll test your patch ASAP and get back to you.

          Stefan Oestreicher added a comment -

          Hi, the fix works for me. The patch didn't apply cleanly to 1.4.0 and 1.4.1. Fixed patch is attached. Thanks.

          Robert Muir added a comment -

          Hello, this same bug has been fixed in the Lucene highlighter: LUCENE-2874

          I applied your test and it passes against current trunk... I also committed it.

          Thanks very much for the work here!!!!

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release


            People

            • Assignee:
              Robert Muir
              Reporter:
              Stefan Oestreicher
            • Votes:
              0
              Watchers:
              3
