  1. Lucene.Net
  2. LUCENENET-599

Fine-grained word segmentation combined with VectorHighlight throws ArgumentOutOfRangeException

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lucene.Net 4.8.0
    • Fix Version/s: None
    • Labels:
    • Environment:

      System:

      Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) )

      Lucene version: 4.8.0-beta00005

      Word segmentation tool: jieba

    • Flags:
      Patch, Important

      Description

      The text to analyze:

      "主体内容来自并且自己加了点基本数据结构数组链表,双向链表"

      With fine-grained segmentation enabled, it was tokenized into:

      "

      主体/ 内容/ 来自/ 并且/ 自己/ 加/ 了/ 点/ 基本/ 数据/ 结构/ 数据结构/ 数组/ 链表/ ,/ 双向/ 链表

      "
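      The overlap between the fine-grained token "数据" and the longer term "基本数据结构" can be checked directly. The following is a minimal sketch in plain Java, not part of Lucene; the class `TokenOffsets` and its helpers are hypothetical, and it assumes end-exclusive offsets (the analysis further down quotes inclusive end positions, so its end values are one smaller).

```java
class TokenOffsets {
    /** Returns {start, end} (end exclusive) of the first occurrence of token in text, or null if absent. */
    static int[] locate(String text, String token) {
        int start = text.indexOf(token);
        return start < 0 ? null : new int[] { start, start + token.length() };
    }

    /** True when the two half-open ranges share at least one character. */
    static boolean overlaps(int[] a, int[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    public static void main(String[] args) {
        String text = "主体内容来自并且自己加了点基本数据结构数组链表,双向链表";
        int[] fine = locate(text, "数据");
        int[] coarse = locate(text, "基本数据结构");
        System.out.println("数据: [" + fine[0] + ", " + fine[1] + ")");
        System.out.println("基本数据结构: [" + coarse[0] + ", " + coarse[1] + ")");
        System.out.println("overlap: " + overlaps(fine, coarse));
    }
}
```

      Running it reports "数据" at [15, 17) and "基本数据结构" at [13, 19), so the shorter token lies entirely inside the longer one.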

      Searching with the query “数据,基本数据结构” threw the following exception:

      System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.

      Parameter name: length

         at System.String.Substring(Int32 startIndex, Int32 length)

         at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.MakeFragment(StringBuilder buffer, Int32[] index, Field[] values, WeightedFragInfo fragInfo, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 195

         at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 146

         at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 99

      The reason is the code in vectorHighlighter:

      protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
          String[] preTags, String[] postTags, Encoder encoder ){
        StringBuilder fragment = new StringBuilder();
        final int s = fragInfo.getStartOffset();
        int[] modifiedStartOffset = { s };
        String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
        int srcIndex = 0;
        for( SubInfo subInfo : fragInfo.getSubInfos() ){
          for( Toffs to : subInfo.getTermsOffsets() ){
            fragment
              .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
              .append( getPreTag( preTags, subInfo.getSeqnum() ) )
              .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0], to.getEndOffset() - modifiedStartOffset[0] ) ) )
              .append( getPostTag( postTags, subInfo.getSeqnum() ) );
            srcIndex = to.getEndOffset() - modifiedStartOffset[0];
          }
        }
        fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
        return fragment.toString();
      }

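      Searching with only "基本数据结构" works fine.

      Fine-grained segmentation splits "基本数据结构" again into smaller tokens. When we search for “数据,基本数据结构”, the token "数据" is highlighted first because, in the tokenization above, "数据" comes before "基本数据结构". The position of "数据" in the text is (15,16). After "数据" is highlighted, srcIndex becomes the end position of "数据", that is 16, and the search for the next highlighted term starts from 16. The next term, "基本数据结构", is at (13,18). The fragment before that highlight is therefore taken with src.substring(16,13), which is clearly invalid.

      So the fast highlighter assumes that the highlighted terms follow one another in order in the original text without overlapping. Fine-grained segmentation breaks that assumption and causes the error. A search engine often needs fine-grained segmentation, and using the fast highlighter at search time also improves speed, but the two currently do not work well together.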
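      The failure described above can be reproduced without Lucene. The sketch below is a simplified, standalone copy of the loop in makeFragment, followed by one possible workaround that sorts and merges overlapping offsets before highlighting. The class `OverlapHighlightSketch` and the method names `naive` and `mergeOverlaps` are hypothetical and not part of Lucene or Lucene.Net, and the merge step is only an illustration, not the project's fix. Offsets are end-exclusive here, so "数据" is {15, 17} and "基本数据结构" is {13, 19}. In Java the failing call throws StringIndexOutOfBoundsException, where .NET reports ArgumentOutOfRangeException.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OverlapHighlightSketch {
    /** Simplified makeFragment loop: assumes offsets are ordered and do not overlap. */
    static String naive(String src, List<int[]> offsets, String pre, String post) {
        StringBuilder fragment = new StringBuilder();
        int srcIndex = 0;
        for (int[] to : offsets) {
            fragment.append(src.substring(srcIndex, to[0])) // throws when to[0] < srcIndex
                    .append(pre)
                    .append(src.substring(to[0], to[1]))
                    .append(post);
            srcIndex = to[1];
        }
        return fragment.append(src.substring(srcIndex)).toString();
    }

    /** Sorts offsets by start and merges ranges that overlap, returning new arrays. */
    static List<int[]> mergeOverlaps(List<int[]> offsets) {
        List<int[]> sorted = new ArrayList<>(offsets);
        sorted.sort(Comparator.comparingInt((int[] o) -> o[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] o : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && o[0] < last[1]) {
                last[1] = Math.max(last[1], o[1]); // extend the previous range
            } else {
                merged.add(new int[] { o[0], o[1] });
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String src = "主体内容来自并且自己加了点基本数据结构数组链表,双向链表";
        List<int[]> offsets = new ArrayList<>();
        offsets.add(new int[] { 15, 17 }); // "数据", visited first
        offsets.add(new int[] { 13, 19 }); // "基本数据结构", starts before srcIndex
        try {
            naive(src, offsets, "<b>", "</b>");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive loop failed: " + e.getMessage());
        }
        System.out.println(naive(src, mergeOverlaps(offsets), "<b>", "</b>"));
    }
}
```

      With the merged offsets the two terms collapse into the single range {13, 19}, and the loop returns the text with "基本数据结构" wrapped in one pair of tags. Whether merging is the right behaviour for a highlighter (as opposed to, say, skipping the contained term) is a design choice this sketch does not settle.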

            People

            • Assignee:
              Unassigned
              Reporter:
              SilentCc ChenYongkang

                Time Tracking

                Original Estimate: 96h
                Remaining Estimate: 96h
                Time Spent: Not Specified