SOLR-8334

Highlighting content field problem when using JiebaTokenizerFactory

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 5.3
    • Fix Version/s: None
    • Component/s: highlighter, search
    • Environment: Windows 8.1, Solr 5.3, ZooKeeper 3.4.6, jieba-analysis-1.0.0

    Description

      When I use JiebaTokenizerFactory to index Chinese text in Solr, the segmentation works correctly when I test it with the Analysis function in the Solr Admin UI.

      However, when I use highlighting in Solr, it does not highlight the correct positions. For example, when I search for 自然环境与企业本身, it highlights 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的
      Even when I search for an English word such as responsibility, it highlights <em> responsibilit<em>y.

      In short, the highlighting is consistently off by one character or space.
      This problem only happens in the content field, not in any other field.
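
      The symptom is what one would expect if the token offsets handed to the highlighter were each one position too small. The sketch below is only an illustration of that idea, not Solr's highlighter code: the class OffsetHighlightDemo, the method highlight(), and the sample sentence are all hypothetical, and the assumption is that a highlighter wraps the text between a token's start and end offsets in <em> tags.

```java
// Hypothetical demo class; not part of Solr or jieba-analysis.
class OffsetHighlightDemo {

    // Wrap text[start, end) in <em> tags, the way a highlighter
    // is assumed to use a token's start and end offsets.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "our responsibility to";       // made-up sample sentence
        String term = "responsibility";
        int start = text.indexOf(term);              // 4
        int end = start + term.length();             // 18

        // Correct offsets: our <em>responsibility</em> to
        System.out.println(highlight(text, start, end));

        // Offsets shifted by -1: our<em> responsibilit</em>y to
        System.out.println(highlight(text, start - 1, end - 1));
    }
}
```

      With the offsets shifted by one, the highlight swallows the preceding space and drops the final letter, which matches the shape of the English result reported above.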

      I've made a minor modification to JiebaSegmenter.java, and the highlighting seems to be fine now.

      I declared another int called offset2 in the process() method:
      int offset2 = 0;
      I then changed offset to offset2 in part of the code in the process() method.
      The changes are in the attachment below.
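
      One way to check whether a change like this actually fixes the offsets is to compare each token's offsets against the original text. The following is a minimal sketch under stated assumptions, not the code from the attachment: TokenOffsetCheck, Token, and findMismatches() are hypothetical names, and the tokens are plain (term, start, end) triples rather than Lucene attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic; not part of Solr or jieba-analysis.
class TokenOffsetCheck {

    // Minimal stand-in for a token: its text plus [start, end) offsets.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Return every token whose offsets do not point at its own text.
    static List<Token> findMismatches(String text, List<Token> tokens) {
        List<Token> bad = new ArrayList<>();
        for (Token t : tokens) {
            boolean inRange = t.start >= 0 && t.start <= t.end && t.end <= text.length();
            if (!inRange || !text.substring(t.start, t.end).equals(t.term)) {
                bad.add(t);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String text = "our responsibility to"; // made-up sample sentence
        List<Token> shifted = Arrays.asList(
                new Token("responsibility", 3, 17)); // offsets one too small
        for (Token t : findMismatches(text, shifted)) {
            System.out.println("bad offsets for " + t.term + ": " + t.start + "-" + t.end);
        }
    }
}
```

      If every token passes this check but the highlight is still misplaced, the offsets themselves are probably fine and the cause is more likely elsewhere, for example in how the field is configured.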

      Attachments

        1. JiebaSegmenter.java
          8 kB
          Edwin Yeo Zheng Lin

          People

            Assignee: Unassigned
            Reporter: Edwin Yeo Zheng Lin (edwinyeozl)

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified