Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10153

UnifiedSolrHighlighter support for CustomSeparatorBreakIterator (LUCENE-6485)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5
    • Component/s: highlighter
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Lucene 5.3 added a CustomSeparatorBreakIterator (see LUCENE-6485)

      UnifiedSolrHighlighter should support CustomSeparatorBreakIterator along with existing ones, WholeBreakIterator etc.

      1. SOLR_10153_UH_and_PH_hl_customSeparatorChar.patch
        12 kB
        David Smiley
      2. SOLR_10153_UH_and_PH_hl_customSeparatorChar.patch
        12 kB
        David Smiley
      3. SOLR-10153.patch
        6 kB
        Amrit Sarkar
      4. SOLR-10153.patch
        6 kB
        Amrit Sarkar

        Issue Links

          Activity

          Hide
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - - edited

          SOLR-10153.patch uploaded which incorporates CustomSeparatorBreakIterator in UnifiedSolrHighlighter.

          • added a new request param option to specify which separator char to use. customSeparatorChar.
          • changed UnifiedSolrHighlighter.getBreakIterator to check HighlightParams.BS_TYPE first.
          • if type=='CUSTOM', look for the new separator param, in getBreakIterator validate it's a single char & skip locale parsing.
          • 'WHOLE' option moved from parseBreakIterator to getBreakIterator, as it doesn't depend on locale.

          Changes made in:

          • HighlightParams.java
          • UnifiedSolrHighlighter.java
          • test cases added in TestUnifiedSolrHighlighter
          Show
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - - edited SOLR-10153 .patch uploaded which incorporates CustomSeparatorBreakIterator in UnifiedSolrHighlighter. added a new request param option to specify which separator char to use. customSeparatorChar . changed UnifiedSolrHighlighter.getBreakIterator to check HighlightParams.BS_TYPE first. if type=='CUSTOM', look for the new separator param, in getBreakIterator validate it's a single char & skip locale parsing. 'WHOLE' option moved from parseBreakIterator to getBreakIterator, as it doesn't depend on locale. Changes made in: HighlightParams.java UnifiedSolrHighlighter.java test cases added in TestUnifiedSolrHighlighter
          Hide
          dsmiley David Smiley added a comment -

          Hello Amrit; thanks for contributing this. Why the change of fragsize == 1 to be considered equivalent to WHOLE?

          Aside from the above and some minor tweaks I plan to do, this looks pretty committable. Thanks for the test.

          Show
          dsmiley David Smiley added a comment - Hello Amrit; thanks for contributing this. Why the change of fragsize == 1 to be considered equivalent to WHOLE? Aside from the above and some minor tweaks I plan to do, this looks pretty committable. Thanks for the test.
          Hide
          sarkaramrit2@gmail.com Amrit Sarkar added a comment -

          Mr. Smiley,

          Thank you for the feedback and glad you found the patch good enough.

          Regarding fragsize == 1; I got that wrong for sure, thank you correcting me out. I made some wrong assertions on the fragment size specified which doesn't make sense, was trying to optimise considering the code below in getBreakIterator(String field):

          312:       if (fragsize <= 1 || baseBI instanceof WholeBreakIterator) { // no real minimum size
          313:         return baseBI;
          314:       }
          

          I put the piece of code back where it belong. PF the updated patch.

          I will also appreciate your inputs on SOLR-10152, CustomSeparatorBreakIterator for PostingsSolrHighlighter.

          Thanks,
          Amrit Sarkar

          Show
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - Mr. Smiley, Thank you for the feedback and glad you found the patch good enough. Regarding fragsize == 1 ; I got that wrong for sure, thank you correcting me out. I made some wrong assertions on the fragment size specified which doesn't make sense, was trying to optimise considering the code below in getBreakIterator(String field): 312: if (fragsize <= 1 || baseBI instanceof WholeBreakIterator) { // no real minimum size 313: return baseBI; 314: } I put the piece of code back where it belong. PF the updated patch. I will also appreciate your inputs on SOLR-10152 , CustomSeparatorBreakIterator for PostingsSolrHighlighter. Thanks, Amrit Sarkar
          Hide
          sarkaramrit2@gmail.com Amrit Sarkar added a comment -

          Just a follow up David,

          Are we planning to include this feature in near future? If not, can you let me know what portion needs improvement.

          Show
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - Just a follow up David, Are we planning to include this feature in near future? If not, can you let me know what portion needs improvement.
          Hide
          dsmiley David Smiley added a comment -

          Here is an updated patch that incorporates SOLR-10152 (same change for PostingsSolrHighlighter). You might notice the little changes I made – all quite minor but tidy it up a bit in formatting and style. I'm running tests now and plan to commit later today. Please give it a look over.

          Show
          dsmiley David Smiley added a comment - Here is an updated patch that incorporates SOLR-10152 (same change for PostingsSolrHighlighter). You might notice the little changes I made – all quite minor but tidy it up a bit in formatting and style. I'm running tests now and plan to commit later today. Please give it a look over.
          Hide
          sarkaramrit2@gmail.com Amrit Sarkar added a comment -

          Elegant. I struggle with naming conventions and this make much more sense. Thank you for considering the patch and fixing the necessary.

          Show
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - Elegant. I struggle with naming conventions and this make much more sense. Thank you for considering the patch and fixing the necessary.
          Hide
          dsmiley David Smiley added a comment -

          It's okay; I think all software developers struggle with naming . Goes with the job. Thanks again for the contribution.

          Now that I think about naming again... hmmm, maybe a more useful/memorable Solr parameter for this might be hl.bs.separatorChar ? I understand why you chose the name you did since it's based on a class name but users don't know/char about that. The "hl.bs" namespaces this param as related to the other break iterator related params, which this is. "Custom" is superfluous; we provide it directly via a parameter. Hmmm; do we even need hl.bs.type==CUSTOM if the user sets hl.bs.separatorChar? I guess it should be set so that there is consistency in mapping the type to an algorithm. Though maybe use the value SEPARATOR? And maybe just name this hl.bs.separator to open the door for possibly using an arbitrary length string in the future?

          Thus I propose hl.bs.type=SEPARATOR with e.g. hl.bs.separator=|

          Show
          dsmiley David Smiley added a comment - It's okay; I think all software developers struggle with naming . Goes with the job. Thanks again for the contribution. Now that I think about naming again... hmmm, maybe a more useful/memorable Solr parameter for this might be hl.bs.separatorChar ? I understand why you chose the name you did since it's based on a class name but users don't know/char about that. The "hl.bs" namespaces this param as related to the other break iterator related params, which this is. "Custom" is superfluous; we provide it directly via a parameter. Hmmm; do we even need hl.bs.type==CUSTOM if the user sets hl.bs.separatorChar ? I guess it should be set so that there is consistency in mapping the type to an algorithm. Though maybe use the value SEPARATOR ? And maybe just name this hl.bs.separator to open the door for possibly using an arbitrary length string in the future? Thus I propose hl.bs.type=SEPARATOR with e.g. hl.bs.separator=|
          Hide
          sarkaramrit2@gmail.com Amrit Sarkar added a comment -

          Alright Yes, +1 for the suggestion. I hate long worded arguments and just seperator is meaningful and apt. CustomSeparatorBreakIterator itself is naming the char 'seperator' in its implementation.

          Regarding using string as separator in near future, looking at the CustomSeparatorBreakIterator's code, it seems it can be done with not much effort. I will give it a shot soon.

          We should go with: hl.bs.type=SEPARATOR with e.g. hl.bs.separator=|

          Show
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - Alright Yes, +1 for the suggestion. I hate long worded arguments and just seperator is meaningful and apt. CustomSeparatorBreakIterator itself is naming the char ' seperator ' in its implementation. Regarding using string as separator in near future, looking at the CustomSeparatorBreakIterator's code, it seems it can be done with not much effort. I will give it a shot soon. We should go with: hl.bs.type=SEPARATOR with e.g. hl.bs.separator=|
          Hide
          dsmiley David Smiley added a comment -

          Here's an updated patch (same patch filename even though the file name references the old param). hl.bs.type=SEPARATOR and hl.bs.separator=# (or whatever) as discussed. Tests are running now. I moved it's location in HighlightParams to the right section too; it was wrong. And added CHANGES.txt. If all goes well I'll commit later tonight.

          Show
          dsmiley David Smiley added a comment - Here's an updated patch (same patch filename even though the file name references the old param). hl.bs.type=SEPARATOR and hl.bs.separator=# (or whatever) as discussed. Tests are running now. I moved it's location in HighlightParams to the right section too; it was wrong. And added CHANGES.txt. If all goes well I'll commit later tonight.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d1d73bfbea3db4adead960fae3597bec7647fba6 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1d73bf ]

          SOLR-10153: (and SOLR-10152): UH & PH: Add hl.bs.type=SEPARATOR with new param hl.bs.separator

          Show
          jira-bot ASF subversion and git services added a comment - Commit d1d73bfbea3db4adead960fae3597bec7647fba6 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1d73bf ] SOLR-10153 : (and SOLR-10152 ): UH & PH: Add hl.bs.type=SEPARATOR with new param hl.bs.separator
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit a607a2c6cfdeb191b3da4474e87d4242b1270fd1 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a607a2c ]

          SOLR-10153: (and SOLR-10152): UH & PH: Add hl.bs.type=SEPARATOR with new param hl.bs.separator

          (cherry picked from commit d1d73bf)

          Show
          jira-bot ASF subversion and git services added a comment - Commit a607a2c6cfdeb191b3da4474e87d4242b1270fd1 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a607a2c ] SOLR-10153 : (and SOLR-10152 ): UH & PH: Add hl.bs.type=SEPARATOR with new param hl.bs.separator (cherry picked from commit d1d73bf)
          Hide
          sarkaramrit2@gmail.com Amrit Sarkar added a comment -

          David Smiley LUCENE-7729 has incorporated string type separator for CustomSeparatorBreakIterator as discussed. I have linked the issue to this JIRA itself and patch can be applied on current updated branch. Thanks for committing.

          Show
          sarkaramrit2@gmail.com Amrit Sarkar added a comment - David Smiley LUCENE-7729 has incorporated string type separator for CustomSeparatorBreakIterator as discussed. I have linked the issue to this JIRA itself and patch can be applied on current updated branch. Thanks for committing.

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              sarkaramrit2@gmail.com Amrit Sarkar
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development