Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-4587

WordBreakSpellChecker treats bytes as chars

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Originally opened as SOLR-4115.

      1. LUCENE-4587.patch
        11 kB
        James Dyer
      2. LUCENE-4587.patch
        11 kB
        James Dyer

        Issue Links

          Activity

          Hide
          jdyer James Dyer added a comment -

          This patch fixes the issue and adds a random test.

          Show
          jdyer James Dyer added a comment - This patch fixes the issue and adds a random test.
          Hide
          steve_rowe Steve Rowe added a comment -

          +1, nice random test!

          Small suggestion:

          TestWordBreakSpellChecker.goodTestString() invalidates candidate combined terms with whitespace, but there are other whitespace chars than those you handle specifically. This would probably be faster and more complete:

          private static final Pattern WHITESPACE_PATTERN = Pattern.compile("\\s");
          
          private boolean goodTestString(String s) {
            return s.codePointCount(0, s.length()) >= 2 && ! WHITESPACE_PATTERN.matcher(s).find();
          }  
          
          Show
          steve_rowe Steve Rowe added a comment - +1, nice random test! Small suggestion: TestWordBreakSpellChecker.goodTestString() invalidates candidate combined terms with whitespace, but there are other whitespace chars than those you handle specifically. This would probably be faster and more complete: private static final Pattern WHITESPACE_PATTERN = Pattern.compile( "\\s" ); private boolean goodTestString( String s) { return s.codePointCount(0, s.length()) >= 2 && ! WHITESPACE_PATTERN.matcher(s).find(); }
          Hide
          jdyer James Dyer added a comment -

          Steven,

          Thank you for taking time to review this. You're right the regex is better but probably the 4 chars are ok as this mimics what MockTokenizer will split on.

          Working on this made me wonder if perhaps WordBreakSpellChecker itself could be made more useful for non-western languages if it was configurable to break/combine with/on characters other than the space. I have very little of a linguistic background so I'm not sure if there is a solid use-case for this or how would it work. My guess is it would be too complicated for now if even useful at all. But if anyone has thoughts in this direction I wouldn't mind hearing them.

          Show
          jdyer James Dyer added a comment - Steven, Thank you for taking time to review this. You're right the regex is better but probably the 4 chars are ok as this mimics what MockTokenizer will split on. Working on this made me wonder if perhaps WordBreakSpellChecker itself could be made more useful for non-western languages if it was configurable to break/combine with/on characters other than the space. I have very little of a linguistic background so I'm not sure if there is a solid use-case for this or how would it work. My guess is it would be too complicated for now if even useful at all. But if anyone has thoughts in this direction I wouldn't mind hearing them.
          Hide
          rcmuir Robert Muir added a comment -

          one simple idea would be to split with BreakIterator.getWordInstance(Locale.ROOT)

          Show
          rcmuir Robert Muir added a comment - one simple idea would be to split with BreakIterator.getWordInstance(Locale.ROOT)
          Hide
          rcmuir Robert Muir added a comment -

          or rather recognize word boundaries. same with getCharacterInstance if you want "character" boundaries (e.g. not splitting on combining marks)

          Show
          rcmuir Robert Muir added a comment - or rather recognize word boundaries. same with getCharacterInstance if you want "character" boundaries (e.g. not splitting on combining marks)
          Hide
          jdyer James Dyer added a comment -

          Updated patch with Steven's suggested change. I will commit this.

          Show
          jdyer James Dyer added a comment - Updated patch with Steven's suggested change. I will commit this.
          Hide
          jdyer James Dyer added a comment -

          committed.
          Trunk: r1418437
          4x: r1418440

          Thank you Andreas for reporting this!

          Show
          jdyer James Dyer added a comment - committed. Trunk: r1418437 4x: r1418440 Thank you Andreas for reporting this!
          Hide
          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1418440

          LUCENE-4587: fix WordBreakSpellChecker to handle non-latin characters

          Show
          commit-tag-bot Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1418440 LUCENE-4587 : fix WordBreakSpellChecker to handle non-latin characters
          Hide
          commit-tag-bot Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1418437

          LUCENE-4587: fix WordBreakSpellChecker to handle non-latin characters

          Show
          commit-tag-bot Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1418437 LUCENE-4587 : fix WordBreakSpellChecker to handle non-latin characters
          Hide
          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1418440

          LUCENE-4587: fix WordBreakSpellChecker to handle non-latin characters

          Show
          commit-tag-bot Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1418440 LUCENE-4587 : fix WordBreakSpellChecker to handle non-latin characters

            People

            • Assignee:
              jdyer James Dyer
              Reporter:
              ahubold Andreas Hubold
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development