Lucene - Core
  1. Lucene - Core
  2. LUCENE-4587

WordBreakSpellChecker treats bytes as chars

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Originally opened as SOLR-4115.

      1. LUCENE-4587.patch
        11 kB
        James Dyer
      2. LUCENE-4587.patch
        11 kB
        James Dyer

        Issue Links

          Activity

          Hide
          James Dyer added a comment -

          This patch fixes the issue and adds a random test.

          Show
          James Dyer added a comment - This patch fixes the issue and adds a random test.
          Hide
          Steve Rowe added a comment -

          +1, nice random test!

          Small suggestion:

          TestWordBreakSpellChecker.goodTestString() invalidates candidate combined terms with whitespace, but there are other whitespace chars than those you handle specifically. This would probably be faster and more complete:

          private static final Pattern WHITESPACE_PATTERN = Pattern.compile("\\s");
          
          private boolean goodTestString(String s) {
            return s.codePointCount(0, s.length()) >= 2 && ! WHITESPACE_PATTERN.matcher(s).find();
          }  
          
          Show
          Steve Rowe added a comment - +1, nice random test! Small suggestion: TestWordBreakSpellChecker.goodTestString() invalidates candidate combined terms with whitespace, but there are other whitespace chars than those you handle specifically. This would probably be faster and more complete: private static final Pattern WHITESPACE_PATTERN = Pattern.compile( "\\s" ); private boolean goodTestString( String s) { return s.codePointCount(0, s.length()) >= 2 && ! WHITESPACE_PATTERN.matcher(s).find(); }
          Hide
          James Dyer added a comment -

          Steven,

          Thank you for taking time to review this. You're right the regex is better but probably the 4 chars are ok as this mimics what MockTokenizer will split on.

          Working on this made me wonder if perhaps WordBreakSpellChecker itself could be made more useful for non-western languages if it was configurable to break/combine with/on characters other than the space. I have very little of a linguistic background so I'm not sure if there is a solid use-case for this or how would it work. My guess is it would be too complicated for now if even useful at all. But if anyone has thoughts in this direction I wouldn't mind hearing them.

          Show
          James Dyer added a comment - Steven, Thank you for taking time to review this. You're right the regex is better but probably the 4 chars are ok as this mimics what MockTokenizer will split on. Working on this made me wonder if perhaps WordBreakSpellChecker itself could be made more useful for non-western languages if it was configurable to break/combine with/on characters other than the space. I have very little of a linguistic background so I'm not sure if there is a solid use-case for this or how would it work. My guess is it would be too complicated for now if even useful at all. But if anyone has thoughts in this direction I wouldn't mind hearing them.
          Hide
          Robert Muir added a comment -

          one simple idea would be to split with BreakIterator.getWordInstance(Locale.ROOT)

          Show
          Robert Muir added a comment - one simple idea would be to split with BreakIterator.getWordInstance(Locale.ROOT)
          Hide
          Robert Muir added a comment -

          or rather recognize word boundaries. same with getCharacterInstance if you want "character" boundaries (e.g. not splitting on combining marks)

          Show
          Robert Muir added a comment - or rather recognize word boundaries. same with getCharacterInstance if you want "character" boundaries (e.g. not splitting on combining marks)
          Hide
          James Dyer added a comment -

          Updated patch with Steven's suggested change. I will commit this.

          Show
          James Dyer added a comment - Updated patch with Steven's suggested change. I will commit this.
          Hide
          James Dyer added a comment -

          committed.
          Trunk: r1418437
          4x: r1418440

          Thank you Andreas for reporting this!

          Show
          James Dyer added a comment - committed. Trunk: r1418437 4x: r1418440 Thank you Andreas for reporting this!
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1418440

          LUCENE-4587: fix WordBreakSpellChecker to handle non-latin characters

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1418440 LUCENE-4587 : fix WordBreakSpellChecker to handle non-latin characters
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1418437

          LUCENE-4587: fix WordBreakSpellChecker to handle non-latin characters

          Show
          Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1418437 LUCENE-4587 : fix WordBreakSpellChecker to handle non-latin characters
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1418440

          LUCENE-4587: fix WordBreakSpellChecker to handle non-latin characters

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1418440 LUCENE-4587 : fix WordBreakSpellChecker to handle non-latin characters

            People

            • Assignee:
              James Dyer
              Reporter:
              Andreas Hubold
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development