  1. Lucene - Core
  2. LUCENE-5447

StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.6.1
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      StandardTokenizer should split each of the following sequences into two tokens, but instead it keeps each one intact and outputs it as a single token:

      "A::B"           (':' is in \p{Word_Break = MidLetter})
      "1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
      "A.:B"
      "A:.B"
      "1,,2"           (',' is in \p{Word_Break = MidNum})
      "1,.2"
      "1.,2"
      

      Unfortunately, the word break test data released with each Unicode version (e.g. for Unicode 6.3: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt), which is incorporated into versioned Lucene tests (e.g. WordBreakTestUnicode_6_3_0), doesn't cover these cases.
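
The expected behavior for the sequences listed above can be illustrated with a minimal, self-contained sketch. It is not Lucene's implementation (StandardTokenizer is generated from a JFlex grammar) and it covers only the UAX #29 rules relevant here: WB6/WB7 suppress a break around a single MidLetter or MidNumLet character flanked by letters, and WB11/WB12 do the same for a single MidNum or MidNumLet character flanked by digits. The class name WordBreakSketch and its helper methods are hypothetical, and the character classes are reduced to ':', '.', ',' plus Java's Character.isLetter/isDigit as a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT Lucene's StandardTokenizer.
class WordBreakSketch {

    // Assumption: only ':' (MidLetter) and '.' (MidNumLet) may join letters.
    static boolean isMidForLetter(char c) {
        return c == ':' || c == '.';
    }

    // Assumption: only ',' (MidNum) and '.' (MidNumLet) may join digits.
    static boolean isMidForDigit(char c) {
        return c == ',' || c == '.';
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int n = s.length();
        int i = 0;
        while (i < n) {
            // Skip anything that cannot start a token.
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            i++;
            while (i < n) {
                char prev = s.charAt(i - 1); // always a letter or digit here
                char c = s.charAt(i);
                if (Character.isLetterOrDigit(c)) {
                    i++; // letters and digits never break from each other
                    continue;
                }
                boolean hasNext = i + 1 < n;
                // WB6/WB7: letter, ONE mid character, letter.
                if (hasNext && Character.isLetter(prev) && isMidForLetter(c)
                        && Character.isLetter(s.charAt(i + 1))) {
                    i += 2;
                    continue;
                }
                // WB11/WB12: digit, ONE mid character, digit.
                if (hasNext && Character.isDigit(prev) && isMidForDigit(c)
                        && Character.isDigit(s.charAt(i + 1))) {
                    i += 2;
                    continue;
                }
                break; // two consecutive mid characters end the token
            }
            tokens.add(s.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {"A:B", "1.2", "A::B", "1..2", "A.:B", "1,.2"};
        for (String in : inputs) {
            System.out.println(in + " -> " + tokenize(in));
        }
    }
}
```

Under these assumptions, a single mid character between two letters or two digits stays inside the token ("A:B", "1.2"), while any run of two mid characters forces a break, which is the behavior this issue asks StandardTokenizer to adopt.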

        Attachments

        1. LUCENE-5447-test.patch
          2 kB
          Steve Rowe
        2. LUCENE-5447.patch
          974 kB
          Steve Rowe
        3. LUCENE-5447.patch
          977 kB
          Steve Rowe
        4. LUCENE-5447-take2.patch
          43 kB
          Steve Rowe

            People

            • Assignee: Steve Rowe (steve_rowe)
            • Reporter: Steve Rowe (steve_rowe)