  1. Lucene - Core
  2. LUCENE-5447

StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.6.1
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      StandardTokenizer should split each of the following sequences into two tokens, but instead it keeps each one intact and outputs it as a single token:

      "A::B"           (':' is in \p{Word_Break = MidLetter})
      "1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
      "A.:B"
      "A:.B"
      "1,,2"           (',' is in \p{Word_Break = MidNum})
      "1,.2"
      "1.,2"
      

      Unfortunately, the word break test data released with Unicode (e.g. for Unicode 6.3: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt), which is incorporated into versioned Lucene tests (e.g. WordBreakTestUnicode_6_3_0), doesn't cover these cases.
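
      The expected splits listed above follow from UAX #29 rules WB6/WB7 and WB11/WB12: a MidLetter, MidNum or MidNumLet character suppresses a break only when it stands alone between two letters or two digits. The following is a minimal sketch of that rule in plain Java, not the StandardTokenizer grammar or the attached patch. The class name MidCharBreakSketch, the WB enum and the ASCII-only classify() mapping are hypothetical, and the sketch ignores the rest of UAX #29 (Extend/Format, Katakana, ExtendNumLet and so on).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
final class MidCharBreakSketch {

    // Simplified Word_Break classes, restricted to what the issue discusses.
    enum WB { LETTER, NUMERIC, MID_LETTER, MID_NUM, MID_NUM_LET, OTHER }

    // Assumption: only ':', ',' and '.' are mapped to the Mid* classes here.
    static WB classify(char c) {
        if (Character.isLetter(c)) return WB.LETTER;
        if (Character.isDigit(c)) return WB.NUMERIC;
        if (c == ':') return WB.MID_LETTER;
        if (c == ',') return WB.MID_NUM;
        if (c == '.') return WB.MID_NUM_LET;
        return WB.OTHER;
    }

    static boolean joinsLetters(WB w) {
        return w == WB.MID_LETTER || w == WB.MID_NUM_LET;
    }

    static boolean joinsDigits(WB w) {
        return w == WB.MID_NUM || w == WB.MID_NUM_LET;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            WB w = classify(c);
            if (w == WB.LETTER || w == WB.NUMERIC) {
                cur.append(c);
                continue;
            }
            // A mid character is kept only if it is directly surrounded by
            // two letters (WB6/WB7) or two digits (WB11/WB12). A second
            // consecutive mid character therefore always forces a break.
            boolean keep = false;
            if (cur.length() > 0 && i + 1 < s.length()) {
                WB prev = classify(s.charAt(i - 1));
                WB next = classify(s.charAt(i + 1));
                keep = (prev == WB.LETTER && next == WB.LETTER && joinsLetters(w))
                    || (prev == WB.NUMERIC && next == WB.NUMERIC && joinsDigits(w));
            }
            if (keep) {
                cur.append(c);
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            tokens.add(cur.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A:B"));   // [A:B]
        System.out.println(tokenize("A::B"));  // [A, B]
        System.out.println(tokenize("1.5"));   // [1.5]
        System.out.println(tokenize("1,.2"));  // [1, 2]
    }
}
```

      Under these assumptions, a single mid character between matching neighbors stays inside the token, while every sequence from the description yields two tokens.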

      Attachments

        1. LUCENE-5447-take2.patch
          43 kB
          Steven Rowe
        2. LUCENE-5447.patch
          977 kB
          Steven Rowe
        3. LUCENE-5447.patch
          974 kB
          Steven Rowe
        4. LUCENE-5447-test.patch
          2 kB
          Steven Rowe

          People

            Assignee: Steven Rowe (sarowe)
            Reporter: Steven Rowe (sarowe)