Lucene - Core / LUCENE-5447

StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.6.1
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      StandardTokenizer should split all of the following sequences into two tokens each, but they are all instead kept intact and output as single tokens:

      "A::B"           (':' is in \p{Word_Break = MidLetter})
      "1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
      "A.:B"
      "A:.B"
      "1,,2"           (',' is in \p{Word_Break = MidNum})
      "1,.2"
      "1.,2"
      

      Unfortunately, the word break test data released with Unicode, e.g. for Unicode 6.3 http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt, and incorporated into a versioned Lucene test, e.g. WordBreakTestUnicode_6_3_0, doesn't cover these cases.
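The expected behavior for the sequences listed in the description can be approximated with a small, self-contained sketch. This is not Lucene's JFlex grammar: it is a hypothetical regex (the class name `RegexWordBreakSketch` and the ASCII-only character classes are assumptions made for illustration) that joins two alphanumeric runs only across a single mid character that sits directly between two letters (':' or '.') or two digits (',' or '.').

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical ASCII-only approximation of the UAX #29 mid-character rules.
// It ignores Extend/Format, ExtendNumLet, Katakana and all non-ASCII input.
class RegexWordBreakSketch {
    // A token is a run of letters/digits, optionally continued across exactly
    // ONE mid character when that character is directly between two letters
    // (':' or '.') or directly between two digits (',' or '.').
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z0-9]+(?:(?:(?<=[A-Za-z])[:.](?=[A-Za-z])"
        + "|(?<=[0-9])[.,](?=[0-9]))[A-Za-z0-9]+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "A::B", "1..2", "A..B", "A.:B", "A:.B", "1,,2", "1,.2", "1.,2",
            "A:B", "1,2"
        };
        for (String in : inputs) {
            System.out.println(in + " -> " + tokenize(in));
        }
    }
}
```

Under these assumptions every sequence from the description splits into two tokens, while a single mid character between like neighbors ("A:B", "1,2") keeps the token intact.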

      1. LUCENE-5447.patch
        977 kB
        Steve Rowe
      2. LUCENE-5447.patch
        974 kB
        Steve Rowe
      3. LUCENE-5447-take2.patch
        43 kB
        Steve Rowe
      4. LUCENE-5447-test.patch
        2 kB
        Steve Rowe

        Activity

        Steve Rowe added a comment -

        Patch with tests that demonstrate the problem

        Steve Rowe added a comment -

        Patch fixing the issue; includes LUCENE-5447-test.patch.

        Committing shortly.

        Steve Rowe added a comment -

        Final patch, adding a test for UAX29URLEmailTokenizer and a CHANGES.txt entry.

        ASF subversion and git services added a comment -

        Commit 1569586 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1569586 ]

        LUCENE-5447: StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet

        ASF subversion and git services added a comment -

        Commit 1569601 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1569601 ]

        LUCENE-5447: StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet (merged trunk r1569586)

        Steve Rowe added a comment -

        Committed to trunk, branch_4x and lucene_solr_4_7.

        Steve Rowe added a comment -

        In looking at the committed diffs (when JIRA was down last night and earlier today, the lucene_solr_4_7 commit didn't put a comment on this issue, which sucks), I see that I didn't fully patch StandardTokenizerImpl.jflex, although I did correctly patch UAX29URLEmailTokenizerImpl, which is basically a superset of StandardTokenizerImpl.jflex.

        I've added some more tests to show the problem (existing tests didn't fail), patch forthcoming. Here's an example that should be split by StandardTokenizer but isn't currently - the issue is triggered via a preceding char matching Word_Break = ExtendNumLet, e.g. the underscore character:

        A:B_A::B <- left intact, but should output "A:B_A", "B"

        By contrast, the current UAX29URLEmailTokenizer gets the above right.
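The ExtendNumLet case above can be traced with a boundary-by-boundary sketch of the UAX #29 word break rules WB5 through WB13b. This is an illustration under stated assumptions, not the JFlex scanner specification: the class `WordBreakRuleSketch`, its ASCII-only character classes, and the choice to drop segments that contain no letter or digit are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ASCII-only model of UAX #29 rules WB5-WB13b (no Extend/Format,
// no Katakana). Segments without a letter or digit are dropped (an assumption).
class WordBreakRuleSketch {
    enum Cls { LETTER, DIGIT, MID_LETTER, MID_NUM, MID_NUM_LET, EXTEND_NUM_LET, OTHER }

    static Cls cls(String s, int i) {
        if (i < 0 || i >= s.length()) return Cls.OTHER;
        char c = s.charAt(i);
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return Cls.LETTER;
        if (c >= '0' && c <= '9') return Cls.DIGIT;
        switch (c) {
            case ':': return Cls.MID_LETTER;
            case ',': return Cls.MID_NUM;
            case '.': return Cls.MID_NUM_LET;
            case '_': return Cls.EXTEND_NUM_LET;
            default:  return Cls.OTHER;
        }
    }

    static boolean alnum(Cls c) { return c == Cls.LETTER || c == Cls.DIGIT; }
    static boolean midForLetters(Cls c) { return c == Cls.MID_LETTER || c == Cls.MID_NUM_LET; }
    static boolean midForDigits(Cls c) { return c == Cls.MID_NUM || c == Cls.MID_NUM_LET; }

    // True when there is NO break between s[i-1] and s[i].
    static boolean joins(String s, int i) {
        Cls before = cls(s, i - 2), a = cls(s, i - 1), b = cls(s, i), after = cls(s, i + 1);
        if (alnum(a) && alnum(b)) return true;                                            // WB5, WB8-WB10
        if (a == Cls.LETTER && midForLetters(b) && after == Cls.LETTER) return true;      // WB6
        if (before == Cls.LETTER && midForLetters(a) && b == Cls.LETTER) return true;     // WB7
        if (before == Cls.DIGIT && midForDigits(a) && b == Cls.DIGIT) return true;        // WB11
        if (a == Cls.DIGIT && midForDigits(b) && after == Cls.DIGIT) return true;         // WB12
        if ((alnum(a) || a == Cls.EXTEND_NUM_LET) && b == Cls.EXTEND_NUM_LET) return true; // WB13a
        if (a == Cls.EXTEND_NUM_LET && alnum(b)) return true;                             // WB13b
        return false;                                                                     // WB14
    }

    static boolean hasAlnum(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (alnum(cls(s, i))) return true;
        }
        return false;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            if (i == s.length() || !joins(s, i)) {
                String segment = s.substring(start, i);
                if (hasAlnum(segment)) tokens.add(segment);
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A:B_A::B"));
    }
}
```

In this model the underscore attaches to the letters on both sides through WB13a and WB13b, and the break still falls at the doubled colon because WB6 requires a letter immediately after the first ':'.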

        In the JFlex 1.5.0 release, I added the ability to include external files into the rules section of the scanner specification, and I want to take advantage of this to refactor StandardTokenizer and UAX29URLEmailTokenizer so that there is only one definition of the shared rules. (That would have prevented the problem for which I'm reopening this issue.) I'll make a separate issue for that.

        Robert Muir added a comment -

        A random question here Steve, is it possible to add this test to the unicode tests and send upstream? or is it already fixed in recent versions?

        Steve Rowe added a comment -

        > random question here Steve, is it possible to add this test to the unicode tests and send upstream? or is it already fixed in recent versions?

        Good idea, I'll check if it's already fixed, and if not, send upstream.

        Steve Rowe added a comment -

        Patch with more tests illustrating the StandardTokenizerImpl problem, along with the scanner specification fix.

        Committing shortly.

        ASF subversion and git services added a comment -

        Commit 1569831 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1569831 ]

        LUCENE-5447: Fully patch StandardTokenizerImpl.jflex, to bring parity with rules in UAX29URLEmailTokenizerImpl.jflex; add tests that fail without this fix and succeed with it.

        ASF subversion and git services added a comment -

        Commit 1569849 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1569849 ]

        LUCENE-5447: Fully patch StandardTokenizerImpl.jflex, to bring parity with rules in UAX29URLEmailTokenizerImpl.jflex; add tests that fail without this fix and succeed with it. (merged trunk r1569831)

        ASF subversion and git services added a comment -

        Commit 1569855 from Steve Rowe in branch 'dev/branches/lucene_solr_4_7'
        [ https://svn.apache.org/r1569855 ]

        LUCENE-5447: Fully patch StandardTokenizerImpl.jflex, to bring parity with rules in UAX29URLEmailTokenizerImpl.jflex; add tests that fail without this fix and succeed with it. (merged branch_4x r1569849)

        Steve Rowe added a comment -

        Committed the new tests and the fully patched StandardTokenizerImpl.jflex to trunk, branch_4x and lucene_solr_4_7.

        Steve Rowe added a comment -

        > random question here Steve, is it possible to add this test to the unicode tests and send upstream? or is it already fixed in recent versions?

        > Good idea, I'll check if it's already fixed, and if not, send upstream.

        It's not fixed in recent versions - in fact the proposed 7.0 version is exactly the same as the 6.3.0 version, with the exception of the header.

        I converted the tests to the WordBreakTest.txt format and submitted them (along with an explanation pointing to this issue) through the Unicode.org contact form at http://www.unicode.org/reporting.html.
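For readers unfamiliar with the WordBreakTest.txt format mentioned above: each test line lists the code points of a sample string in hexadecimal, with ÷ marking a permitted break and × marking a forbidden one. The sketch below renders such a line from a list of expected segments. The class name `WordBreakTestLine` is hypothetical, the trailing `#` comment that the real file carries on each line is omitted, and the segment lists in `main` are illustrative rather than the tests that were actually submitted.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that renders expected segments in the style of a
// WordBreakTest.txt data line (without the trailing comment).
class WordBreakTestLine {
    static String format(List<String> segments) {
        // U+00F7 (division sign) marks a break, U+00D7 (multiplication sign) marks no break.
        StringBuilder sb = new StringBuilder("\u00F7");
        for (String segment : segments) {
            boolean first = true;
            for (int i = 0; i < segment.length(); ) {
                int cp = segment.codePointAt(i);
                sb.append(first ? " " : " \u00D7 ");
                sb.append(String.format("%04X", cp));
                first = false;
                i += Character.charCount(cp);
            }
            sb.append(" \u00F7");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "A::B": every boundary is a break, so each character is its own segment.
        System.out.println(format(Arrays.asList("A", ":", ":", "B")));
        // "A:B": a single MidLetter between two letters does not break.
        System.out.println(format(Arrays.asList("A:B")));
    }
}
```

Note that the segmentation rules treat each punctuation character in "A::B" as its own segment, whereas a tokenizer such as StandardTokenizer would not emit those punctuation-only segments as tokens.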

        Robert Muir added a comment -

        Thanks Steve!

        Steve Rowe added a comment -

        > In the JFlex 1.5.0 release, I added the ability to include external files into the rules section of the scanner specification, and I want to take advantage of this to refactor StandardTokenizer and UAX29URLEmailTokenizer so that there is only one definition of the shared rules. (That would have prevented the problem for which I'm reopening this issue.) I'll make a separate issue for that.

        See LUCENE-5464

        Steve Rowe added a comment -

        > I converted the tests to the WordBreakTest.txt format and submitted them (along with an explanation pointing to this issue) through the Unicode.org contact form at http://www.unicode.org/reporting.html.

        The message I sent is now recorded as the second email in the feedback for Proposed Update UAX #29, Unicode Text Segmentation: http://www.unicode.org/review/pri265/

        Steve Rowe added a comment -

        > I converted the tests to the WordBreakTest.txt format and submitted them (along with an explanation pointing to this issue) through the Unicode.org contact form at http://www.unicode.org/reporting.html.

        > The message I sent is now recorded as the second email in the feedback for Proposed Update UAX #29, Unicode Text Segmentation: http://www.unicode.org/review/pri265/

        The Unicode Technical Committee emailed me today to tell me that they would be adding test cases for this problem to Unicode 8.0, but not to the upcoming 7.0 release.


          People

          • Assignee:
            Steve Rowe
            Reporter:
            Steve Rowe
          • Votes:
            0
            Watchers:
            3
