Lucene - Core / LUCENE-5357

Upgrade StandardTokenizer & co to latest unicode rules

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Besides any changes in the data, the word break rules themselves have also changed (regional indicators, better handling for Hebrew, etc.).

      1. LUCENE-5357.patch
        1.06 MB
        Steve Rowe

        Activity

        Steve Rowe added a comment - - edited

        Tasks include:

        • Update the TLDs acceptable in URLs and Emails (for UAX29URLEmailTokenizer) from the latest IANA Root Zone Database, using ant gen-tlds. Test data files referring to obsolete TLDs will need to be updated to use current TLDs: ((email.addresses,urls).from.)random.text.with.(email.address,urls).txt.
        • Update the icu module's GenerateJFlexSupplementaryMacros.java to include supplementary character additions to JFlex grammars for new character classes [:WordBreak=Single_Quote:], [:WordBreak=Double_Quote:], [:WordBreak=Hebrew_Letter:] and [:WordBreak=Regional_Indicator:].
        • Update the JFlex grammars to Unicode 6.3
          • Change the version in the %unicode directive in the grammar: %unicode 6.1 -> %unicode 6.3
          • Change all JFlex grammars that use ". | <newline>" to mean "any character" to instead use [^], since JFlex's "." now excludes all Unicode newline chars, rather than just \n, to comply with the Unicode Regular Expressions standard, UTS#18.
          • Upgrade the UAX#29-based grammars to the Unicode 6.3 word break rules, in StandardTokenizerImpl.jflex and UAX29URLEmailTokenizer.jflex.
        • Regenerate the JFlex scanners in lucene/analysis/common/ via ant jflex.
        • Test the new scanners against the Unicode 6.3 word break test data
          • Update generateJavaUnicodeWordBreakTest.pl to handle above-BMP characters in the Unicode character database's ucd/auxiliary/WordBreakTest.txt (previous Unicode versions included only BMP characters in that file).
          • Using generateJavaUnicodeWordBreakTest.pl, generate WordBreakTestUnicode_6_3_0.java under modules/analysis/common/src/test/org/apache/lucene/analysis/core/.
          • Update TestStandardAnalyzer.java and TestUAX29URLEmailTokenizer.java to invoke WordBreakTestUnicode_6_3_0 rather than WordBreakTestUnicode_6_1_0.
          • Remove WordBreakTestUnicode_6_1_0.java.
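
The "." versus [^] point in the task list can be illustrated with a rough analogy in plain Java. This sketch uses java.util.regex rather than JFlex, so it only approximates the behaviour described above: Java's "." (without DOTALL) also skips Unicode line terminators such as U+2028, and since Java has no [^] syntax, the class [\s\S] is used here as an assumed stand-in for "any character". The class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

// Illustration only: java.util.regex is used as an analogy, this is not JFlex.
class DotVersusAnyChar {
    // "." without DOTALL matches any character except a line terminator.
    private static final Pattern DOT = Pattern.compile(".");
    // Stand-in for JFlex's [^]: whitespace or non-whitespace, i.e. any character.
    private static final Pattern ANY = Pattern.compile("[\\s\\S]");

    static boolean dotMatches(int codePoint) {
        return DOT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean anyMatches(int codePoint) {
        return ANY.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // A letter, then LF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
        int[] samples = {'a', '\n', '\r', 0x0085, 0x2028, 0x2029};
        for (int cp : samples) {
            System.out.printf("U+%04X  dot=%b  any=%b%n", cp, dotMatches(cp), anyMatches(cp));
        }
    }
}
```

A grammar rule that relied on ". | <newline>" to catch everything would, by the same logic, silently miss characters like U+2028 once "." stops matching them, which is why the explicit "any character" class is the safer spelling.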

        Additional task for the 4.x backport:

        • Version the JFlex grammars:
          • Copy the current implementations to *Impl40 (where 40=>4.0 is the version in which the Unicode 6.1 versions of these scanners were introduced).
          • Cause the versioning tokenizer wrappers to instantiate this version when the Version c-tor param is in the range 4.0 to 4.6.
          • Change the specified Unicode version in the non-versioned JFlex grammars from 6.1 to 6.3.
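
The versioning scheme for the backport can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the code in the patch: the Scanner interface, the two implementation classes and the select method are hypothetical stand-ins for the generated JFlex scanners and the tokenizer wrappers, and the version is passed as plain major/minor integers rather than Lucene's Version type.

```java
// Hypothetical stand-ins throughout; none of these are Lucene's actual classes.
class VersionedScannerChoice {

    /** Stands in for a generated JFlex scanner. */
    interface Scanner {
        String unicodeVersion();
    }

    /** Frozen copy of the scanner as it behaved from 4.0 through 4.6. */
    static final class ScannerImpl40 implements Scanner {
        public String unicodeVersion() { return "6.1"; }
    }

    /** Current, non-versioned scanner regenerated from the updated grammar. */
    static final class ScannerImpl implements Scanner {
        public String unicodeVersion() { return "6.3"; }
    }

    /** Picks the scanner matching the version the caller asked for. */
    static Scanner select(int major, int minor) {
        if (major < 4) {
            // Scanners for versions before 4.0 are outside the scope of this sketch.
            throw new UnsupportedOperationException("pre-4.0 versions not modelled here");
        }
        if (major == 4 && minor <= 6) {
            return new ScannerImpl40();   // 4.0 to 4.6: keep Unicode 6.1 behaviour
        }
        return new ScannerImpl();         // 4.7 and later: Unicode 6.3 behaviour
    }

    public static void main(String[] args) {
        System.out.println("4.6 -> Unicode " + select(4, 6).unicodeVersion());
        System.out.println("4.7 -> Unicode " + select(4, 7).unicodeVersion());
    }
}
```

The point of keeping a frozen copy is that an index built with the older tokenization can still be queried with the same token boundaries, as long as the caller keeps passing the older version.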
        Steve Rowe added a comment -

        Patch against trunk handling all above-described pre-4.x-backport tasks.

        Also, I was able to get rid of the workarounds in lucene/analysis/common/build.xml removing the time stamp and the InputStream constructors from generated JFlex scanners, because JFlex itself has fixed these things.

        I think it's ready to go.

        Uwe Schindler added a comment -

        because JFlex itself has fixed these things.

        How about a release? It would be so great if it could stay in Maven and could be invoked via ivy-cachepath.

        Steve Rowe added a comment -

        How about a release?

        I'm working on it, almost done.

        ASF subversion and git services added a comment -

        Commit 1548595 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1548595 ]

        LUCENE-5357: Upgrade StandardTokenizer and UAX29URLEmailTokenizer to Unicode 6.3; update UAX29URLEmailTokenizer's recognized top level domains in URLs and Emails from the IANA Root Zone Database.

        ASF subversion and git services added a comment -

        Commit 1548746 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1548746 ]

        LUCENE-5357: Sync small change (/Katakana/ => /Katakana [x ExtendNumLet] x Katakana/ in the <ALPHANUM> pattern) from UAX29URLEmailTokenizerImpl.jflex to StandardTokenizerImpl.jflex

        ASF subversion and git services added a comment -

        Commit 1548762 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1548762 ]

        LUCENE-5357: Upgrade StandardTokenizer and UAX29URLEmailTokenizer to Unicode 6.3; update UAX29URLEmailTokenizer's recognized top level domains in URLs and Emails from the IANA Root Zone Database; add std40/StandardTokenizerImpl40 and std40/UAX29URLEmailTokenizerImpl40, for backcompat from 4.0->4.6. (merged trunk r1548595 and r1548746)

        Steve Rowe added a comment -

        Committed to trunk and backported to branch_4x.

        Robert Muir added a comment -

        This patch looks great! Thanks for taking care of this.

        I'm sorry I couldn't review it earlier. I had a power surge and some connection difficulties over the last few days.

        Steve Rowe added a comment -

        No problem Robert, thanks for taking a look.

        About back-compat: none of the JFlex-based tokenizers on trunk have version-based behavior at this point, in contrast to branch_4x. It could be argued that this is because all previous back-compat versions were for 3.X, but this issue introduced a 4.0 version, which puts it within the version X-1 window for trunk/5.0. Should I forward-port the 4.0 back-compat stuff from branch_4x for StandardTokenizer and UAX29URLEmailTokenizer? There are other analysis components on trunk that do different things based on version, so clearly the practice has not been abandoned on trunk.

        Robert Muir added a comment -

        About back-compat: none of the JFlex-based tokenizers on trunk have version-based behavior at this point, in contrast to branch_4x

        I would love it if all these constants/parameters were completely removed in trunk. If you look at the mailing lists, it's obvious that users don't understand it at all. I don't know how index back compat got perverted into such a thing that made all the analysis APIs ugly and overcomplicated.

        This stuff all hurts the project far more than any benefit it brings to the rare few that understand it. I think it should be removed everywhere.


          People

          • Assignee:
            Steve Rowe
            Reporter:
            Robert Muir