Lucene - Core / LUCENE-2847

Support all of unicode in StandardTokenizer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      StandardTokenizer currently only supports the BMP.

      If it encounters characters outside of the BMP, it just discards them.
      It should instead fully implement UAX#29 across all of Unicode.
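
To make the problem concrete: in Java, a character outside the BMP is stored as a surrogate pair of two `char` values, so a scanner that only understands single 16-bit units has nothing sensible to do with either half. The following is a minimal, standalone sketch using only `java.lang`; the class and method names (`SupplementaryDemo`, `countSupplementary`, `dropNonBmp`) are made up for illustration and are not part of Lucene, and `dropNonBmp` only imitates the symptom described above rather than reproducing the tokenizer's actual code path.

```java
final class SupplementaryDemo {
    /** Counts code points above U+FFFF, i.e. outside the BMP. */
    static int countSupplementary(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return count;
    }

    /** Mimics a BMP-only scanner by dropping every surrogate code unit (hypothetical helper). */
    static String dropNonBmp(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isSurrogate(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
        String text = "a" + new String(Character.toChars(0x1D11E)) + "b";
        System.out.println(text.length());                         // 4
        System.out.println(text.codePointCount(0, text.length())); // 3
        System.out.println(countSupplementary(text));              // 1
        System.out.println(dropNonBmp(text));                      // ab
    }
}
```

The string has three code points but four `char` values, which is why a grammar written purely in terms of 16-bit characters cannot classify the G clef at all.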

      Attachments

      1. LUCENE-2847.patch (329 kB, Steve Rowe)
      2. LUCENE-2847.patch (327 kB, Steve Rowe)
      3. LUCENE-2847.patch (45 kB, Robert Muir)

          Activity

          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Steve Rowe added a comment - edited

          > I think the added files need svn:eol-style=native?
          > Also, I think we should add an ASL2 license to the generated macros?
          > I noticed the TLD generator does this, but I forgot to do it here.

          Done: trunk: r1056014, branch_3x: r1056030

          Robert Muir added a comment -

          Thanks for taking care of this!

          I think the added files need svn:eol-style=native?
          Also, I think we should add an ASL2 license to the generated macros?
          I noticed the TLD generator does this, but I forgot to do it here.

          Steve Rowe added a comment -

          Committed to trunk: r1055877, branch_3x: r1055904.

          Steve Rowe added a comment -

          > I think we should commit your latest patch.

          OK, I'll commit shortly.

          Robert Muir added a comment -

          > How far would you go with this tools consolidation? All tools across the whole of Scenolunr? Or just the ones under modules/analysis/?

          I just meant under the analyzers module... but let's leave this be; I also forgot we have no analyzers module in 3.x.

          I think we should commit your latest patch.

          Steve Rowe added a comment -

          > We could also consolidate tools, because in general I would rather all the analyzers be consolidated; they are only split up due to dependencies/large files etc. But tools are different, it's just to assist the build.

          How far would you go with this tools consolidation? All tools across the whole of Scenolunr? Or just the ones under modules/analysis/?

          Steve Rowe added a comment -

          Removed the WARNING from the UAX29URLEmailTokenizer class javadocs about Unicode supplementary character non-coverage.

          Steve Rowe added a comment - edited

          New patch, with the following changes:

          1. Added a new target gen-uax29-supp-macros to modules/analysis/icu/build.xml, and a <subant> call to it from the jflex task in modules/analysis/common/build.xml.
          2. Included SUPPLEMENTARY.jflex-macro in UAX29URLEmailTokenizer.jflex in the same way as it is included in StandardTokenizer.jflex.
          3. Copied the simple supplementary characters test from TestStandardAnalyzer.java to TestUAX29URLEmailTokenizer.java.
          4. Modified the CHANGES.txt entry for the UAX#29 issues to include a reference to this issue.

          All tests pass.
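
For readers unfamiliar with the generated macro file: the idea, as far as can be told from this thread, is that the grammar gets supplementary characters spelled out as explicit surrogate pairs, grouped by lead surrogate. The sketch below shows one way such a macro could be produced. It is not the actual GenerateJFlexSupplementaryMacros code (which lives under modules/analysis/icu and is not shown here); the class `SuppMacroSketch`, the method `macro`, the macro name `ExampleSupp`, and the exact output layout are all assumptions made for illustration, and the code points are hard-coded rather than derived from Unicode property data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class SuppMacroSketch {
    /** Builds a JFlex-style macro matching the given supplementary code points. */
    static String macro(String name, int[] codePoints) {
        // Group trail (low) surrogates under their lead (high) surrogate, sorted.
        Map<Character, TreeSet<Character>> byLead = new TreeMap<>();
        for (int cp : codePoints) {
            if (!Character.isSupplementaryCodePoint(cp)) {
                throw new IllegalArgumentException("not supplementary: U+" + Integer.toHexString(cp));
            }
            byLead.computeIfAbsent(Character.highSurrogate(cp), k -> new TreeSet<>())
                  .add(Character.lowSurrogate(cp));
        }
        List<String> alternatives = new ArrayList<>();
        for (Map.Entry<Character, TreeSet<Character>> e : byLead.entrySet()) {
            StringBuilder cls = new StringBuilder();
            int start = -1;
            int prev = -1;
            for (char low : e.getValue()) {
                if (start < 0) {
                    start = low;
                } else if (low != prev + 1) {
                    appendRange(cls, start, prev); // close the previous contiguous run
                    start = low;
                }
                prev = low;
            }
            appendRange(cls, start, prev);
            alternatives.add("([" + esc(e.getKey()) + "][" + cls + "])");
        }
        return name + " = (\n\t  " + String.join("\n\t| ", alternatives) + "\n)\n";
    }

    /** Appends a single escaped unit, or an escaped first-last range. */
    private static void appendRange(StringBuilder sb, int first, int last) {
        sb.append(esc(first));
        if (last > first) {
            sb.append('-').append(esc(last));
        }
    }

    /** Formats a UTF-16 code unit as a backslash-u escape with four hex digits. */
    private static String esc(int unit) {
        return String.format("\\u%04X", unit);
    }

    public static void main(String[] args) {
        System.out.print(macro("ExampleSupp", new int[] {0x1D11E, 0x1D11F, 0x1D121, 0x10000}));
    }
}
```

Grouping by lead surrogate keeps the output compact: each alternative is one lead unit followed by a character class of trail units, with contiguous trail units collapsed into ranges.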

          Robert Muir added a comment -

          > If we add a target in modules/analysis/icu/build.xml to run GenerateJFlexSupplementaryMacros#main(), maybe named gen-stdtok-supp-macros, the jflex target in modules/analysis/common/build.xml could use a <subant> to call it and auto-generate SUPPLEMENTARY.jflex-macro, no?

          Yeah, I think we could do something like this. We could also consolidate tools, because in general I would rather all the analyzers be consolidated; they are only split up due to dependencies/large files etc. But tools are different, it's just to assist the build.

          Steve Rowe added a comment -

          JFlex generates fine, everything compiles, all tests pass.

          If we add a target in modules/analysis/icu/build.xml to run GenerateJFlexSupplementaryMacros#main(), maybe named gen-stdtok-supp-macros, the jflex target in modules/analysis/common/build.xml could use a <subant> to call it and auto-generate SUPPLEMENTARY.jflex-macro, no?

          Robert Muir added a comment -

          Here's a patch... I added a simple test.

          I'm sure it can be beautified etc.


            People

            • Assignee: Steve Rowe
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 0
