Lucene - Core / LUCENE-2911 (sub-task of LUCENE-2906: Filter to process output of ICUTokenizer and create overlapping bigrams for CJK)

Synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, and ICUTokenizer; add CJK types.

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      I'd like to do LUCENE-2906 (better CJK support for these tokenizers) for a future target such as 3.2.

      But in 3.1 I would like to do a little cleanup first and synchronize the grammars and token types across these tokenizers.

      1. LUCENE-2911.patch
        18 kB
        Robert Muir
      2. LUCENE-2911.patch
        17 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        After applying the patch, you have to run 'ant jflex' from modules/analysis/common and 'ant genrbbi' from modules/analysis/icu to regenerate the generated sources.

        Steve Rowe added a comment -

        The generated top-level domain macro file has a bunch of new entries when I run this, but these are not included in your patch, and I think we should keep this list up-to-date.

        The patch is missing HangulSupp macro generation in modules/icu/src/tools/.../GenerateJFlexSupplementaryMacros.java, but since the Hangul macro is not used in the jflex grammar, this doesn't cause a problem.

        It would be nice to remove the hard-coded ranges for the intersection of Hangul & ALetter, but when I tried to use JFlex negation and union to produce the equivalent, memory usage exploded and I couldn't get JFlex to generate, so I guess we'll have to wait on native JFlex supplementary character support before we can change it.
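
        The set identity behind that attempt, an intersection expressed using only negation and union (A AND B = NOT(NOT A OR NOT B)), can be illustrated outside JFlex with a minimal Java sketch. This is not the JFlex grammar and says nothing about JFlex's memory behavior. It assumes Character.UnicodeScript.HANGUL as a stand-in for the Hangul set and Character.isLetter as a rough stand-in for ALetter; neither is the UAX #29 property the tokenizers use, and the class and method names are hypothetical.

```java
import java.util.BitSet;

/** Illustrative only: intersection of two code point sets built from complement and union. */
class DeMorganIntersection {
    /** Number of Unicode code points (U+0000..U+10FFFF). */
    static final int UNIVERSE = Character.MAX_CODE_POINT + 1;

    /** Complement of a set relative to the full code point range; input is not modified. */
    static BitSet complement(BitSet set) {
        BitSet result = (BitSet) set.clone();
        result.flip(0, UNIVERSE);
        return result;
    }

    /** Union of two sets; inputs are not modified. */
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    /** A AND B computed as NOT(NOT A OR NOT B), using only complement and union. */
    static BitSet intersectViaDeMorgan(BitSet a, BitSet b) {
        return complement(union(complement(a), complement(b)));
    }

    /** Assumed stand-in for the Hangul set: code points whose script is HANGUL. */
    static BitSet hangul() {
        BitSet set = new BitSet(UNIVERSE);
        for (int cp = 0; cp < UNIVERSE; cp++) {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HANGUL) {
                set.set(cp);
            }
        }
        return set;
    }

    /** Assumed rough stand-in for ALetter (NOT the real UAX #29 property). */
    static BitSet letters() {
        BitSet set = new BitSet(UNIVERSE);
        for (int cp = 0; cp < UNIVERSE; cp++) {
            if (Character.isLetter(cp)) {
                set.set(cp);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet a = hangul();
        BitSet b = letters();

        // Direct intersection, used here only to check the De Morgan form.
        BitSet direct = (BitSet) a.clone();
        direct.and(b);

        BitSet viaDeMorgan = intersectViaDeMorgan(a, b);
        System.out.println("same result: " + direct.equals(viaDeMorgan));
        System.out.println("code points in intersection: " + viaDeMorgan.cardinality());
    }
}
```

        Computing the result once and hard-coding the resulting ranges, as the patch does, avoids asking the generator to evaluate the complemented sets.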

        Robert Muir added a comment -

        > The generated top-level domain macro file has a bunch of new entries when I run this, but these are not included in your patch, and I think we should keep this list up-to-date.

        Yeah, I would re-run it before committing. In general I didn't regenerate, so that you wouldn't see a lot of generated differences in the patch.

        > The patch is missing HangulSupp macro generation in modules/icu/src/tools/.../GenerateJFlexSupplementaryMacros.java, but since the Hangul macro is not used in the jflex grammar, this doesn't cause a problem.

        Oh, I did actually mean to include this, sorry, I forgot. It's a one-liner though, so I can include it easily.

        Robert Muir added a comment -

        Improved the patch by using a simpler De Morgan expression Steven came up with.

        I think this one is ready to commit.

        Steve Rowe added a comment -

        > I think this one is ready to commit.

        +1

        I applied the patch, JFlex generates properly, and the tests pass.

        Robert Muir added a comment -

        Committed revision 1068979. Now backporting...

        Robert Muir added a comment -

        Committed revision 1068997 to branch_3x.

        Thanks Steven!

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir