Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      ICU will release a new version in about a month.

      They have a version for testing (http://site.icu-project.org/download/milestone) already out with some interesting features, e.g. dictionary-based CJK segmentation.

      This issue is just to test it out/integrate the new stuff/etc. We should try out the automation Steve did as well.

      1. LUCENE-4381.patch (58 kB, Robert Muir)
      2. LUCENE-4381.patch (21 kB, Robert Muir)

          Activity

          Robert Muir added a comment -

          A hacked up patch for testing:

          I think it's nice to offer the CJK dictionary-based stuff as an option? I'm not sure how good results will be on average yet (maybe I can enlist Christian to help investigate).

          So as a test I just added a boolean option which, if enabled, keeps all Han/Hiragana/Katakana marked as "Chinese/Japanese" (it uses the ISO 15924 Japanese code, but I overrode the toString to try to prevent confusion).

          Seems to work OK: some trivial snippets from smartcn and kuromoji are analyzed fine, and testRandomStrings is happy.
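To illustrate the idea of the boolean option described above, here is a minimal sketch that uses only the JDK's `Character.UnicodeScript`, not ICU or the patch itself. The class name `CjkScriptLabelSketch`, the methods `label` and `runs`, and the flag name `cjkAsWords` are all hypothetical and chosen for illustration; the actual patch may structure this quite differently.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the code from LUCENE-4381.patch.
final class CjkScriptLabelSketch {
    // Combined label, mirroring the "Chinese/Japanese" wording in the comment.
    static final String CJ_LABEL = "Chinese/Japanese";

    // Returns the script label for one code point. When cjkAsWords is true
    // (assumed flag name), Han, Hiragana and Katakana share one label.
    static String label(int codePoint, boolean cjkAsWords) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        boolean isCj = script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
        return (cjkAsWords && isCj) ? CJ_LABEL : script.name();
    }

    // Splits text into maximal runs of identically labeled code points,
    // formatted as "LABEL:text".
    static List<String> runs(String text, boolean cjkAsWords) {
        List<String> out = new ArrayList<>();
        int start = 0;
        String current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String l = label(cp, cjkAsWords);
            if (current != null && !l.equals(current)) {
                out.add(current + ":" + text.substring(start, i));
                start = i;
            }
            current = l;
            i += Character.charCount(cp);
        }
        if (current != null) {
            out.add(current + ":" + text.substring(start));
        }
        return out;
    }
}
```

With the flag off, a mixed Han/Hiragana/Katakana string is split at every script change; with it on, the whole string stays in a single run that a dictionary-based segmenter could then break into words.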

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Robert Muir added a comment -

          Here's a cleaned-up patch. I think it's ready.

          Our ICU is currently really out of date, and upgrading it allows us to delete a bunch of custom code.

          Steve Rowe added a comment -

          The issue title should read Unicode 6.3, right? There are several references to 6.3 in the patch.

          Robert, you haven't included any JFlex changes, as I did on the 6.1 upgrade issue (LUCENE-3747). JFlex trunk includes Unicode 6.3 support. I can handle the upgrade.

          Robert Muir added a comment -

          Thanks, I renamed the issue to clarify the scope.

          I didn't want to mess with the JFlex part, as some rules of the grammar have changed (in addition to the data).

          ASF subversion and git services added a comment -

          Commit 1547502 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1547502 ]

          LUCENE-4381: upgrade ICU to icu4j 52.1

          ASF subversion and git services added a comment -

          Commit 1547561 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1547561 ]

          LUCENE-4381: upgrade ICU to icu4j 52.1

          Robert Muir added a comment -

          I will open a separate issue for the JFlex tokenization.


            People

            • Assignee: Unassigned
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 3