Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      The character list in the CJK section of StandardTokenizer.jj appears to be incomplete. A more complete list follows:

      < CJK: // non-alphabets
      [
      "\u1100"-"\u11ff",
      "\u3040"-"\u30ff",
      "\u3130"-"\u318f",
      "\u31f0"-"\u31ff",
      "\u3300"-"\u337f",
      "\u3400"-"\u4dbf",
      "\u4e00"-"\u9fff",
      "\uac00"-"\ud7a3",
      "\uf900"-"\ufaff",
      "\uff65"-"\uffdc"
      ]
      >
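
      As a rough illustration of what the proposed list covers, the ranges can be transcribed into a small Java lookup. This is only a sketch: the class name ProposedCjkRanges and the method contains are hypothetical, are not part of Lucene, and do not reflect how the JavaCC-generated tokenizer works internally.

```java
// Hypothetical helper, not part of Lucene: mirrors the proposed <CJK> list.
final class ProposedCjkRanges {
    // Inclusive [start, end] code point pairs copied from the list above.
    private static final int[][] RANGES = {
        {0x1100, 0x11FF},
        {0x3040, 0x30FF},
        {0x3130, 0x318F},
        {0x31F0, 0x31FF},
        {0x3300, 0x337F},
        {0x3400, 0x4DBF},
        {0x4E00, 0x9FFF},
        {0xAC00, 0xD7A3},
        {0xF900, 0xFAFF},
        {0xFF65, 0xFFDC},
    };

    // Returns true if the code point falls inside any proposed range.
    static boolean contains(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

      For example, ProposedCjkRanges.contains(0x4E00) is true, while ProposedCjkRanges.contains(0x3100) is false, because the proposed list skips U+3100 - U+312F.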

      1. StandardTokenizer.jj.diff
        1.0 kB
        Steve Rowe
      2. StandardTokenizer.jj.diff
        1.0 kB
        Steve Rowe

        Activity

        otis Otis Gospodnetic added a comment -

        Thanks, I committed Steven Rowe's patch, although it doesn't seem to fully match what he said in his earlier comments (e.g. in his patch, I don't see the range he mentioned in 5.b).

        steve_rowe Steve Rowe added a comment -

        Removed stray comma - obsoletes previous patch

        steve_rowe Steve Rowe added a comment -

        Patch addressing the above-described issues

        steve_rowe Steve Rowe added a comment -

        There are six classes of issues:

        1. A character range in StandardTokenizer.jj that is missing in
        John's list, and should be left as-is in StandardTokenizer.jj
        (in the <CJ> section):

        1.a. [ U+3100 - U+312F ]
        BoPoMoFo (a.k.a. ZhuYin): Phonetic transcription symbols
        used in Taiwan; not used on mainland China.

        2. A character range in StandardTokenizer.jj that is also in
        John's list, but in the <LETTER> section rather than in the <CJ>
        section, and should be left as-is:

        2.a. [ U+1100 - U+11FF ]
        Korean Jamo (phonetic symbols)

        3. A character range in StandardTokenizer.jj that is not present in
        John's list, and that should be removed from the <KOREAN> section
        in StandardTokenizer.jj:

        3.a. [ U+D7A4 - U+D7AF ]
        Non-character range at the end of the pre-composed Hangul
        (Korean) block

        4. A character range in John's list that is missing in
        StandardTokenizer.jj, but which was not present in Unicode 3.0, and
        so strictly should not be included when running on Java 1.4; since
        this is a non-character range in Unicode 3.0, however, I think it
        should be included in StandardTokenizer.jj (in the <CJ> section)
        for future compatibility with Java 1.5 and Unicode 4.0:

        4.a. [ U+31F0 - U+31FF ]
        Japanese Katakana phonetic extensions; these were introduced
        in Unicode version 3.2 (see
        http://www.unicode.org/reports/tr28/tr28-3.html#10_3_katakana )

        5. Character ranges in John's list that are missing in
        StandardTokenizer.jj, and that should be added to the newly
        re-labeled <CJ> section:

        5.a. [ U+FF65 - U+FF9F ]
        Half-width Japanese Katakana (phonetic symbols)

        5.b. [ U+3D2E - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded)
        CJK Ideograph Extension A.
        This range was introduced in Unicode 3.0.

        6. A character range in John's list that is missing in
        StandardTokenizer.jj, and that should be added to the <LETTER>
        section, since it, like the [ U+1100 - U+11FF ] range already
        included there, is a range of Korean Jamo (phonetic symbols):

        6.a. [ U+FFA0 - U+FFDC ]
        Half-width Korean Jamo (phonetic symbols)
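
        The six recommendations above can be sketched as a Java classifier. This is an illustration under stated assumptions, not the committed patch: the class RecommendedTokenClasses, the enum Kind, and the method classify are hypothetical names. The starting <CJ> and <KOREAN> ranges are the ones quoted elsewhere in this issue as the current code, and only the two Jamo ranges of <LETTER> are modeled (the real <LETTER> token is assumed to cover many more ranges).

```java
// Hypothetical sketch of the recommended sections; not the actual grammar.
final class RecommendedTokenClasses {
    enum Kind { CJ, KOREAN, JAMO_LETTER, OTHER }

    // Inclusive code point ranges for the <CJ> section after the changes.
    private static final int[][] CJ = {
        {0x3040, 0x318F}, // existing range; keeps Bopomofo U+3100-U+312F (1.a)
        {0x31F0, 0x31FF}, // Katakana phonetic extensions (4.a)
        {0x3300, 0x337F}, // existing range
        {0x3400, 0x4DB5}, // existing U+3400-U+3D2D extended by 5.b
        {0x4E00, 0x9FFF}, // existing range
        {0xF900, 0xFAFF}, // existing range
        {0xFF65, 0xFF9F}, // half-width Katakana (5.a)
    };

    // <KOREAN> trimmed to end at U+D7A3 (3.a).
    private static final int[][] KOREAN = { {0xAC00, 0xD7A3} };

    // Jamo ranges that belong in <LETTER> (2.a and 6.a).
    private static final int[][] JAMO_LETTER = {
        {0x1100, 0x11FF},
        {0xFFA0, 0xFFDC},
    };

    private static boolean inAny(int[][] ranges, int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Maps a code point to the section it would fall into.
    static Kind classify(int codePoint) {
        if (inAny(CJ, codePoint)) return Kind.CJ;
        if (inAny(KOREAN, codePoint)) return Kind.KOREAN;
        if (inAny(JAMO_LETTER, codePoint)) return Kind.JAMO_LETTER;
        return Kind.OTHER;
    }
}
```

        Under these assumptions, classify(0x3105) (a Bopomofo letter) yields CJ, classify(0xFFA0) yields JAMO_LETTER, and classify(0xD7A4) yields OTHER, since that code point lies in the range item 3.a proposes to remove.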

        lucenebugs@danielnaber.de Daniel Naber added a comment -

        John, I'm not sure I understand: do you think that this issue can be closed now? If not, could you ask your i18n experts how your changes could be integrated into the current code (the one where K/Korean and CJ are separate things)?

        john.wang@gmail.com John Wang added a comment -

        Yes I am.

        Our i18n team has provided a more up-to-date list and I thought I'd contribute it back.

        -John

        lucenebugs@danielnaber.de Daniel Naber added a comment -

        This is how the code looks currently:

        < CJ: // Chinese, Japanese
        [
        "\u3040"-"\u318f",
        "\u3300"-"\u337f",
        "\u3400"-"\u3d2d",
        "\u4e00"-"\u9fff",
        "\uf900"-"\ufaff"
        ]
        >
        < KOREAN: // Korean
        [
        "\uac00"-"\ud7af"
        ]
        >

        Are your suggested changes still needed and if so, where should which range be added (Chinese/Japanese or Korean)?
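
        One way to approach that question is to compute which code points appear in the proposed list but not in the <CJ> and <KOREAN> ranges shown here, and vice versa. The following is a minimal sketch using java.util.BitSet; the class RangeDiff and its members are hypothetical names, and the <LETTER> section is not modeled because its contents are not quoted in this issue.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper for comparing two sets of inclusive code point ranges.
final class RangeDiff {
    // Ranges from the proposed <CJK> list in the issue description.
    static final int[][] PROPOSED = {
        {0x1100, 0x11FF}, {0x3040, 0x30FF}, {0x3130, 0x318F},
        {0x31F0, 0x31FF}, {0x3300, 0x337F}, {0x3400, 0x4DBF},
        {0x4E00, 0x9FFF}, {0xAC00, 0xD7A3}, {0xF900, 0xFAFF},
        {0xFF65, 0xFFDC},
    };

    // <CJ> and <KOREAN> ranges from the current code quoted above.
    static final int[][] CURRENT = {
        {0x3040, 0x318F}, {0x3300, 0x337F}, {0x3400, 0x3D2D},
        {0x4E00, 0x9FFF}, {0xF900, 0xFAFF}, {0xAC00, 0xD7AF},
    };

    private static BitSet toBits(int[][] ranges) {
        BitSet bits = new BitSet(0x10000);
        for (int[] range : ranges) {
            bits.set(range[0], range[1] + 1); // upper bound is exclusive
        }
        return bits;
    }

    // Returns the maximal runs covered by 'want' but not by 'have'.
    static List<String> missingFrom(int[][] have, int[][] want) {
        BitSet diff = toBits(want);
        diff.andNot(toBits(have));
        List<String> runs = new ArrayList<>();
        int start = diff.nextSetBit(0);
        while (start >= 0) {
            int end = diff.nextClearBit(start); // first code point after the run
            runs.add(String.format("U+%04X-U+%04X", start, end - 1));
            start = diff.nextSetBit(end);
        }
        return runs;
    }
}
```

        With these inputs, RangeDiff.missingFrom(RangeDiff.CURRENT, RangeDiff.PROPOSED) reports U+1100-U+11FF, U+31F0-U+31FF, U+3D2E-U+4DBF and U+FF65-U+FFDC, and swapping the arguments reports U+3100-U+312F and U+D7A4-U+D7AF. A range flagged here may still be covered by a section this sketch does not model, such as <LETTER>.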


          People

          • Assignee:
            otis Otis Gospodnetic
            Reporter:
            john.wang@gmail.com John Wang
          • Votes:
            2
            Watchers:
            1
