Nutch / NUTCH-162

country code "jp" is used instead of language code "ja" for Japanese

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 0.7.1
    • Fix Version/s: None
    • Component/s: web gui
    • Labels:
      None
    • Environment:

      n/a

      Description

      In locale switching link for Japanese, "jp" is used as language code but it is an ISO country code. The language code "ja" should be used.

      By the way, I don't think many users are familiar with the ISO language codes. A Canadian user may click on "ca" without knowing that ca stands for Catalan, not Canadian English or French. Rather than listing the language codes, listing the language names in their respective languages may be better. (I say "may be" because the browser could show some language names as corrupted text if the current font does not support that language; this is a difficult problem.)
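The distinction between the two code sets can be checked against the tables that ship with the JDK. The following is a minimal sketch, not part of Nutch; the class name `IsoCodeCheck` and its helper methods are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

public class IsoCodeCheck {
    /** True if the code is a two-letter ISO 639 language code known to the JDK. */
    static boolean isLanguageCode(String code) {
        return Arrays.asList(Locale.getISOLanguages())
                .contains(code.toLowerCase(Locale.ROOT));
    }

    /** True if the code is a two-letter ISO 3166 country code known to the JDK. */
    static boolean isCountryCode(String code) {
        return Arrays.asList(Locale.getISOCountries())
                .contains(code.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("ja is a language code: " + isLanguageCode("ja")); // true
        System.out.println("jp is a language code: " + isLanguageCode("jp")); // false
        System.out.println("jp is a country code:  " + isCountryCode("jp"));  // true
        // "ca" is valid in both tables (Catalan and Canada), which is the
        // ambiguity a Canadian user would run into.
        System.out.println("ca is a language code: " + isLanguageCode("ca")); // true
        System.out.println("ca is a country code:  " + isCountryCode("ca"));  // true
    }
}
```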

      1. anchors_ja.properties
        0.1 kB
        Hiroaki Kawai
      2. cached_ja.properties
        0.2 kB
        Hiroaki Kawai
      3. explain_ja.properties
        0.1 kB
        Hiroaki Kawai
      4. search_ja.properties
        0.4 kB
        Hiroaki Kawai
      5. text_ja.properties
        0.3 kB
        Hiroaki Kawai

        Activity

        KuroSaka TeruHiko added a comment -

        This causes undesired behavior for Japanese users. If the Nutch main index.jsp is visited from a browser whose preferred language is set to Japanese, the app server's main page is displayed instead (Tomcat's Welcome page, for example). This is because the Nutch index.jsp tries to redirect to the non-existent "ja/".

        Paul Baclace added a comment -

        The best practice for identifying localization is to use the ISO language and country code in the form of lowercase language code followed by upper case country code. This makes it possible to use specific idioms used in particular countries. English has over a dozen variants; a few examples are:

        enAU-English-Australia
        enIE-English-Ireland
        enJM-English-Jamaica
        enUS-English-United_States

        Inexplicably, different codes were used for the Japanese language and the country Japan. The locale is jaJP. Meanwhile, Javanese in Java is jwJA.

        The web gui should obtain the user's preferred language and country combination from the HTTP request headers and use the nearest matching Locale:

        http://java.sun.com/docs/books/tutorial/i18n/locale/create.html

        This is preferred over having the user pick the language and/or country from a list, since the user might not be able to read the labels.
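As a rough illustration of the "nearest matching Locale" idea, here is a sketch that uses the language-range lookup available in `java.util.Locale` since Java 8 (so it postdates the Nutch version this issue was filed against). The `SUPPORTED` list and the `LocaleNegotiation` class are assumptions made up for the example. In a JSP the header value would come from `request.getHeader("Accept-Language")`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LocaleNegotiation {
    // Hypothetical set of locales the web gui has pages for.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("en"),
            Locale.forLanguageTag("ja"),
            Locale.forLanguageTag("zh-CN"),
            Locale.forLanguageTag("zh-TW"));

    /**
     * Picks the supported locale nearest to an Accept-Language header value,
     * or the given fallback when the header is missing or nothing matches.
     */
    static Locale pick(String acceptLanguage, Locale fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback;
        }
        // Ranges are ordered by their q-weights; lookup() truncates each
        // range from the right ("ja-JP" then "ja") until a supported tag matches.
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        Locale match = Locale.lookup(ranges, SUPPORTED);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        System.out.println(pick("ja-JP,ja;q=0.8,en;q=0.5", Locale.ENGLISH).toLanguageTag());
        System.out.println(pick("en-AU", Locale.ENGLISH).toLanguageTag());
        System.out.println(pick("zh-TW,zh;q=0.9", Locale.ENGLISH).toLanguageTag());
    }
}
```

With this approach a browser that sends "ja-JP" still lands on the "ja" pages, and an unsupported language falls back to whatever default the caller passes in.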

        KuroSaka TeruHiko added a comment -

        I agree with Paul in principle. With the current way of designating a language by the language code alone, there is no way to distinguish Simplified Chinese from Traditional Chinese, the two written variants of the Chinese language. These have traditionally been distinguished by zh_cc, where cc is a country code such as tw (Taiwan, which uses the Traditional form) or cn (People's Republic of China, which uses the Simplified form). Nutch currently displays Simplified Chinese for "zh", which would disappoint Traditional Chinese readers.

        I am not too sure about always using the llcc naming convention.

        (1) To be compatible with the web standard & practice, and not inventing another naming convention, I would prefer using minus as delimiter, e.g. en-au.

        (2) Not all languages need a country modifier. Japanese, for example, is spoken (by a community large enough to develop its own dialect) only in Japan. Major browsers send out "ja", not "ja-jp".

        (3) We would still need the generic "en" (or "fr") because many browsers have a generic English (French) setting with which "en" is sent without a country code, and because we would need a fallback locale when the browser specifies an unsupported country variant of English.

        (By the way, jwJA is an interesting example, but jw is not a registered ISO language code (it's jv), and Java is not a country.)

        Another thing we need to be concerned about is the implication of the new locale naming scheme. The language identifier and analyzer plugins (in Trunk) are consumers of the locale too. The current code assumes two-letter language names. This needs to be extended to accept both types of names, ll and ll-cc.
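A sketch of what accepting both ll and ll-cc names might look like, including the generic fallback described in point (3). The `LocaleNames` class and its methods are hypothetical and not taken from the Nutch code base. The lowercase, hyphen-delimited output follows the "en-au" style suggested in point (1).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LocaleNames {
    /** Parses "ll", "ll-cc" or "ll_cc" (any letter case) into a Locale. */
    static Locale parse(String name) {
        return Locale.forLanguageTag(name.trim().replace('_', '-'));
    }

    /**
     * Returns the names to try, most specific first:
     * "en-AU" gives ["en-au", "en"], while "ja" gives just ["ja"].
     */
    static List<String> candidates(String name) {
        Locale locale = parse(name);
        List<String> names = new ArrayList<>();
        String language = locale.getLanguage();
        String country = locale.getCountry().toLowerCase(Locale.ROOT);
        if (language.isEmpty()) {
            return names; // not a usable locale name
        }
        if (!country.isEmpty()) {
            names.add(language + "-" + country);
        }
        names.add(language); // generic fallback such as "en" or "fr"
        return names;
    }

    public static void main(String[] args) {
        System.out.println(candidates("en-AU")); // [en-au, en]
        System.out.println(candidates("zh_TW")); // [zh-tw, zh]
        System.out.println(candidates("ja"));    // [ja]
    }
}
```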

        KuroSaka TeruHiko added a comment -

        It seems many .html files are actually generated by the ant target "generate-docs" in build.xml, so only these four changes are needed to fix this bug:

        mv src/web/include/jp src/web/include/ja
        mv src/web/pages/jp src/web/pages/ja
        Edit src/web/pages/ja/search.xml to replace one occurrence of "jp" with "ja".
        Edit src/web/include/footer.html to replace two occurrences of "jp" with "ja" on a single line.

        These are the diffs:
        $ diff src/web/pages/ja/search.xml src/web/pages/ja/search.xml~
        5c5
        < <input type="hidden" name="lang" value="ja"/>
        ---
        > <input type="hidden" name="lang" value="jp"/>

        $ diff src/web/include/footer.html src/web/include/footer.html~
        16c16
        < <a href="../ja/">ja</a> |
        ---
        > <a href="../jp/">jp</a> |

        Hiroaki Kawai added a comment -

        We need some Japanese property files to make "ja" the default language selection (because of String language = ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale()).getLocale().getLanguage(); in search.jsp, for example).

        I'll submit those property files.
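To see why the property files matter, it helps to look at which locale-specific bundles `ResourceBundle.getBundle` searches for with its default control. The sketch below only lists the candidates and loads no files; the `BundleCandidates` class is hypothetical, and the base name is the one quoted from the JSP above. If none of the "ja" candidates has a matching properties file, the lookup moves on to the default locale or the base bundle, so `getLocale().getLanguage()` on the result would not be "ja".

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class BundleCandidates {
    /** Lists the locales getBundle() tries, in order, for a requested locale. */
    static List<Locale> candidatesFor(Locale requested) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        return control.getCandidateLocales("org.nutch.jsp.search", requested);
    }

    public static void main(String[] args) {
        // A browser sending ja-JP: the search goes ja_JP, then ja, then the
        // root bundle. search_ja.properties satisfies the second candidate.
        for (Locale candidate : candidatesFor(Locale.JAPAN)) {
            String label = candidate.equals(Locale.ROOT) ? "(root)" : candidate.toString();
            System.out.println(label);
        }
    }
}
```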

        Hiroaki Kawai added a comment -

        Please put these property files in src/web/locale/org/nutch/jsp/.

        Markus Jelsma added a comment -

        Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

          People

          • Assignee: Unassigned
          • Reporter: KuroSaka TeruHiko
          • Votes: 1
          • Watchers: 0
