  1. Nutch
  2. NUTCH-224

Nutch doesn't handle Korean text at all


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.7.1
    • Fix Version/s: None
    • Component/s: indexer
    • Labels: None

    Description

      I was browsing NutchAnalysis.jj and found that
      Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
      a Unicode character with the hex value xxxx) are not
      part of the LETTER or CJK class. It seems to me that
      Nutch cannot handle Korean documents at all.
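
      To make the range in question concrete, here is a minimal Java sketch of
      the membership test that a tokenizer character class effectively encodes.
      It is not taken from NutchAnalysis.jj or from the Lucene patches; the class
      and method names (HangulCheck, inReportedRange, isHangulSyllable) are
      hypothetical, and only the U+AC00 ... U+D7AF range comes from this report.
      The second method asks the JDK's own Unicode block tables the same question.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
final class HangulCheck {
    // Range quoted in the report: the Hangul Syllables block, U+AC00..U+D7AF.
    static final int HANGUL_FIRST = 0xAC00;
    static final int HANGUL_LAST = 0xD7AF;

    /** Plain range test, the kind of check a grammar character class encodes. */
    static boolean inReportedRange(int codePoint) {
        return codePoint >= HANGUL_FIRST && codePoint <= HANGUL_LAST;
    }

    /** The same question answered by the JDK's Unicode block lookup. */
    static boolean isHangulSyllable(int codePoint) {
        return UnicodeBlock.of(codePoint) == UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static void main(String[] args) {
        // Two Hangul syllables (U+D55C, U+AE00) followed by Latin text.
        String sample = "\uD55C\uAE00 Nutch";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X range=%b block=%b%n",
                cp, inReportedRange(cp), isHangulSyllable(cp)));
    }
}
```

      A grammar whose letter classes omit this range would have no token rule
      that matches such characters, which is the behavior described above.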

      I posted the above message to the nutch-user mailing list, and
      Cheolgoo Kang [appler@gmail.com] replied:
      ------------------------------------------------------------------------------------
      There was a similar issue with Lucene's StandardTokenizer.jj.

      http://issues.apache.org/jira/browse/LUCENE-444

      and

      http://issues.apache.org/jira/browse/LUCENE-461

      I have almost no experience with Nutch, but you can handle it like
      those issues above.
      ------------------------------------------------------------------------------------

      Both fixes should probably be ported to NutchAnalysis.jj.

          People

            Assignee: Unassigned
            Reporter: Kuro Kurosaka (tkurosaka)
            Votes: 1
            Watchers: 1
