Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.7.1
Fix Version/s: None
Component/s: None
Description
I was browsing NutchAnalysis.jj and found that
Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
a Unicode character with the hex value xxxx) are not
part of the LETTER or CJK class. It seems to me that
Nutch cannot handle Korean documents at all.
I posted the above message at nutch-user ML and Cheolgoo Kang [appler@gmail.com]
replied as:
------------------------------------------------------------------------------------
There was similar issue with Lucene's StandardTokenizer.jj.
http://issues.apache.org/jira/browse/LUCENE-444
and
http://issues.apache.org/jira/browse/LUCENE-461
I have almost no experience with Nutch, but you can handle it like
those issues above.
------------------------------------------------------------------------------------
Both fixes should probably be ported back to NutchAnalysis.jj.
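For illustration only, here is a minimal Java sketch of the range check that the missing grammar entry amounts to. The class and method names (HangulRangeCheck, isHangulSyllable, containsHangul) are hypothetical and are not part of Nutch or Lucene. The sketch only shows which characters fall into the U+AC00 ... U+D7AF range mentioned above; it does not reproduce the token definitions in NutchAnalysis.jj or the Lucene patches.

```java
public class HangulRangeCheck {
    // Bounds of the Hangul Syllables range as cited in the report above.
    static final char HANGUL_START = '\uAC00';
    static final char HANGUL_END = '\uD7AF';

    // True if the character lies inside the cited Hangul Syllables range.
    static boolean isHangulSyllable(char c) {
        return c >= HANGUL_START && c <= HANGUL_END;
    }

    // True if any character of the text lies inside that range.
    static boolean containsHangul(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (isHangulSyllable(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+D55C U+AE00 spells the Korean word "Hangul".
        String korean = "\uD55C\uAE00";
        System.out.println(containsHangul(korean));
        System.out.println(containsHangul("Nutch"));
        // Cross-check against the JDK's own Unicode block lookup.
        System.out.println(Character.UnicodeBlock.of(korean.charAt(0))
                == Character.UnicodeBlock.HANGUL_SYLLABLES);
    }
}
```

A tokenizer whose letter classes leave out this range has no class to assign such characters to, which matches the symptom described in this report.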