I like what this patch does, but not how it does it. Nutch should perform bi-gram segmentation of CJK character sequences. This patch performs such segmentation in two places: in the character stream that is the input to the tokenizer, and in a filter that processes the output of the tokenizer. I'm unclear why the latter is required. The former should suffice, no?
But instead of segmenting in the character stream, it should be done in the tokenizer itself. I think this could be done with something like the following in NutchAnalysis.jj:
{
  // prevCJK is assumed here to be a String holding the previous CJK character, or null if there is none
  if (prevCJK != null) {
    matchedToken.image = prevCJK + matchedToken.image;
  } else {
    matchedToken.image = "_" + matchedToken.image;
  }
}
A little more would be required to maintain prevCJK. Thoughts?

Code of this kind can be used to perform third-party CJK word
segmentation in NutchAnalysis.jj. CJKTokenizer, a bi-gram segmenter, is used in the following example.
================================================================================
@@ -33,6 +33,7 @@
import org.apache.nutch.searcher.Query.Clause;
import org.apache.lucene.analysis.StopFilter;
import java.io.*;
TOKEN_MGR_DECLS : {
/** Constructs a token manager for the provided Reader. */
// chinese, japanese and korean characters

Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Say we have a Chinese string (c1)(c2)(c3)(c4); the result of bi-gram segmentation is:
matched-image  start-offset  end-offset
(c1)(c2)       0             2
(c2)(c3)       1             3
(c3)(c4)       2             4
In search summaries, we should merge the tokens if their offsets overlap. You can do it as follows: change the code
if (highlight.contains(t.termText())) {
  excerpt.addToken(t.termText());
  excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
  excerpt.add(new Highlight(text.substring(t.startOffset(),t.endOffset())));
  offset = t.endOffset();
  endToken = Math.min(j+SUM_CONTEXT, tokens.length);
}
to
if (highlight.contains(t.termText())) {

Jack,
Have you tested the latest patches attached to this issue + your fix for summarizer? I can confirm that, technically speaking, they appear to do what was described, but knowing no Chinese I cannot say whether they produce any useful output...

[[ Old comment, sent by email on Wed, 23 Aug 2006 10:44:13 +0800 ]]
Hi! Best wishes!
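
The merging of overlapping bi-gram tokens suggested above for search summaries could look roughly like the sketch below. It is only an illustration of the idea, not the Summarizer fix itself: the class HighlightMergeSketch and its Range type are hypothetical names that do not exist in Nutch, and the sketch assumes the token ranges arrive sorted by start offset. It also treats ranges that merely touch as one highlight, which is a choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch: merges overlapping highlight ranges. */
final class HighlightMergeSketch {

    /** Half-open character range [start, end) covered by one matched token. */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Merges ranges that overlap or touch; the input must be sorted by start offset. */
    static List<Range> merge(List<Range> sorted) {
        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            int last = merged.size() - 1;
            if (last >= 0 && r.start <= merged.get(last).end) {
                // Extend the previous range instead of emitting a second highlight.
                Range prev = merged.get(last);
                merged.set(last, new Range(prev.start, Math.max(prev.end, r.end)));
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Offsets of the bi-grams (c1)(c2), (c2)(c3), (c3)(c4) from the table above.
        List<Range> tokens = new ArrayList<>();
        tokens.add(new Range(0, 2));
        tokens.add(new Range(1, 3));
        tokens.add(new Range(2, 4));
        System.out.println(merge(tokens)); // prints [[0,4)]
    }
}
```

With the three bi-grams from the example, the merged result is a single range from 0 to 4, so the whole string (c1)(c2)(c3)(c4) would be highlighted once instead of three times with repeated characters.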
1. Modify NutchAnalysis.jj.
===========================================
@@ -106,7 +106,7 @@
}
// chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
===========================================
Why change "<SIGRAM: <CJK> >" to "<SIGRAM: (<CJK>)+ >"? Because term segmentation for Chinese (I don't know Japanese and Korean well) is totally different from English. In other words, treating each character as a separate word is inefficient for indexing and searching Chinese text.
2. Modify FastCharStream.java
===========================================
@@ -18,6 +18,8 @@
import java.io.*;
+import org.apache.lucene.analysis.Token;
+
/** An efficient implementation of JavaCC's CharStream interface. <p>Note that
@@ -69,10 +71,15 @@
if (charsRead == -1)
throw new IOException("read past eof");
else
+ {
+   charsRead = new CJKCharStream().readChineseChars(newPosition, charsRead);
+   bufferLength += charsRead;
+ }
}
+
+
+public final char BeginToken() throws IOException { tokenStart = bufferPosition; return readChar(); }
@@ -117,4 +124,45 @@
public final int getBeginLine() { return 1; }
+
+
+ final class CJKCharStream
+ {
+
+ /**
+ * @param newPosition
+ * @param charsRead
+ * @return
+ * @throws IOException
+ */
+ int readChineseChars(int newPosition, int charsRead)
+ throws IOException
+
+ }
+
}
To support "<SIGRAM: (<CJK>)+ >" in NutchAnalysis.jj, we do Chinese term segmentation in FastCharStream, which runs before NutchAnalysis's parse method. The main component is CJKTokenizer, which splits Chinese terms into bi-grams.
3. Add CJKTokenizer.java
4. Modify NutchDocumentTokenizer.java
===========================================
@@ -46,8 +46,11 @@
while (true) {
t = tokenManager.getNextToken();
switch (t.kind) { // skip query syntax tokens
- case EOF: case WORD: case ACRONYM: case SIGRAM:
+ case EOF: case WORD: case ACRONYM:
break loop;
+ case SIGRAM:
+   CJKTokenizer cjkT = new CJKTokenizer(input);
+   return cjkT.next();
default:
}
}
===========================================
NutchDocumentTokenizer.tokenStream() is called by NutchDocumentAnalyzer, and in this way the modified NutchDocumentTokenizer class lets NutchDocumentAnalyzer support Chinese.
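
For readers unfamiliar with bi-gram segmentation, the following is a minimal sketch of what happens to a run of CJK characters once it has been matched as one token. It is not the CJKTokenizer from the patch: the class BigramSketch is a hypothetical name, the isCJK check covers only a few common Unicode blocks, and emitting a lone CJK character as a single-character term is an assumption of this sketch. The sample string is written with Unicode escapes for the four characters 中文分词.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not the CJKTokenizer from the patch. */
final class BigramSketch {

    /** Rough CJK test that checks a few common Unicode blocks only (an assumption). */
    static boolean isCJK(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    /** Returns overlapping two-character terms for every run of CJK characters in text. */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isCJK(text.charAt(i))) {
                i++; // non-CJK text is ignored by this sketch
                continue;
            }
            int runStart = i;
            while (i < text.length() && isCJK(text.charAt(i))) {
                i++;
            }
            if (i - runStart == 1) {
                terms.add(text.substring(runStart, i)); // lone CJK character
            }
            // Slide a two-character window over the run: c1c2, c2c3, c3c4, ...
            for (int j = runStart; j + 2 <= i; j++) {
                terms.add(text.substring(j, j + 2));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four CJK characters followed by an English word.
        List<String> terms = bigrams("\u4e2d\u6587\u5206\u8bcd test");
        System.out.println(terms.size()); // prints 3
    }
}
```

A run of four characters yields three overlapping terms, which matches the (c1)(c2), (c2)(c3), (c3)(c4) pattern discussed earlier in this issue.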