Key: NUTCH-36
Type: Improvement
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Jack Tang
Votes: 5
Watchers: 3
Nutch

Chinese in Nutch

Created: 05/Apr/05 11:14 AM   Updated: 07/Nov/06 06:22 AM
Component/s: indexer, searcher
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
File &#26700 2005-04-05 11:58 AM Jack Tang 4 kB
Environment: all


Description
Nutch currently supports Chinese only in a very simple way: NutchAnalysis segments CJK text character by character.
So, if I search for the Chinese term 'FooBar' (two Chinese characters, 'Foo' and 'Bar'), the web GUI highlights 'FooBar' and also 'Foo' and 'Bar' on their own, whereas we expect Nutch to highlight only 'FooBar'.

Jack Tang added a comment - 05/Apr/05 11:20 AM
Follow the steps below to make Nutch support Chinese well.

1. Modify NutchAnalysis.jj.
===========================================
@@ -106,7 +106,7 @@
}

// chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
===========================================

Why change "<SIGRAM:<CJK>>" to "<SIGRAM: (<CJK>)+>"? Because segmentation of Chinese terms (I don't know Japanese and Korean well) is totally different from English. In other words, character-by-character segmentation is inefficient for indexing and searching Chinese text.

2. Modify FastCharStream.java
===========================================
@@ -18,6 +18,8 @@

import java.io.*;

+import org.apache.lucene.analysis.Token;
+
/** An efficient implementation of JavaCC's CharStream interface. <p>Note that
 * this does not do line-number counting, but instead keeps track of the
 * character position of the token in the input, as required by Lucene's {@link
@@ -69,10 +71,15 @@
       if (charsRead == -1)
         throw new IOException("read past eof");
       else
-        bufferLength += charsRead;
+      {
+        charsRead = new CJKCharStream().readChineseChars(newPosition, charsRead);
+        bufferLength += charsRead;
+      }
   }

-  public final char BeginToken() throws IOException {
+
+
+public final char BeginToken() throws IOException {
     tokenStart = bufferPosition;
     return readChar();
   }
@@ -117,4 +124,45 @@
   public final int getBeginLine() { return 1; }
+
+
+  final class CJKCharStream
+  {
+
+    /**
+     * @param newPosition
+     * @param charsRead
+     * @return
+     * @throws IOException
+     */
+    int readChineseChars(int newPosition, int charsRead)
+      throws IOException
+    {
+      String str = new String(buffer,newPosition,charsRead);
+      CJKTokenizer tokenizer = new CJKTokenizer(new StringReader(str));
+      Token token = tokenizer.next();
+      StringBuffer sb = new StringBuffer();
+      while(token != null)
+      {
+        sb.append(token.termText()).append(" ");
+        token = tokenizer.next();
+      }
+
+
+
+      while(sb.length()>buffer.length-newPosition)
+      {
+        char[] newBuffer = new char[buffer.length*2];
+        System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+        buffer = newBuffer;
+      }
+
+      for(int i=0;i<sb.length();i++){
+        buffer[newPosition+i]=sb.charAt(i);
+      }
+
+      return sb.length();
+    }
+
+  }
+
 }
===========================================

To support "<SIGRAM: (<CJK>)+>" in NutchAnalysis.jj, we perform Chinese term segmentation in FastCharStream, which runs before NutchAnalysis's parse method. The main component is CJKTokenizer, which splits Chinese text into bi-grams.

3. Add CJKTokenizer.java

4. Modify NutchDocumentTokenizer.java
===========================================
@@ -46,8 +46,11 @@
 while (true) {
   t = tokenManager.getNextToken();
   switch (t.kind) { // skip query syntax tokens
-  case EOF: case WORD: case ACRONYM: case SIGRAM:
+  case EOF: case WORD: case ACRONYM:
     break loop;
+  case SIGRAM:
+    CJKTokenizer cjkT = new CJKTokenizer(input);
+    return cjkT.next();
   default:
   }
 }
===========================================
NutchDocumentTokenizer.tokenStream() is called by NutchDocumentAnalyzer, so in this way the modified NutchDocumentTokenizer class lets NutchDocumentAnalyzer support Chinese.
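
For readers unfamiliar with the bi-gram segmentation that the steps above delegate to CJKTokenizer, here is a minimal sketch of the idea. It is not the Lucene class: BigramSketch and Term are hypothetical names, offsets are relative to the start of the CJK run, and the input is assumed to contain only characters from the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch or Lucene: overlapping bi-grams of a CJK run. */
final class BigramSketch {

    /** One emitted term with [start, end) character offsets into the run. */
    static final class Term {
        final String text;
        final int start;
        final int end;

        Term(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Splits a run of CJK characters into overlapping two-character terms. */
    static List<Term> bigrams(String run) {
        List<Term> terms = new ArrayList<>();
        if (run.length() == 1) {
            // Assumption: a lone character has no neighbour, so emit it as-is.
            terms.add(new Term(run, 0, 1));
            return terms;
        }
        // Slide a two-character window one position at a time.
        for (int i = 0; i + 2 <= run.length(); i++) {
            terms.add(new Term(run.substring(i, i + 2), i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // A four-character Chinese string, written with escapes to stay ASCII-safe.
        System.out.println(bigrams("\u4e2d\u6587\u641c\u7d22"));
    }
}
```

A run of four characters yields three terms, each sharing one character with its neighbour, which is why a two-character query matches as a unit instead of as two unrelated single characters.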


Jack Tang added a comment - 05/Apr/05 11:58 AM
Attachment includes
1. patch of NutchAnalysis.jj
2. patch of FastCharStream.java
3. CJKTokenizer.java
4. patch of NutchDocumentTokenizer.java

Jack Tang made changes - 05/Apr/05 11:58 AM
Field Original Value New Value
Attachment [ 19479 ]
Doug Cutting added a comment - 12/Apr/05 07:20 AM
I like what this patch does, but not how it does it. Nutch should perform bi-gram segmentation of CJK character sequences. This patch performs such segmentation at two places: in the character stream that is the input to the tokenizer, and in a filter that processes the output of the tokenizer. I'm unclear why the latter is required. The former should suffice, no?

But instead of segmenting in the character stream it should be done in the tokenizer itself. I think this could be done with something like the following in NutchAnalysis.jj.

<SIGRAM: <CJK> >
{
  if (prevCJK) {
    matchedToken.image = prevCJK + matchedToken.image;
  } else {
    matchedToken.image = "_" + matchedToken.image;
  }
}

A little more would be required to maintain prevCJK.

Thoughts?
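
One way to picture the bookkeeping for prevCJK is the plain-Java sketch below. It is only an illustration under stated assumptions, not JavaCC code and not anything from the Nutch source: PrevCjkSketch, isCjk and tokens are hypothetical names, "CJK" is approximated with Character.UnicodeScript instead of the grammar's own <CJK> definition, and non-CJK characters are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of maintaining prevCJK; not Nutch code. */
final class PrevCjkSketch {

    /** Assumption: treat Han, Hiragana, Katakana and Hangul as "CJK". */
    static boolean isCjk(char c) {
        Character.UnicodeScript s = Character.UnicodeScript.of(c);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    /** Emits one token per CJK character, prefixed by its CJK predecessor or "_". */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        String prevCjk = null; // last CJK character seen, or null at the start of a run
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isCjk(c)) {
                prevCjk = null; // any non-CJK character ends the current run
                continue;
            }
            String image = String.valueOf(c);
            out.add(prevCjk != null ? prevCjk + image : "_" + image);
            prevCjk = image;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("\u4e2d\u6587 ab \u641c"));
    }
}
```

The extra state is a single variable that is set after every CJK match and cleared whenever a non-CJK token is seen; in a real token manager that reset would have to happen in the actions of the other token rules.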


Kerang Lv added a comment - 22/Sep/05 09:56 PM
Enlightened by your last comment: the bi-gram segmentation could be done with the following in NutchAnalysis.jj:
<SIGRAM: <CJK><CJK> > { input_stream.backup(1); }
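
To see why matching two characters and then backing up by one produces overlapping bi-grams, the rule can be simulated in plain Java. This is a simplified model rather than JavaCC itself: BackupSketch and scan are hypothetical names, the input is assumed to be a single run of CJK characters, and no other token rules are modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of "match <CJK><CJK>, then backup(1)"; not generated parser code. */
final class BackupSketch {

    /** Returns the two-character matches produced over a run of CJK characters. */
    static List<String> scan(String run) {
        List<String> out = new ArrayList<>();
        int pos = 0; // current read position in the run
        while (pos + 2 <= run.length()) {
            out.add(run.substring(pos, pos + 2)); // the two-character match
            pos += 2;                              // the scanner consumed both characters
            pos -= 1;                              // backup(1): the second one is read again
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("\u4e2d\u6587\u641c\u7d22"));
    }
}
```

In this model the last character of a run is always left in the stream after the final match, and a run of exactly one character produces no match at all, so a real grammar would presumably still need some rule that accepts a single CJK character.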

Kerang Lv added a comment - 27/Sep/05 10:52 PM
Code like the following can be used to plug a third-party CJK word segmenter into NutchAnalysis.jj. The example below uses CJKTokenizer, which performs bi-gram segmentation.
================================================================================
@@ -33,6 +33,7 @@
import org.apache.nutch.searcher.Query.Clause;

import org.apache.lucene.analysis.StopFilter;
+import org.apache.lucene.analysis.cjk.CJKTokenizer;

import java.io.*;
import java.util.*;
@@ -81,6 +82,14 @@
PARSER_END(NutchAnalysis)

TOKEN_MGR_DECLS : {
+ /** use CJKTokenizer to process cjk character */
+ private CJKTokenizer cjkTokenizer = null;
+
+ /** a global cjk token */
+ private org.apache.lucene.analysis.Token cjkToken = null;
+
+ /** start offset of cjk sequence */
+ private int cjkStartOffset = 0;

/** Constructs a token manager for the provided Reader. */
public NutchAnalysisTokenManager(Reader reader) {
@@ -106,7 +115,46 @@
}

// chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
+ {
+ /**
+ * use an instance of CJKTokenizer, cjkTokenizer, hold the maximum
+ * matched cjk chars, and cjkToken for the current token;
+ * reset matchedToken.image use cjkToken.termText();
+ * reset matchedToken.beginColumn use cjkToken.startOffset();
+ * reset matchedToken.endColumn use cjkToken.endOffset();
+ * backup the last char when the next cjkToken is valid.
+ */
+ if(cjkTokenizer == null) {
+   cjkTokenizer = new CJKTokenizer(new StringReader(image.toString()));
+   cjkStartOffset = matchedToken.beginColumn;
+   try {
+     cjkToken = cjkTokenizer.next();
+   } catch(IOException ioe) {
+     cjkToken = null;
+   }
+ }
+
+ if(cjkToken != null && !cjkToken.termText().equals("")) {
+   //sometime the cjkTokenizer returns an empty string, is it a bug?
+   matchedToken.image = cjkToken.termText();
+   matchedToken.beginColumn = cjkStartOffset + cjkToken.startOffset();
+   matchedToken.endColumn = cjkStartOffset + cjkToken.endOffset();
+   try {
+     cjkToken = cjkTokenizer.next();
+   } catch(IOException ioe) {
+     cjkToken = null;
+   }
+   if(cjkToken != null && !cjkToken.termText().equals("")) {
+     input_stream.backup(1);
+   }
+ }
+
+ if(cjkToken == null || cjkToken.termText().equals("")) {
+   cjkTokenizer = null;
+   cjkStartOffset = 0;
+ }
+ }


Jack Tang added a comment - 06/Oct/05 01:18 AM
Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Say we have a Chinese string (c1)(c2)(c3)(c4); the result of bi-gram segmentation is:
matched-image  start-offset  end-offset
(c1)(c2)       0             2
(c2)(c3)       1             3
(c3)(c4)       2             4

In search summaries, we should merge the tokens if their offsets overlap. You can do it like this:

change code

if (highlight.contains(t.termText())) {
  excerpt.addToken(t.termText());
  excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
  excerpt.add(new Highlight(text.substring(t.startOffset(),t.endOffset())));
  offset = t.endOffset();
  endToken = Math.min(j+SUM_CONTEXT, tokens.length);
}

to

if (highlight.contains(t.termText())) {
  if(offset * 2 == (t.startOffset() + t.endOffset() )) { // cjk bi-gram
    excerpt.addToken(t.termText().substring(offset - t.startOffset()));
    excerpt.add(new Fragment(text.substring(t.startOffset() + 1,offset)));
    excerpt.add(new Highlight(text.substring(t.startOffset() + 1 ,t.endOffset())));
  }
  else {
    excerpt.addToken(t.termText());
    excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
    excerpt.add(new Highlight(text.substring(t.startOffset() ,t.endOffset())));
  }
  offset = t.endOffset();
  endToken = Math.min(j+SUM_CONTEXT, tokens.length);
}
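
The merging idea can also be expressed independently of Summarizer, as a pass over the offsets of the matched tokens. The sketch below is a different formulation from the change above, not a drop-in replacement: HighlightMergeSketch and merge are hypothetical names, spans are [start, end) pairs assumed to be sorted by start offset, and only spans that truly overlap are merged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collapses overlapping token spans into single highlight ranges. */
final class HighlightMergeSketch {

    /** Merges [start, end) spans, sorted by start, whenever one begins before the previous ends. */
    static List<int[]> merge(List<int[]> spans) {
        List<int[]> merged = new ArrayList<>();
        for (int[] s : spans) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && s[0] < last[1]) {
                // Overlapping bi-gram: extend the current highlight instead of starting a new one.
                last[1] = Math.max(last[1], s[1]);
            } else {
                // Disjoint span: copy it so the caller's arrays are never modified.
                merged.add(new int[] { s[0], s[1] });
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] { 0, 2 });
        spans.add(new int[] { 1, 3 });
        spans.add(new int[] { 2, 4 });
        for (int[] r : merge(spans)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

With the three bi-gram spans from the table above, the overlapping ranges collapse into one highlight covering offsets 0 to 4, so each character is highlighted only once.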


Andrzej Bialecki added a comment - 10/Nov/05 05:07 AM
Jack,

Have you tested the latest patches attached to this issue + your fix for summarizer? I can test that technically speaking they appear to do what was described, but knowing no Chinese I cannot testify if they produce any useful output...


juwen added a comment - 07/Nov/06 06:22 AM

[[ Old comment, sent by email on Wed, 23 Aug 2006 10:44:13 +0800 ]]

Hi!
I am now using Nutch,
but I don't know how to make it support Chinese.
I have searched for a long time, but the results are not good.
Everything I found is from http://issues.apache.org/jira/browse/NUTCH-36
and the technology isn't mature.
Two years have passed. Can Nutch support Chinese well now?
Can you give me some information?

Best wishes!
I'm Chinese and my English isn't good, so there are some things I can't express very
well!

Juwen_zhong