Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java (working copy) @@ -33,23 +33,26 @@ import org.apache.lucene.analysis.cn.smart.WordSegmenter; import org.apache.lucene.analysis.cn.smart.WordTokenizer; +import org.apache.lucene.analysis.cn.smart.AnalyzerProfile; // for javadoc + /** - * - * SmartChineseAnalyzer 是一个智能中文分词模块, 能够利用概率对汉语句子进行最优切分, - * 并内嵌英文tokenizer,能有效处理中英文混合的文本内容。 - * - * 它的原理基于自然语言处理领域的隐马尔科夫模型(HMM), 利用大量语料库的训练来统计汉语词汇的词频和跳转概率, - * 从而根据这些统计结果对整个汉语句子计算最似然(likelihood)的切分。 - * - * 因为智能分词需要词典来保存词汇的统计值,SmartChineseAnalyzer的运行需要指定词典位置,如何指定词典位置请参考 - * org.apache.lucene.analysis.cn.smart.AnalyzerProfile - * - * SmartChineseAnalyzer的算法和语料库词典来自于ictclas1.0项目(http://www.ictclas.org), - * 其中词典已获取www.ictclas.org的apache license v2(APLv2)的授权。在遵循APLv2的条件下,欢迎用户使用。 - * 在此感谢www.ictclas.org以及ictclas分词软件的工作人员的无私奉献! - * - * @see org.apache.lucene.analysis.cn.smart.AnalyzerProfile - * + *
+ * SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text. + * The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. + * The text is first broken into sentences, then each sentence is segmented into words. + *
+ *+ * Segmentation is based upon the Hidden Markov Model. + * A large training corpus was used to calculate Chinese word frequency probability. + *
+ *+ * This analyzer requires a dictionary to provide statistical data. + * To specify the location of the dictionary data, refer to {@link AnalyzerProfile} + *
+ *+ * The included dictionary data is from ICTCLAS1.0. + * Thanks to ICTCLAS for their hard work, and for contributing the data under the Apache 2 License! + *
*/ public class SmartChineseAnalyzer extends Analyzer { @@ -57,15 +60,23 @@ private WordSegmenter wordSegment; + /** + * Create a new SmartChineseAnalyzer, using the default stopword list. + */ public SmartChineseAnalyzer() { this(true); } /** - * SmartChineseAnalyzer内部带有默认停止词库,主要是标点符号。如果不希望结果中出现标点符号, - * 可以将useDefaultStopWords设为true, useDefaultStopWords为false时不使用任何停止词 + *+ * Create a new SmartChineseAnalyzer, optionally using the default stopword list. + *
+ *+ * The included default stopword list is simply a list of punctuation. + * If you do not use this list, punctuation will not be removed from the text! + *
* - * @param useDefaultStopWords + * @param useDefaultStopWords true to use the default stopword list. */ public SmartChineseAnalyzer(boolean useDefaultStopWords) { if (useDefaultStopWords) { @@ -76,10 +87,14 @@ } /** - * 使用自定义的而不使用内置的停止词库,停止词可以使用SmartChineseAnalyzer.loadStopWords(InputStream)加载 - * - * @param stopWords - * @see SmartChineseAnalyzer.loadStopWords(InputStream) + *+ * Create a new SmartChineseAnalyzer, using the provided {@link Set} of stopwords. + *
+ *+ * Note: the set should include punctuation, unless you want to index punctuation! + *
+ * @param stopWords {@link Set} of stopwords to use. + * @see SmartChineseAnalyzer#loadStopWords(InputStream) */ public SmartChineseAnalyzer(Set stopWords) { this.stopWords = stopWords; @@ -90,8 +105,8 @@ TokenStream result = new SentenceTokenizer(reader); result = new WordTokenizer(result, wordSegment); // result = new LowerCaseFilter(result); - // 不再需要LowerCaseFilter,因为SegTokenFilter已经将所有英文字符转换成小写 - // stem太严格了, This is not bug, this feature:) + // LowerCaseFilter is not needed, as SegTokenFilter lowercases Basic Latin text. + // The porter stemming is too strict, this is not a bug, this is a feature:) result = new PorterStemFilter(result); if (stopWords != null) { result = new StopFilter(result, stopWords, false); @@ -100,13 +115,17 @@ } /** - * 从停用词文件中加载停用词, 停用词文件是普通UTF-8编码的文本文件, 每一行是一个停用词,注释利用“//”, 停用词中包括中文标点符号, 中文空格, - * 以及使用率太高而对索引意义不大的词。 + * Utility function to return a {@link Set} of stopwords from a UTF-8 encoded {@link InputStream}. + * The comment "//" can be used in the stopword list. * - * @param input 停用词文件 - * @return 停用词组成的HashSet + * @param input {@link InputStream} of UTF-8 encoded stopwords + * @return {@link Set} of stopwords. */ public static Set loadStopWords(InputStream input) { + /* + * Note: WordListLoader is not used here because this method allows for inline "//" comments. + * WordListLoader will only filter out these comments if they are on a separate line. + */ String line; Set stopWords = new HashSet(); try { Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/package.html =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/package.html (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/package.html (working copy) @@ -1,51 +1,22 @@ -Analyzer for Chinese. +Analyzers for Chinese. ++Three analyzers are provided for Chinese, each of which treats Chinese text in a different way. +
SmartChineseAnalyzer 是一个智能中文分词模块, 与 ChineseAnalyzer (切分每个汉字)和 -CJKAnalyzer (组合每两个汉字)不同, 它能够利用概率对汉语句子进行最优切分, 并内嵌英文tokenizer, -能有效处理中英文混合的文本内容。目前SmartChineseAnalyzer的词典库只支持简体中文。
- -它的原理基于自然语言处理领域的隐马尔科夫模型(HMM), 利用大量语料库的训练来统计汉语词汇的词频和跳转概率, -从而根据这些统计结果对整个汉语句子计算最似然(likelihood)的切分。
- -三种分词模块的分词结果比较, 由此可以看出智能分词更符合句子的原本语义, 从而提高搜索的准确率。 -
语句: 我是中国人+Example phrase: "我是中国人"
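For readers of the English text, the sketch below shows one way to run SmartChineseAnalyzer over this example phrase. It is illustrative only: the class name and the field name "text" are arbitrary, it assumes the pre-attribute TokenStream API in which next() returns a Token, and the exact tokens produced depend on the bundled dictionary data.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.SmartChineseAnalyzer;

    public class SmartChineseAnalyzerExample {
      public static void main(String[] args) throws Exception {
        // true = use the built-in stopword list (punctuation).
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(true);
        TokenStream stream = analyzer.tokenStream("text", new StringReader("我是中国人"));
        Token token;
        // Older TokenStream API: next() returns null once the stream is exhausted.
        while ((token = stream.next()) != null) {
          System.out.println(token.term());
        }
      }
    }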
因为智能分词需要词典来保存词汇的统计值,默认情况下,SmartChineseAnalyzer使用内置的词典库,当需要指定的词典库时,需要指定词典位置,如何指定词典位置请参考 -org.apache.lucene.analysis.cn.smart.AnalyzerProfile。
- -词库的下载地址为:http://code.google.com/p/imdict-chinese-analyzer/downloads/list - 下载文件analysis-data.zip保存到本地,解压即可使用。
- -最简单的指定词典库的办法就是运行时加上参数-Danalysis.data.dir -
如: java -Danalysis.data.dir=/path/to/analysis-data com.example.YourApplication- - -
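An English restatement of the configuration shown above, as a minimal sketch. The path and class name are placeholders; the code relies only on the analysis.data.dir system property and the public AnalyzerProfile.ANALYSIS_DATA_DIR field that appear elsewhere in this patch. The Chinese javadoc removed from AnalyzerProfile later in this patch also notes that the bundled coredict.mem and bigramdict.mem files must be deleted before an external dictionary directory is actually picked up.

    import org.apache.lucene.analysis.cn.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.cn.smart.AnalyzerProfile;

    public class ConfigureAnalysisData {
      public static void main(String[] args) {
        // Option 1: pass the JVM flag, e.g. java -Danalysis.data.dir=/path/to/analysis-data ...
        // Option 2: set the directory explicitly before the dictionaries are first loaded.
        AnalyzerProfile.ANALYSIS_DATA_DIR = "/path/to/analysis-data";
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        // ... use the analyzer as usual ...
      }
    }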
SmartChineseAnalyzer的JVM要求java 1.4及以上版本;Lucene -要求2.4.0及以上版本,Lucene 2.3.X版应该也可以使用,但未经测试,有需要的用户可自行测试。
- -SmartChineseAnalyzer的算法和语料库词典来自于ictclas1.0项目(http://www.ictclas.org), -其中词典已经著作权人www.ictclas.org允许,以apache license -v2(APLv2)协议发布。在遵循APLv2的条件下,欢迎用户使用。 -在此感谢www.ictclas.org以及ictclas分词软件的工作人员的辛勤工作和无私奉献!
Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java (working copy) @@ -23,38 +23,37 @@ import java.util.Properties; /** - * 在默认情况下,SmartChineseAnalyzer内置有词典库、默认停止词库,已经经过封装,用户可以直接使用。 + * Configure analysis data for SmartChineseAnalyzer + *+ * SmartChineseAnalyzer has a built-in dictionary and stopword list out-of-box. + *
+ *+ * In special circumstances a user may wish to configure SmartChineseAnalyzer with a custom data directory location. + *
+ * AnalyzerProfile is used to determine the location of the data directory containing bigramdict.dct and coredict.dct. + * The following order is used to determine the location of the data directory: * - * 特殊情况下,用户需要使用指定的词典库和停止词库,此时需要删除org.apache.lucene.analysis.cn.smart. hhmm下的 - * coredict.mem 和 bigramdict.mem, 然后使用AnalyzerProfile来指定词典库目录。 - * - * AnalyzerProfile 用来寻找存放分词词库数据 和停用词数据的目录, 该目录下应该有 bigramdict.dct, coredict.dct, - * stopwords_utf8.txt, 查找过程依次如下: - * ** analysis.data.dir=D:/path/to/analysis-data/ ** - * 当找不到analysis-data目录时,ANALYSIS_DATA_DIR设置为"",因此在使用前,必须在程序里显式指定data目录,例如: * - *
- * AnalyzerProfile.ANALYSIS_DATA_DIR = "/path/to/analysis-data"; - *- * */ public class AnalyzerProfile { + /** + * Global indicating the configured analysis data directory + */ public static String ANALYSIS_DATA_DIR = ""; static { @@ -65,7 +64,7 @@ String dirName = "analysis-data"; String propName = "analysis.properties"; - // 读取系统设置,在运行时加入参数:-Danalysis.data.dir=/path/to/analysis-data + // Try the system property:-Danalysis.data.dir=/path/to/analysis-data ANALYSIS_DATA_DIR = System.getProperty("analysis.data.dir", ""); if (ANALYSIS_DATA_DIR.length() != 0) return; @@ -86,9 +85,9 @@ } if (ANALYSIS_DATA_DIR.length() == 0) { - // 提示用户未找到词典文件夹 + // Dictionary directory cannot be found. System.err - .println("WARNING: Can not found lexical dictionary directory!"); + .println("WARNING: Can not find lexical dictionary directory!"); System.err .println("WARNING: This will cause unpredictable exceptions in your application!"); System.err Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/CharType.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/CharType.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/CharType.java (working copy) @@ -17,23 +17,49 @@ package org.apache.lucene.analysis.cn.smart; +/** + * Internal SmartChineseAnalyzer character type constants. + */ public class CharType { + /** + * Punctuation Characters + */ public final static int DELIMITER = 0; + /** + * Letters + */ public final static int LETTER = 1; + /** + * Numeric Digits + */ public final static int DIGIT = 2; + /** + * Han Ideographs + */ public final static int HANZI = 3; + /** + * Characters that act as a space + */ public final static int SPACE_LIKE = 4; - // (全角半角)标点符号,半角(字母,数字),汉字,空格,"\t\r\n"等空格或换行字符 + /** + * Full-Width letters + */ public final static int FULLWIDTH_LETTER = 5; - public final static int FULLWIDTH_DIGIT = 6; // 全角字符,字母,数字 + /** + * Full-Width alphanumeric characters + */ + public final static int FULLWIDTH_DIGIT = 6; + /** + * Other (not fitting any of the other categories) + */ public final static int OTHER = 7; } Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/SentenceTokenizer.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/SentenceTokenizer.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/SentenceTokenizer.java (working copy) @@ -25,14 +25,12 @@ import org.apache.lucene.analysis.Tokenizer; /** - * - * 包含一个完整句子的Token,从文件中读出,是下一步分词的对象 - * + * Tokenizes input into sentences. 
*/ public class SentenceTokenizer extends Tokenizer { /** - * 用来切断句子的标点符号 。,!?;,!?; + * End of sentence punctuation: 。,!?;,!?; */ public final static String PUNCTION = "。,!?;,!?;"; @@ -62,7 +60,7 @@ if (ci == -1) { break; } else if (PUNCTION.indexOf(ch) != -1) { - // 找到了句子末尾 + // End of a sentence buffer.append(ch); tokenEnd++; break; @@ -78,8 +76,7 @@ pch = ch; ci = bufferInput.read(); ch = (char) ci; - // 如果碰上了两个连续的skip字符,例如两个回车,两个空格或者, - // 一个回车,一个空格等等,将其视为句子结束,以免句子太长而内存不足 + // Two consecutive space-like characters (for example two carriage returns, two spaces, + // or a carriage return and a space) end the sentence, so an overly long sentence cannot exhaust memory if (Utility.SPACES.indexOf(ch) != -1 && Utility.SPACES.indexOf(pch) != -1) { // buffer.append(ch); Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/Utility.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/Utility.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/Utility.java (working copy) @@ -17,6 +17,12 @@ package org.apache.lucene.analysis.cn.smart; +import org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph; // for javadoc +import org.apache.lucene.analysis.cn.smart.hhmm.SegTokenFilter; // for javadoc + +/** + * SmartChineseAnalyzer utility constants and methods + */ public class Utility { public static final char[] STRING_CHAR_ARRAY = new String("未##串") .toCharArray(); @@ -30,24 +36,29 @@ public static final char[] END_CHAR_ARRAY = new String("末##末").toCharArray(); + /** + * Delimiters will be filtered to this character by {@link SegTokenFilter} + */ public static final char[] COMMON_DELIMITER = new char[] { ',' }; /** - * 需要跳过的符号,例如制表符,回车,换行等等。 + * Space-like characters that need to be skipped: such as space, tab, newline, carriage return. */ public static final String SPACES = " \t\r\n"; + /** + * Maximum bigram frequency (used in the {@link BiSegGraph} smoothing function). + */ public static final int MAX_FREQUENCE = 2079997 + 80000; /** - * 比较两个整数数组的大小, 分别从数组的一定位置开始逐个比较, 当依次相等且都到达末尾时, 返回相等, 否则未到达末尾的大于到达末尾的; - * 当未到达末尾时有一位不相等, 该位置数值大的数组大于小的 + * Compare two arrays starting at the specified offsets. * - * @param larray - * @param lstartIndex larray的起始位置 - * @param rarray - * @param rstartIndex rarray的起始位置 - * @return 0表示相等,1表示larray > rarray, -1表示larray < rarray + * @param larray left array + * @param lstartIndex start offset into larray + * @param rarray right array + * @param rstartIndex start offset into rarray + * @return 0 if the arrays are equal, 1 if larray > rarray, -1 if larray < rarray */ public static int compareArray(char[] larray, int lstartIndex, char[] rarray, int rstartIndex) { @@ -74,21 +85,19 @@ } if (li == larray.length) { if (ri == rarray.length) { - // 两者一直相等到末尾,因此返回相等,也就是结果0 + // Both arrays are equivalent, return 0. return 0; } else { - // 此时不可能ri>rarray.length因此只有ri
+ * SmartChineseAnalyzer abstract dictionary implementation. + *
+ *+ * Contains methods for dealing with GB2312 encoding. + *
+ */ public abstract class AbstractDictionary { /** - * 第一个汉字为“啊”,他前面有15个区,共15*94个字符 + * First Chinese Character in GB2312 (15 * 94) + * Characters in GB2312 are arranged in a grid of 94 * 94, 0-14 are unassigned or punctuation. */ public static final int GB2312_FIRST_CHAR = 1410; /** - * GB2312字符集中01~87的字符集才可能有效,共8178个 + * Last Chinese Character in GB2312 (87 * 94). + * Characters in GB2312 are arranged in a grid of 94 * 94, 88-94 are unassigned. */ public static final int GB2312_CHAR_NUM = 87 * 94; /** - * 词库文件中收录了6768个汉字的词频统计 + * Dictionary data contains 6768 Chinese characters with frequency statistics. */ public static final int CHAR_NUM_IN_FILE = 6768; @@ -45,34 +55,34 @@ // B0F0 梆 榜 膀 绑 棒 磅 蚌 镑 傍 谤 苞 胞 包 褒 剥 // ===================================================== // - // GB2312 字符集的区位分布表: - // 区号 字数 字符类别 - // 01 94 一般符号 - // 02 72 顺序号码 - // 03 94 拉丁字母 - // 04 83 日文假名 + // GB2312 character set: + // 01 94 Symbols + // 02 72 Numbers + // 03 94 Latin + // 04 83 Kana // 05 86 Katakana - // 06 48 希腊字母 - // 07 66 俄文字母 - // 08 63 汉语拼音符号 - // 09 76 图形符号 - // 10-15 备用区 - // 16-55 3755 一级汉字,以拼音为序 - // 56-87 3008 二级汉字,以笔划为序 - // 88-94 备用区 + // 06 48 Greek + // 07 66 Cyrillic + // 08 63 Phonetic Symbols + // 09 76 Drawing Symbols + // 10-15 Unassigned + // 16-55 3755 Plane 1, in pinyin order + // 56-87 3008 Plane 2, in radical/stroke order + // 88-94 Unassigned // ====================================================== /** - * GB2312 共收录有 7445 个字符,其中简化汉字 6763 个,字母和符号 682 个。 + *+ * Transcode from GB2312 ID to Unicode + *
+ *+ * GB2312 is divided into a 94 * 94 grid, containing 7445 characters consisting of 6763 Chinese characters and 682 symbols. + * Some regions are unassigned (reserved). + *
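To make the grid arithmetic above concrete, here is a hedged sketch of recovering a character's position in the 94 x 94 GB2312 grid. It is not a copy of the implementation: it only mirrors the byte arithmetic visible in getGB2312Id() below (each of the two GB2312 bytes is offset by 0xA1), and the helper name is invented for illustration.

    public class Gb2312IdSketch {
      // Maps a character to a row-major index in the 94 x 94 GB2312 grid, or -1 if it
      // does not encode to two GB2312 bytes (mirroring the -1 convention described below).
      public static short toGb2312Id(char ch) throws java.io.UnsupportedEncodingException {
        byte[] buffer = Character.toString(ch).getBytes("GB2312");
        if (buffer.length != 2) {
          return -1;
        }
        int zone = (buffer[0] & 0xFF) - 0xA1;     // 区: first byte starts at 0xA1
        int position = (buffer[1] & 0xFF) - 0xA1; // 位: second byte starts at 0xA1
        return (short) (zone * 94 + position);
      }

      public static void main(String[] args) throws Exception {
        // 啊 sits at zone 16, position 1, i.e. index 15 * 94 = 1410 (GB2312_FIRST_CHAR).
        System.out.println(toGb2312Id('啊'));
      }
    }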
* - * GB2312 将所收录的字符分为 94 个区,编号为 01 区至 94 区;每个区收录 94 个字符,编号为 01 位至 94 - * 位,01为起始与0xA1,94位处于0xFE。GB2312 的每一个字符都由与其唯一对应的区号和位号所确定。例如:汉字“啊”,编号为 16 区 01 - * 位。 + * @param ccid GB2312 id + * @return unicode String */ - /** - * @param ccid - * @return - */ public String getCCByGB2312Id(int ccid) { if (ccid < 0 || ccid > WordDictionary.GB2312_CHAR_NUM) return ""; @@ -90,16 +100,16 @@ } /** - * 根据输入的Unicode字符,获取它的GB2312编码或者ascii编码, + * Transcode from Unicode to GB2312 * - * @param ch 输入的GB2312中文字符或者ASCII字符(128个) - * @return ch在GB2312中的位置,-1表示该字符不认识 + * @param ch input character in Unicode, or character in Basic Latin range. + * @return position in GB2312 */ public short getGB2312Id(char ch) { try { byte[] buffer = Character.toString(ch).getBytes("GB2312"); if (buffer.length != 2) { - // 正常情况下buffer应该是两个字节,否则说明ch不属于GB2312编码,故返回'?',此时说明不认识该字符 + // Should be a two-byte character return -1; } int b0 = (int) (buffer[0] & 0x0FF) - 161; // 编码从A1开始,因此减去0xA1=161 @@ -112,12 +122,10 @@ } /** - * 改进的32位FNV hash算法,用作本程序中的第一hash函数.第一和第二hash函数用来联合计算hash表, 使其均匀分布, - * 并能避免因hash表过密而导致的长时间计算的问题 + * 32-bit FNV Hash Function * - * @param c 待hash的Unicode字符 - * @return c的哈希值 - * @see Utility.hash2() + * @param c input character + * @return hashcode */ public long hash1(char c) { final long p = 1099511628211L; @@ -133,9 +141,10 @@ } /** - * @see Utility.hash1(char[]) - * @param carray - * @return + * 32-bit FNV Hash Function + * + * @param carray character array + * @return hashcode */ public long hash1(char carray[]) { final long p = 1099511628211L; @@ -155,16 +164,14 @@ } /** - * djb2哈希算法,用作本程序中的第二hash函数 - * * djb2 hash algorithm,this algorithm (k=33) was first reported by dan * bernstein many years ago in comp.lang.c. another version of this algorithm * (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; * the magic of number 33 (why it works better than many other constants, * prime or not) has never been adequately explained. * - * @param c - * @return + * @param c character + * @return hashcode */ public int hash2(char c) { int hash = 5381; @@ -177,9 +184,14 @@ } /** - * @see Utility.hash2(char[]) - * @param carray - * @return + * djb2 hash algorithm,this algorithm (k=33) was first reported by dan + * bernstein many years ago in comp.lang.c. another version of this algorithm + * (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; + * the magic of number 33 (why it works better than many other constants, + * prime or not) has never been adequately explained. + * + * @param carray character array + * @return hashcode */ public int hash2(char carray[]) { int hash = 5381; Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/BiSegGraph.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/BiSegGraph.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/BiSegGraph.java (working copy) @@ -26,6 +26,12 @@ import org.apache.lucene.analysis.cn.smart.Utility; +/** + * Graph representing possible token pairs (bigrams) at each start offset in the sentence. + *+ * For each start offset, a list of possible token pairs is stored. + *
+ */ public class BiSegGraph { private Map tokenPairListTable = new HashMap(); @@ -39,15 +45,8 @@ generateBiSegGraph(segGraph); } - /** - * 生成两两词之间的二叉图表,将结果保存在一个MultiTokenPairMap中 - * - * @param segGraph 所有的Token列表 - * @param smooth 平滑系数 - * @param biDict 二叉词典 - * @return - * - * @see MultiTokenPairMap + /* + * Generate a BiSegGraph based upon a SegGraph */ private void generateBiSegGraph(SegGraph segGraph) { double smooth = 0.1; @@ -57,7 +56,7 @@ int next; char[] idBuffer; - // 为segGraph中的每个元素赋以一个下标 + // get the list of tokens ordered and indexed segTokenList = segGraph.makeIndex(); // 因为startToken("始##始")的起始位置是-1因此key为-1时可以取出startToken int key = -1; @@ -119,31 +118,29 @@ } /** - * 查看SegTokenPair的结束位置为to(SegTokenPair.to为to)是否存在SegTokenPair, - * 如果没有则说明to处没有SegTokenPair或者还没有添加 + * Returns true if there is a list of token pairs at this offset (index of the second token) * - * @param to SegTokenPair.to - * @return + * @param to index of the second token in the token pair + * @return true if a token pair exists */ public boolean isToExist(int to) { return tokenPairListTable.get(new Integer(to)) != null; } /** - * 取出SegTokenPair.to为to的所有SegTokenPair,如果没有则返回null + * Return a {@link List} of all token pairs at this offset (index of the second token) * - * @param to - * @return 所有相同SegTokenPair.to的SegTokenPair的序列 + * @param to index of the second token in the token pair + * @return {@link List} of token pairs. */ public List getToList(int to) { return (List) tokenPairListTable.get(new Integer(to)); } /** - * 向BiSegGraph中增加一个SegTokenPair,这些SegTokenPair按照相同SegTokenPair. - * to放在同一个ArrayList中 + * Add a {@link SegTokenPair} * - * @param tokenPair + * @param tokenPair {@link SegTokenPair} */ public void addSegTokenPair(SegTokenPair tokenPair) { int to = tokenPair.to; @@ -158,16 +155,16 @@ } /** - * @return TokenPair的列数,也就是Map中不同列号的TokenPair种数。 + * Get the number of {@link SegTokenPair} entries in the table. + * @return number of {@link SegTokenPair} entries */ public int getToCount() { return tokenPairListTable.size(); } /** - * 用veterbi算法计算从起点到终点的最短路径 - * - * @return + * Find the shortest path with the Viterbi algorithm. + * @return {@link List} */ public List getShortPath() { int current; @@ -198,7 +195,7 @@ path.add(newNode); } - // 接下来从nodePaths中计算从起点到终点的真实路径 + // Next, calculate the actual path from the start node to the end node from the PathNodes computed above int preNode, lastNode; lastNode = path.size() - 1; current = lastNode; Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/BigramDictionary.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/BigramDictionary.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/BigramDictionary.java (working copy) @@ -32,6 +32,9 @@ import org.apache.lucene.analysis.cn.smart.AnalyzerProfile; +/** + * SmartChineseAnalyzer Bigram dictionary. + */ public class BigramDictionary extends AbstractDictionary { private BigramDictionary() { @@ -43,12 +46,8 @@ public static final int PRIME_BIGRAM_LENGTH = 402137; - /** - * bigramTable 来存储词与词之间的跳转频率, bigramHashTable 和 frequencyTable - * 就是用来存储这些频率的数据结构。 为了提高查询速度和节省内存, 采用 hash 值来代替关联词作为查询依据, 关联词就是 - * (formWord+'@'+toWord) , 利用 FNV1 hash 算法来计算关联词的hash值 ,并保存在 bigramHashTable - * 中,利用 hash 值来代替关联词有可能会产生很小概率的冲突, 但是 long 类型 - * (64bit)的hash值有效地将此概率降到极低。bigramHashTable[i]与frequencyTable[i]一一对应 + /* + * The word associations are stored as FNV1 hashcodes, which have a small probability of collision, but save memory. 
*/ private long[] bigramHashTable; @@ -128,7 +127,7 @@ bigramHashTable = new long[PRIME_BIGRAM_LENGTH]; frequencyTable = new int[PRIME_BIGRAM_LENGTH]; for (int i = 0; i < PRIME_BIGRAM_LENGTH; i++) { - // 实际上将0作为初始值有一点问题,因为某个字符串可能hash值为0,但是概率非常小,因此影响不大 + // it is possible for a value to hash to 0, but the probability is extremely low bigramHashTable[i] = 0; frequencyTable[i] = 0; } @@ -141,10 +140,9 @@ } /** - * 将词库文件加载到WordDictionary的相关数据结构中,只是加载,没有进行合并和修改操作 + * Load the datafile into this BigramDictionary * - * @param dctFilePath - * @return + * @param dctFilePath path to the Bigramdictionary (bigramdict.mem) * @throws FileNotFoundException * @throws IOException * @throws UnsupportedEncodingException @@ -159,14 +157,14 @@ String tmpword; RandomAccessFile dctFile = new RandomAccessFile(dctFilePath, "r"); - // 字典文件中第一个汉字出现的位置是0,最后一个是6768 + // GB2312 characters 0 - 6768 for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + CHAR_NUM_IN_FILE; i++) { String currentStr = getCCByGB2312Id(i); // if (i == 5231) // System.out.println(i); - dctFile.read(intBuffer);// 原词库文件在c下开发,所以写入的文件为little - // endian编码,而java为big endian,必须转换过来 + dctFile.read(intBuffer); + // the dictionary was developed for C, and byte order must be converted to work with Java cnt = ByteBuffer.wrap(intBuffer).order(ByteOrder.LITTLE_ENDIAN).getInt(); if (cnt <= 0) { continue; @@ -272,9 +270,8 @@ return -1; } - /** - * @param c - * @return + /* + * lookup the index into the frequency array. */ private int getBigramItemIndex(char carray[]) { long hashId = hash1(carray); Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/HHMMSegmenter.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/HHMMSegmenter.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/HHMMSegmenter.java (working copy) @@ -23,18 +23,18 @@ import org.apache.lucene.analysis.cn.smart.Utility; import org.apache.lucene.analysis.cn.smart.WordType; +/** + * Finds the optimal segmentation of a sentence into Chinese words + */ public class HHMMSegmenter { private static WordDictionary wordDict = WordDictionary.getInstance(); /** - * 寻找sentence中所有可能的Token,最后再添加两个特殊Token,"始##始", - * "末##末","始##始"Token的起始位置是-1,"末##末"Token的起始位置是句子的长度 + * Create the {@link SegGraph} for a sentence. * - * @param sentence 输入的句子,不包含"始##始","末##末"等 - * @param coreDict 核心字典 - * @return 所有可能的Token - * @see MultiTokenMap + * @param sentence input sentence, without start and end markers + * @return {@link SegGraph} corresponding to the input sentence. */ private SegGraph createSegGraph(String sentence) { int i = 0, j; @@ -168,16 +168,16 @@ } /** - * 为sentence中的每个字符确定唯一的字符类型 + * Get the character types for every character in a sentence. 
* * @see Utility.charType(char) - * @param sentence 输入的完成句子 - * @return 返回的字符类型数组,如果输入为null,返回也是null + * @param sentence input sentence + * @return array of character types corresponding to character positions in the sentence */ private static int[] getCharTypes(String sentence) { int length = sentence.length(); int[] charTypeArray = new int[length]; - // 生成对应单个汉字的字符类型数组 + // the type of each character by position for (int i = 0; i < length; i++) { charTypeArray[i] = Utility.getCharType(sentence.charAt(i)); } @@ -185,6 +185,11 @@ return charTypeArray; } + /** + * Return a list of {@link PathNode} representing the best segmentation of a sentence + * @param sentence input sentence + * @return best segmentation as a {@link List} + */ public List process(String sentence) { SegGraph segGraph = createSegGraph(sentence); BiSegGraph biSegGraph = new BiSegGraph(segGraph); Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/PathNode.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/PathNode.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/PathNode.java (working copy) @@ -17,6 +17,12 @@ package org.apache.lucene.analysis.cn.smart.hhmm; +/** + * SmartChineseAnalyzer internal node representation + *+ * Used by {@link BiSegGraph} to maximize the segmentation with the Viterbi algorithm. + *
+ */ public class PathNode implements Comparable { public double weight; Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java (working copy) @@ -23,42 +23,53 @@ import java.util.List; import java.util.Map; +/** + * Graph representing possible tokens at each start offset in the sentence. + *+ * For each start offset, a list of possible tokens is stored. + *
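The structure described here is essentially a map from start offset to the list of candidate tokens beginning at that offset. A simplified, purely illustrative sketch follows (SegToken is reduced to a String, the offsets and candidate words are hypothetical, and generics are used for readability even though the contrib code itself targets older JVMs):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SegGraphSketch {
      // start offset -> candidate tokens that begin at that offset
      private final Map<Integer, List<String>> tokenListTable = new HashMap<Integer, List<String>>();

      public void addToken(int startOffset, String token) {
        List<String> tokens = tokenListTable.get(Integer.valueOf(startOffset));
        if (tokens == null) {
          tokens = new ArrayList<String>();
          tokenListTable.put(Integer.valueOf(startOffset), tokens);
        }
        tokens.add(token);
      }

      public static void main(String[] args) {
        SegGraphSketch graph = new SegGraphSketch();
        // Hypothetical candidates for "我是中国人": several tokens may share a start offset.
        graph.addToken(0, "我");
        graph.addToken(1, "是");
        graph.addToken(2, "中");
        graph.addToken(2, "中国");
        graph.addToken(4, "人");
        System.out.println(graph.tokenListTable);
      }
    }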
+ */ public class SegGraph { /** - * 用一个ArrayList记录startOffset相同的Token,这个startOffset就是Token的key + * Map of start offsets to ArrayList of tokens at that position */ - private Map tokenListTable = new HashMap(); + private Map /*+ * Filters a {@link SegToken} by converting full-width latin to half-width, then lowercasing latin. + * Additionally, all punctuation is converted into {@link Utility#COMMON_DELIMITER} + *
+ */ public class SegTokenFilter { + /** + * Filter an input {@link SegToken} + *+ * Full-width latin will be converted to half-width, then all latin will be lowercased. + * All punctuation is converted into {@link Utility#COMMON_DELIMITER} + *
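A hedged illustration of the conversion just described, mirroring the code-point arithmetic visible in filter() below (full-width forms U+FF01-U+FF5E sit 0xFEE0 above their Basic Latin counterparts, and adding 0x20 lowercases an ASCII capital); the sample text and class name are arbitrary.

    public class WidthFoldingSketch {
      public static void main(String[] args) {
        char[] text = { 'Ｌ', 'ｕ', 'ｃ', 'ｅ', 'ｎ', 'ｅ', '２' }; // full-width letters and a digit
        for (int i = 0; i < text.length; i++) {
          // Shift full-width forms down to their half-width equivalents.
          if (text[i] >= 0xFF10) {
            text[i] -= 0xFEE0;
          }
          // Lowercase half-width Basic Latin capitals.
          if (text[i] >= 0x0041 && text[i] <= 0x005A) {
            text[i] += 0x0020;
          }
        }
        System.out.println(new String(text)); // expected to print "lucene2"
      }
    }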
+ * + * @param token input {@link SegToken} + * @return normalized {@link SegToken} + */ public SegToken filter(SegToken token) { switch (token.wordType) { case WordType.FULLWIDTH_NUMBER: - case WordType.FULLWIDTH_STRING: + case WordType.FULLWIDTH_STRING: /* first convert full-width -> half-width */ for (int i = 0; i < token.charArray.length; i++) { if (token.charArray[i] >= 0xFF10) token.charArray[i] -= 0xFEE0; - if (token.charArray[i] >= 0x0041 && token.charArray[i] <= 0x005A) + if (token.charArray[i] >= 0x0041 && token.charArray[i] <= 0x005A) /* lowercase latin */ token.charArray[i] += 0x0020; } break; case WordType.STRING: for (int i = 0; i < token.charArray.length; i++) { - if (token.charArray[i] >= 0x0041 && token.charArray[i] <= 0x005A) + if (token.charArray[i] >= 0x0041 && token.charArray[i] <= 0x005A) /* lowercase latin */ token.charArray[i] += 0x0020; } break; - case WordType.DELIMITER: + case WordType.DELIMITER: /* convert all punctuation to Utility.COMMON_DELIMITER */ token.charArray = Utility.COMMON_DELIMITER; break; default: Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/SegTokenPair.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/SegTokenPair.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/SegTokenPair.java (working copy) @@ -17,15 +17,21 @@ package org.apache.lucene.analysis.cn.smart.hhmm; +/** + * A pair of tokens in {@link SegGraph} + */ public class SegTokenPair { public char[] charArray; /** - * from和to是Token对的index号,表示本TokenPair的两个Token在segGragh中的位置。 + * index of the first token in {@link SegGraph} */ public int from; + /** + * index of the second token in {@link SegGraph} + */ public int to; public double weight; Index: contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/WordDictionary.java =================================================================== --- contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/WordDictionary.java (revision 789155) +++ contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/WordDictionary.java (working copy) @@ -33,6 +33,10 @@ import org.apache.lucene.analysis.cn.smart.AnalyzerProfile; import org.apache.lucene.analysis.cn.smart.Utility; +/** + * SmartChineseAnalyzer Word Dictionary + * + */ public class WordDictionary extends AbstractDictionary { private WordDictionary() { } @@ -41,7 +45,7 @@ private static WordDictionary singleInstance; /** - * 一个较大的素数,保证hash查找能够遍历所有位置 + * Large prime number for hash function */ public static final int PRIME_INDEX_LENGTH = 12071; @@ -66,6 +70,10 @@ // static Logger log = Logger.getLogger(WordDictionary.class); + /** + * Get the singleton dictionary instance. + * @return singleton + */ public synchronized static WordDictionary getInstance() { if (singleInstance == null) { singleInstance = new WordDictionary(); @@ -82,10 +90,9 @@ } /** - * 从外部文件夹dctFileRoot加载词典库文件,首先测试是否有coredict.mem文件, 如果有则直接作为序列化对象加载, - * 如果没有则加载词典库源文件coredict.dct + * Attempt to load dictionary from provided directory, first trying coredict.mem, falling back to coredict.dct * - * @param dctFileName 词典库文件的路径 + * @param dctFileRoot path to dictionary directory */ public void load(String dctFileRoot) { String dctFilePath = dctFileRoot + "/coredict.dct"; @@ -119,9 +126,8 @@ } /** - * 从jar内部加载词典库文件,要求保证WordDictionary类当前路径中有coredict.mem文件,以将其作为序列化对象加载 + * Load coredict.mem internally from the jar file. 
* - * @param dctFileName 词典库文件的路径 * @throws ClassNotFoundException * @throws IOException */ @@ -171,10 +177,10 @@ } /** - * 将词库文件加载到WordDictionary的相关数据结构中,只是加载,没有进行合并和修改操作 + * Load the datafile into this WordDictionary * - * @param dctFilePath - * @return + * @param dctFilePath path to word dictionary (coredict.mem) + * @return number of words read * @throws FileNotFoundException * @throws IOException * @throws UnsupportedEncodingException @@ -188,13 +194,13 @@ String tmpword; RandomAccessFile dctFile = new RandomAccessFile(dctFilePath, "r"); - // 字典文件中第一个汉字出现的位置是0,最后一个是6768 + // GB2312 characters 0 - 6768 for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + CHAR_NUM_IN_FILE; i++) { // if (i == 5231) // System.out.println(i); - dctFile.read(intBuffer);// 原词库文件在c下开发,所以写入的文件为little - // endian编码,而java为big endian,必须转换过来 + dctFile.read(intBuffer); + // the dictionary was developed for C, and byte order must be converted to work with Java cnt = ByteBuffer.wrap(intBuffer).order(ByteOrder.LITTLE_ENDIAN).getInt(); if (cnt <= 0) { wordItem_charArrayTable[i] = null; @@ -287,8 +293,8 @@ wordItem_frequencyTable[delimiterIndex] = null; } - /** - * 本程序不做词性标注,因此将相同词不同词性的频率合并到同一个词下,以减小存储空间,加快搜索速度 + /* + * since we aren't doing POS-tagging, merge the frequencies for entries of the same word (with different POS) */ private void mergeSameWords() { int i; @@ -350,12 +356,9 @@ } } - /** + /* * 计算字符c在哈希表中应该在的位置,然后将地址列表中该位置的值初始化 * - * @param c - * @param j - * @return */ private boolean setTableIndex(char c, int j) { int index = getAvaliableTableIndex(c); @@ -390,10 +393,6 @@ return -1; } - /** - * @param c - * @return - */ private short getWordItemTableIndex(char c) { int hash1 = (int) (hash1(c) % PRIME_INDEX_LENGTH); int hash2 = hash2(c) % PRIME_INDEX_LENGTH; @@ -465,32 +464,33 @@ } /** - * charArray这个单词对应的词组在不在WordDictionary中出现 + * Returns true if the input word appears in the dictionary * - * @param charArray - * @return true表示存在,false表示不存在 + * @param charArray input word + * @return true if the word exists */ public boolean isExist(char[] charArray) { return findInTable(charArray) != -1; } /** - * @see{getPrefixMatch(char[] charArray, int knownStart)} - * @param charArray - * @return + * Find the first word in the dictionary that starts with the supplied prefix + * + * @see #getPrefixMatch(char[], int) + * @param charArray input prefix + * @return index of word, or -1 if not found */ public int getPrefixMatch(char[] charArray) { return getPrefixMatch(charArray, 0); } /** - * 从词典中查找以charArray对应的单词为前缀(prefix)的单词的位置, 并返回第一个满足条件的位置。为了减小搜索代价, - * 可以根据已有知识设置起始搜索位置, 如果不知道起始位置,默认是0 + * Find the nth word in the dictionary that starts with the supplied prefix * - * @see{getPrefixMatch(char[] charArray)} - * @param charArray 前缀单词 - * @param knownStart 已知的起始位置 - * @return 满足前缀条件的第一个单词的位置 + * @see #getPrefixMatch(char[]) + * @param charArray input prefix + * @param knownStart relative position in the dictionary to start + * @return index of word, or -1 if not found */ public int getPrefixMatch(char[] charArray, int knownStart) { short index = getWordItemTableIndex(charArray[0]); @@ -521,11 +521,10 @@ } /** - * 获取idArray对应的词的词频,若pos为-1则获取所有词性的词频 + * Get the frequency of a word from the dictionary * - * @param charArray 输入的单词对应的charArray - * @param pos 词性,-1表示要求求出所有的词性的词频 - * @return idArray对应的词频 + * @param charArray input word + * @return word frequency, or zero if the word is not found */ public int getFrequency(char[] charArray) { short hashIndex = getWordItemTableIndex(charArray[0]); @@ -539,12 +538,11 @@ } /** - * 
判断charArray对应的字符串是否跟词典中charArray[0]对应的wordIndex的charArray相等, - * 也就是说charArray的位置查找结果是不是就是wordIndex + * Return true if the dictionary entry at itemIndex for table charArray[0] is charArray * - * @param charArray 输入的charArray词组,第一个数表示词典中的索引号 - * @param itemIndex 位置编号 - * @return 是否相等 + * @param charArray input word + * @param itemIndex item index for table charArray[0] + * @return true if the entry exists */ public boolean isEqual(char[] charArray, int itemIndex) { short hashIndex = getWordItemTableIndex(charArray[0]); Index: contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt =================================================================== --- contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt (revision 789155) +++ contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt (working copy) @@ -1,4 +1,4 @@ -////////// 将标点符号全部去掉 //////////////// +////////// Punctuation tokens to remove //////////////// , . ` @@ -51,8 +51,8 @@ [ ] ● - //中文空格字符 + //IDEOGRAPHIC SPACE character (Used as a space in Chinese) -//////////////// 英文停用词 //////////////// +//////////////// English Stop Words //////////////// -//////////////// 中文停用词 //////////////// +//////////////// Chinese Stop Words ////////////////
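Tying the stopword pieces together, here is a hedged end-to-end sketch. The file name and its contents are hypothetical; the code relies only on the loadStopWords(InputStream) utility and the Set-based constructor documented in this patch, and the raw Set type matches the pre-generics signatures shown above.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Set;
    import org.apache.lucene.analysis.cn.SmartChineseAnalyzer;

    public class CustomStopwordsExample {
      public static void main(String[] args) throws Exception {
        // A UTF-8 text file with one stopword per line; "//" comments are allowed, e.g.:
        //   ，//full-width comma
        //   的
        InputStream input = new FileInputStream("my_stopwords.txt");
        Set stopWords = SmartChineseAnalyzer.loadStopWords(input);
        // As the javadoc notes, a custom list should include punctuation
        // unless punctuation is supposed to be indexed.
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(stopWords);
        // ... use the analyzer as usual ...
      }
    }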