Lucene - Core
LUCENE-1629

contrib intelligent Analyzer for Chinese

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      for Java 1.5 or higher, Lucene 2.4.1

    • Lucene Fields:
      New, Patch Available

      Description

      I wrote an analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/

      In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I) "是"(am) "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence properly, or there will be mis-understandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously!

      Although there are two analyzer packages in the Apache repository which can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This is obviously not how the language works, and this strategy also increases the index size and hurts performance badly.

      The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. Tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is about 60%.

      As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.

      1. LUCENE-1629-java1.4.patch
        139 kB
        Xiaoping Gao
      2. LUCENE-1629-encoding-fix.patch
        0.8 kB
        Uwe Schindler
      3. coredict.mem
        1.51 MB
        Xiaoping Gao
      4. build-resources-with-folder.patch
        8 kB
        Uwe Schindler
      5. build-resources.patch
        7 kB
        Uwe Schindler
      6. build-resources.patch
        1 kB
        Uwe Schindler
      7. bigramdict.mem
        4.60 MB
        Xiaoping Gao
      8. analysis-data.zip
        2.02 MB
        Xiaoping Gao

        Activity

        Xiaoping Gao added a comment -

        Here is all the source code of the intelligent analyzer for Chinese, about 2500 lines.
        The unit TestCase contains a main method, which needs the lexical dictionary to run, so I will post the binary lexical dictionary soon.

        Xiaoping Gao added a comment -

        Lexical dictionary files. Unzip them somewhere, then run TestSmartChineseAnalyzer with this command:
        java org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer -Danalysis.data.dir=/path/to/analysis-data/

        Michael McCandless added a comment -

        Patch looks good – thanks Xiaoping!

        One problem is that contrib/analyzers is currently limited to Java 1.4, and I don't think we should change that at this point (though in 3.0, we will change it to 1.5). How hard would it be to switch your sources to use only Java 1.4?

        A couple other issues:

        • Each copyright header is missing the starting 'S' in the sentence 'ee the License for the specific language governing permissions and'
        • Can you remove the @author tags? (Lucene sources don't include author tags anymore)
        Uwe Schindler added a comment -

        Hi Xiaoping,

        looks good, but I have some suggestions:

        • Making the data file only readable from a RandomAccessFile makes it hard to e.g. move the data file directly into the jar file. I would like to put the data file directly into the package directory and load it with Class.getResourceAsStream(). In this case, the binary Lucene analyzer jar would be ready to use and the analyzer would run out of the box. Configuring external files in e.g. web applications is often complicated; an automatic way to load the file from the JAR would be nice (see the sketch after this list).
        • I have seen some singleton implementations where the getInstance() static method is not synchronized. Without that there may be more than one instance if different threads call getInstance() at the same time or close together.
        • Do we compile the source files with a fixed encoding of UTF-8 (in build.xml)? If not, there may be problems if the Java compiler uses another encoding (the platform default).
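
        A minimal sketch of what the first two suggestions could look like together, assuming a hypothetical DictionaryHolder class (illustrative only, not the actual analyzer code):

            // Hypothetical sketch: a synchronized singleton that loads its data file
            // from the classpath instead of requiring an external file on disk.
            import java.io.IOException;
            import java.io.InputStream;

            public class DictionaryHolder {

                private static DictionaryHolder instance;

                private DictionaryHolder() throws IOException {
                    // Load the data file from the same package inside the jar.
                    InputStream in = DictionaryHolder.class.getResourceAsStream("coredict.mem");
                    if (in == null) {
                        throw new IOException("coredict.mem not found on classpath");
                    }
                    try {
                        // ... read and deserialize the dictionary here ...
                    } finally {
                        in.close();
                    }
                }

                // synchronized so two threads cannot create two instances concurrently
                public static synchronized DictionaryHolder getInstance() throws IOException {
                    if (instance == null) {
                        instance = new DictionaryHolder();
                    }
                    return instance;
                }
            }
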
        Xiaoping Gao added a comment -

        to McCandless:
        There is a lot of code depending on Java 1.5; I use enums and generics frequently, because I saw these points on the Apache wiki:

        • All core code to be included in 2.X releases should be compatible with Java 1.4.
        • All contrib code should be compatible with either Java 5 or 1.4.
          I have corrected the copyright header and @author tags, thank you.

        to Schindler:
        1. This is really a good idea. I want to move the data file into the jar in the next development cycle, but for now I need to make some changes to the data files independently; can I just commit the code now?
        2. I have changed the getInstance() method to be synchronized.
        3. All the source files are encoded in UTF-8, and I have put a notice in package.html. Should I do something else?

        Thank you all!

        Xiaoping Gao added a comment -

        New patch in reply to Michael McCandless and Uwe Schindler 's comments.

        Robert Muir added a comment -

        Hi,

        I see in the paper that lexical resources were also developed for Big5 (Traditional Chinese). Are you able to acquire these resources under a BSD license as well?

        Michael McCandless added a comment -

        There is a lot of code depending on Java 1.5; I use enums and generics frequently, because I saw these points on the Apache wiki:

        Well... "in general" contrib packages can be 1.5, but the analyzers contrib package is widely used and is not 1.5 now, so it's a biggish change to force it to 1.5 with this. We should at least discuss it separately on java-dev if we want to consider allowing 1.5 code into contrib-analyzers.

        We could hold off on committing this until 3.0?

        Xiaoping Gao added a comment -

        I have ported the code to Java 1.4 today; fortunately there were not too many problems.

        "Lucene-1629-java1.4.patch" is all the code working on Java 1.4; I have just changed it to fit the Java 1.4 code style. Data structures and algorithms are not modified.
        It has been tested to produce the very same result, just with a slight effect on speed.

        Xiaoping Gao added a comment -

        all the code working on java1.4

        Michael McCandless added a comment -

        all the code working on java1.4

        Fabulous, thanks Xiaoping!

        Michael McCandless added a comment -

        When I apply the patch and then run "ant test" in contrib/analyzers, I'm hitting this compilation error:

        compile-core:
            [mkdir] Created dir: /lucene/src/cn.1629/build/contrib/analyzers/classes/java
            [javac] Compiling 88 source files to /lucene/src/cn.1629/build/contrib/analyzers/classes/java
            [javac] /lucene/src/cn.1629/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java:98: load(java.io.InputStream) in java.util.Properties cannot be applied to (java.io.FileReader)
            [javac]       prop.load(reader);
            [javac]           ^
            [javac] Note: Some input files use or override a deprecated API.
            [javac] Note: Recompile with -Xlint:deprecation for details.
            [javac] 1 error
        
        Xiaoping Gao added a comment -

        New patch for Java 1.4; I have corrected the bug with "java.util.Properties.load(Reader)".
        The new code compiles now.
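
        For reference, a hedged sketch of the kind of change involved: Java 1.4 only offers Properties.load(InputStream), so the file has to be opened as a stream rather than a FileReader. The class and method names below are illustrative, not the actual patch.

            // Illustrative only: load a properties file in a Java 1.4-compatible way,
            // using Properties.load(InputStream) instead of the unavailable load(Reader).
            import java.io.FileInputStream;
            import java.io.IOException;
            import java.util.Properties;

            public class PropertiesLoadSketch {
                static Properties loadProperties(String path) throws IOException {
                    Properties prop = new Properties();
                    FileInputStream in = new FileInputStream(path); // instead of new FileReader(path)
                    try {
                        prop.load(in);
                    } finally {
                        in.close();
                    }
                    return prop;
                }
            }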

        Michael McCandless added a comment -

        Xiaoping, could you turn the TestSmartChineseAnalyzer into a real JUnit test case (i.e., invoke that sample method from the testChineseAnalyzer method)?

        Also, it looks like you didn't switch to Class.getResourceAsStream() (Uwe's suggestion above) – are you planning on doing that?

        Finally, Robert asked a question above (about Big5) that maybe you missed?

        Do we compile the source files with a fixed encoding of UTF-8 (build.xml?). If not, there may be problems, if the Java compiler uses another encoding (because platform default).

        Lucene's common-build.xml already sets the encoding (for javac) to utf-8. So I think we're good here...

        Xiaoping Gao added a comment -

        to Robert Muir:
        The dictionary only supports the GB2312 encoding now, which has about 6800 characters, so I don't think it can support the Big5 encoding with this dictionary.
        You can ask the author about the Big5 issue; maybe he has another lexical dictionary.

        Now I will switch to Class.getResourceAsStream() to load the dictionary, so users don't have to download the dictionary separately.
        After that I can write a real JUnit test case.

        Robert Muir added a comment -

        Xiaoping, thanks. I see they didn't get great performance with the Big5 tests; I was just curious.

        Maybe mention somewhere in the javadocs that this analyzer is for Simplified Chinese text, just so it's clear?

        Xiaoping Gao added a comment -

        Changes:
        1. Add two binary dictionary files to the Java package: coredict.mem (1.6 MB) and bigramdict.mem (4.7 MB); I'll post them after this.
        2. Use Class.getResourceAsStream() to load the dictionaries, so users don't need to download them manually.
        3. Turn TestSmartChineseAnalyzer into a real JUnit test case (a sketch of what such a test might look like follows below).
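
        A minimal sketch of such a test, assuming JUnit 3 style and the old (pre-2.9) TokenStream API; the constructor and the token-access calls are illustrative and may differ from the committed test:

            // Hedged sketch of a JUnit test for the analyzer; not the committed test case.
            import java.io.StringReader;
            import junit.framework.TestCase;
            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.Token;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.cn.SmartChineseAnalyzer;

            public class TestSmartChineseAnalyzerSketch extends TestCase {
                public void testChineseAnalyzer() throws Exception {
                    Analyzer analyzer = new SmartChineseAnalyzer(); // assumed no-arg constructor
                    TokenStream ts = analyzer.tokenStream("field", new StringReader("我是中国人"));
                    Token token;
                    while ((token = ts.next()) != null) {
                        // expect word-level tokens such as 我 / 是 / 中国人 rather than bigrams
                        assertTrue(token.termText().length() > 0);
                    }
                }
            }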

        Xiaoping Gao added a comment -

        Two binary dictionary files; please put them into contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/

        Michael McCandless added a comment -

        When I run "ant test" in contrib/analyzers, SmartChineseAnalyzer is unable to locate the stopwords.txt:

            [junit] Testcase: testChineseAnalyzer(org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer):	Caused an ERROR
            [junit] null
            [junit] java.lang.NullPointerException
            [junit] 	at java.io.Reader.<init>(Reader.java:61)
            [junit] 	at java.io.InputStreamReader.<init>(InputStreamReader.java:80)
            [junit] 	at org.apache.lucene.analysis.cn.SmartChineseAnalyzer.loadStopWords(SmartChineseAnalyzer.java:112)
            [junit] 	at org.apache.lucene.analysis.cn.SmartChineseAnalyzer.<init>(SmartChineseAnalyzer.java:71)
            [junit] 	at org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer.testChineseAnalyzer(TestSmartChineseAnalyzer.java:36)
        
        Xiaoping Gao added a comment -

        On Mon, May 11, 2009 at 6:57 PM, Michael McCandless (JIRA)

        stopwords.txt should be in the same package as org.apache.lucene.analysis.cn.SmartChineseAnalyzer; can you find it there?

        Michael McCandless added a comment -

        I do have the file, but at runtime the JRE cannot locate it using Class.getResourceAsStream().

        Are you able to run "ant test -Dtestcase=TestSmartChineseAnalyzer" from the command line in contrib/analyzers successfully?

        Uwe Schindler added a comment - - edited

        Does the <jar> ANT task also add the non-*.class files? During compilation, the additional files must be copied to the build directory; this is normally done by an additional copy task (I do it this way). The packager then packs all files below build into the jar file. Maybe the build script must be modified?
        I will try this out later.

        Xiaoping Gao added a comment -

        I think Schindler is right.
        I modified the code to skip loading stopwords.txt, but the NullPointerException
        popped up again when loading the coredict.mem file. When I run
        TestSmartChineseAnalyzer from Eclipse, it runs successfully.
        So the problem might be in the ant build script.

        Uwe Schindler added a comment -

        I did some checks now; it is a problem of the ant script. Because of this, e.g. ArabicAnalyzer throws an IOException (but this is not tested, so no test failures occur).
        The ant script should copy all the data files to the build/classes directory after compiling and before building the jar.

        I do not know how to fix this correctly, because I do not fully understand all the parts of the build files and how maven and common-build.xml work together with contrib-build and so on.
        The simplest fix would be to customize the "compile" target for the analyzers package and list there all files that must be copied during the compilation step.

        Should I open an additional bug report for the ArabicAnalyzer, or should we fix the build.xml for analyzers with this issue?

        Michael McCandless added a comment -

        The simplest fix would be to customize the "compile" target for the analyzers package and list there all files that must be copied during the compilation step.

        Let's just do this fix, under this issue, for all contrib/analyzers that need to load a resource?

        Uwe Schindler added a comment -

        Hi Mike,

        Here is a patch that adds a Maven-like resources directory. It patches the build script in two ways:

        • The junit test classpath is extended to include src/resources
        • The jarify macro is changed to also add src/resources to the jar file

        So all resource files must be put into the corresponding subdirectory under src/resources. The patch contains this for the stopwords.txt file of the Arabic analyzer. The data files should be removed from src/java.

        The cn analyzer's stopwords must be put in the top-level cn directory, the mem files into cn/smart/hhmm (it took me some time to find this out).

        The patch also includes some src/resources directory additions. For the compilation to work, every src/ directory now needs at least an empty resources folder. I found no way to make the jarify macro work without this.

        If somebody has an idea, it would be good.

        Xiaoping Gao added a comment -

        I think it is unacceptable to ask every package to have a resources folder;
        can we write the build script to test whether the resources folder exists,
        like this:

        <available property="resources.exists" file="${resources.dir}" type="dir"/>
        <target name="index" depends="compile" description="Build WordNet index">
          <do_something if="resources.exists">
            package the resources.
          </do_something>
        </target>

        Uwe Schindler added a comment -

        I know this; the problem with the Lucene build is that JARing is done using a macro called <jarify>, and there this is not possible. From ANT 1.7.1 on there is the possibility to specify "erroronmissingdir" when using <fileset/>: http://ant.apache.org/manual/CoreTypes/fileset.html

        I do not know what version of ant we require, but using that attribute the error can be avoided.

        Michael McCandless added a comment -

        (Shooting in the dark, here, since I'm no ant expert...)

        Lucene's common-build.xml has this:

        <!-- Copy any data files present to the classpath -->
        <copy todir="@{destdir}">
          <fileset dir="@{srcdir}" excludes="**/*.java"/>
        </copy>
        

        Which for all tests will copy any resources (any file that's not *.java) into the corresponding build/classes directory; e.g. contrib/xml-query-parser's tests rely on this. This approach doesn't cause any errors when a given contrib module doesn't have resources. Is there some way to use a similar approach here (and not bump up the minimum ant version required)?

        Uwe Schindler added a comment -

        I wonder why this build fragment did not work for contrib. The only problem is that this also copies the package.html and overview javadoc files; they should also be excluded.

        Michael McCandless added a comment -

        That fragment is under "compile-test-macro", which is run only on src/test/*. I agree, we should fix it to not copy package/javadoc files.

        Uwe Schindler added a comment -

        I will look into it this evening and provide a patch.

        Because of the file-exclusion problems, I thought the approach of having a separate resources directory (like Maven does it) would be a good addition. We could also do this for the tests. In my opinion, data files should be separated from source files. And adding the resources folder to the classpath during tests saves a lot of disk space during compilation and testing (OK, that's not important). This way the compilation/test classpath and building the jar files are separate concerns.
        The only problem with my current approach is that the JAR packager fails when the directory is not available. Is it so bad to just add an empty resources folder to every compilation unit? This would be similar to Maven.

        Michael McCandless added a comment -

        OK, I agree, separation of resources from source code is good.

        Can we limit the required addition of src/resources/org/apache/lucene/* to just contrib/analyzers? I.e., somehow only override its jarify macro?

        Uwe Schindler added a comment -

        Only the src/resources folder itself is needed, no subfolders; I think it would be no problem to add this folder to every compilation unit (I added it to my svn checkout in minutes). The good thing is that future developments then know where to put the resource files. But I agree, there should be a better way to automatically detect the resources folder before ANT 1.7.1.

        Maybe we should ask Erik Hatcher as the ANT specialist...!

        Erik Hatcher added a comment - - edited

        My initial thought is to move the <copy> excluding

         **/*.java and **/*.html

        to the "compile" macro. In the ancient past, Ant actually used to do this automatically with <javac>.

        Uwe Schindler added a comment -

        Here is another try with Erik's suggestion:
        I moved the <copy> task to the compile macro and extended the list of exclusions. With some work and verbose=true, I added all "source" files to the exclusions (also .jj and so on).

        Using this patch, you can compile Xiaoping Gao's patch, add the resources to cn/ and cn/smart/hhmm/, and they appear on the classpath for testing and in the final jar file.

        My problem with this is the messy exclusion list. While reading the ANT docs, I found out that the <copy> task can be told not to stop on errors. The idea is now again to put the data files into a Maven-like resources folder and just copy them to the classpath (if the folder does not exist, copy simply does nothing).

        I'll post a patch/test later.

        Uwe Schindler added a comment -

        This is a second try, again with the resources folder. It is now optional to have a src/resources folder; if it exists, all files inside it are copied to the build destination.

        The trick is that the copy task can additionally use a glob mapper, and by that does the following:

        • The source fileset of the copy task uses the src/ folder directly
        • The fileset only includes resources/**
        • Because the target folder would then get an additional sub-folder "resources" (the base dir of the copy operation is "src/"), the filenames are rewritten by a glob mapper, stripping the "resources/" prefix from the relative path

        This patch also adds a simple test case that shows that ArabicAnalyzer does not start correctly when the stopwords.txt file is not on the classpath. The test fails if the stopwords.txt file stays at its original location and/or the copy operation is commented out.

        The patch does not contain the deletion of the Arabic stopwords file from the sources folder (it was binary), so remove it by hand or simply move it after applying the patch.

        Michael McCandless added a comment -

        Awesome! I've applied your patch, Uwe, and moved ArabicAnalyzer's stopwords.txt, as well as SmartChineseAnalyzer's stopwords.txt, bigramdict.mem, coredict.mem, under their respective subdirs under src/resources/*. I confirmed TestArabicAnalyzer passes (and verified it really did instantiate ArabicAnalyzer). All tests pass.

        I will commit shortly.

        This issue is a delightful example of the collaboration that makes open source development work so well. Thanks Xiaoping, Uwe and Erik!

        Michael McCandless added a comment -

        Thanks everyone!

        Uwe Schindler added a comment -

        Fine!
        Should I commit the ArabicAnalyzer test, too? I think the test is not really needed, though, as the new Chinese analyzer already tests the resources implicitly.

        One thing: the change is in the main changes.txt; normally it should be in contrib's changes.txt, or not? If it should stay there, we should also add Spatial and TrieRange to the main changes.txt.

        And one other thing: the analyzer (and many more) use the old TokenStream API at the moment; we should change this before 2.9 for all contrib analyzers, see LUCENE-1460?

        Michael McCandless added a comment -

        Should I commit the ArabicAnalyzer test, too?

        Woops, I missed it – I'll commit it. The more tests the better!

        The change is in the main changes.txt, normally it should be in contrib's changes.txt, or not?

        Woops – you're right. I'll move this to contrib's CHANGES.txt.

        Michael McCandless added a comment -

        The analyzer (and many more) use the old TokenStream API at the moment, we should change this before 2.9 for all contrib analyzers, see LUCENE-1460?

        Yes – we need to resolve LUCENE-1460 (and a great many more; the list keeps growing!) before 2.9.

        Uwe Schindler added a comment -

        Hi Mike,
        a small patch: the HTML files generated by Javadoc do not contain the charset header and are displayed as ISO-8859-1. This breaks the docs for the Chinese analyzer. The attached patch sets the output encoding correctly to UTF-8 using the <meta/> HTML tag.

        Xiaoping Gao added a comment -

        Test successful on my laptop now! Thank all of you for your patience and hard work!
        I will continue to maintain this analyzer and develop new features.

        Best Wishes!

        Michael McCandless added a comment -

        OK, I just committed that fix (javadocs encoding == UTF-8) Uwe. Thanks.

        Uwe Schindler added a comment -

        Hi Xiaoping,

        Thanks! The code is now committed.

        Just for my understanding (as I do not know Chinese and cannot read some of the comments), some questions/comments:
        The .mem files are serializations of the dictionaries. They are created by loading from the random access files (the .dct files) and then serializing to the .mem files. But for developers and further updates you need to have the .dct files and rerun these steps (that is what all these private methods are for).
        An interesting addition would be a custom build step that uses the .dct files and builds the .mem files from them. How could I invoke that? So maybe you could extract the otherwise unused .dct file loaders from the current classes and create a separate tool from them, invokable from ant, that builds the .mem files.

        Uwe

        P.S.: By the way, in these private conversion methods (which are never called from the library code) you have these default try-catch blocks, which is bad programming practice. The proposed separate conversion tool should correctly handle the exceptions, or better, just not catch them at all and pass them up (side note: I hate Eclipse for generating these auto-catch blocks; better would be to auto-add throws clauses to the method signatures!).
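
        As a small illustration of that last point (hypothetical class and method names, not part of the patch), a conversion tool can simply declare the checked exception instead of swallowing it:

            // Let checked exceptions propagate instead of hiding them in auto-generated
            // try-catch blocks; the caller (e.g. an ant task wrapper) then sees the real failure.
            import java.io.IOException;

            public class DctToMemConverterSketch {
                // hypothetical entry point: read a .dct file and write the serialized .mem form
                public void convert(String dctPath, String memPath) throws IOException {
                    // ... load the .dct file and serialize the arrays to the .mem file ...
                    // No catch block here: an unreadable file surfaces as an IOException at the call site.
                }
            }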

        Mingfai Ma added a comment - - edited

        Hi Xiaoping,

        I'm interested in getting the Chinese analyzer to work for Traditional Chinese (UTF-8/Big5). I just wonder whether your coredict.dct comes from ICTCLAS (http://ictclas.org/Down_share.html)? If yes, is it 2009 or 2008?

        ICTCLAS has a Traditional Chinese edition for its 2008 release, but the distribution is not in .dct. I wonder if we have a simple specification for .dct, so I could find a way to convert ICTCLAS's lexical dictionary to the .dct format to work with your library?

        Xiaoping Gao added a comment -

        Hello Mingfai!

        coredict.mem is converted from coredict.dct, which comes from ICTCLAS 1.0,
        neither 2008 nor 2009.
        The author authorized me to release just the lexical dictionary from
        ICTCLAS 1.0 under APLv2, but he didn't authorize the dictionaries of
        ICTCLAS 2008~2009.
        As far as I know, coredict.dct just contains GB2312 characters, so it cannot
        support Big5.

        I think we should find a proper Big5 dictionary first, then I will help
        you convert it to a .dct file.

        On May 15, 2009 6:20pm, "Mingfai Ma (JIRA)" <jira@apache.org> wrote:

        Robert Muir added a comment -

        If you acquire the Big5 resources, do you think it would be possible to create a single dictionary that works with both Simplified & Traditional?

        (i.e. merge the Big5 resources with the GB resources)

        The reason I say this is that the existing Chinese analyzers, although they tokenize in a less intelligent way, are agnostic to Simplified/Traditional issues...

        Robert Muir added a comment -

        Another potential issue with Big5 I want to point out is that many of the Big5 character sets, such as HKSCS, have characters that are mapped into regions of Unicode outside the BMP.

        Just glancing at the code, some things will need to be modified for this to work correctly with surrogate pairs; various functions that take a char will need to take a codepoint (int), etc.
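
        For illustration, a minimal standalone sketch of the char vs. codepoint difference (not taken from the analyzer; note that codePointAt/charCount require Java 5, which matters given the Java 1.4 discussion above):

            // Iterate code points instead of chars, so characters outside the BMP
            // (stored as a surrogate pair of two chars) are handled as one unit.
            public class CodePointIteration {
                public static void main(String[] args) {
                    String s = "\uD87E\uDC1A"; // one supplementary character, two chars
                    for (int i = 0; i < s.length(); ) {
                        int cp = s.codePointAt(i);        // full code point, possibly > 0xFFFF
                        System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
                        i += Character.charCount(cp);     // advance by 1 or 2 chars
                    }
                }
            }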

        Mingfai Ma added a comment -

        Could we use CC-CEDICT's dictionary instead? It uses the Creative Commons Attribution-Share Alike 3.0 license:

        http://www.mdbg.net/chindict/chindict.php?page=cc-cedict

        Koji Sekiguchi added a comment -

        Just an FYI. There has been work on mapping between Simplified and Traditional Chinese characters in Solr 1.4 (but you need to define mapping rules in mapping.txt).
        See SOLR-822 and the attached JPG for a Chinese mapping sample.
        I opened LUCENE-1466 for Lucene.

        Robert Muir added a comment -

        Koji, have you considered using ICU transforms for this behavior?
        Not only is the rule-based language very nice (you can define variables, use context, etc.), but many transformations such as "Traditional-Simplified" are already defined.

        http://userguide.icu-project.org/transforms/general

        Koji Sekiguchi added a comment -

        koji, have you considered using icu transforms for this behavior?

        Not yet.

        Not only is the rule-based language very nice (you can define variables, use context, etc), but many transformations such as "Traditional-Simplified" are already defined.

        Right. The CharFilter framework that I introduced in SOLR-822 is not only for rule-based mapping; it can also use an existing library like ICU to transform/normalize characters.
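
        For reference, a small sketch of what an ICU-based transform looks like in Java with ICU4J's Transliterator (illustrative usage only, not code from SOLR-822):

            // Convert Traditional Chinese to Simplified Chinese with ICU4J's built-in transform.
            import com.ibm.icu.text.Transliterator;

            public class TraditionalToSimplified {
                public static void main(String[] args) {
                    Transliterator t = Transliterator.getInstance("Traditional-Simplified");
                    System.out.println(t.transliterate("漢語")); // prints 汉语
                }
            }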

        Mingfai Ma added a comment -

        I'm not sure if the character mapping is a feasible approach. The CC-CEDICT dictionary already has terms in both Simplified and Traditional Chinese, and we would not need to define any rules. How much effort would it take to define all mapping rules for the Simplified and Traditional Chinese thesaurus?

        Besides, Simplified and Traditional Chinese conversion is not as simple as mapping the characters. For the same meaning, different words may be used in SC and TC.

        If the approach accepted in this issue is OK, I just need to figure out how to convert CC-CEDICT to the .dct/.mem format, and I suppose that is doable.

        Xiaoping Gao added a comment -

        The dictionary is loaded into 2 classes:
        BigramDictionary.java
        WordDictionary.java

        so you can read the loading section to get the .dct format.

        The .mem files are just arrays that have been serialized to object files; you can get the format from the code and comments.
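
        As an illustration of "serialized arrays", deserializing such a file could look roughly like the sketch below; the resource name comes from the attachments above, but the surrounding class and the element type are assumptions, not the actual WordDictionary code:

            // Hypothetical sketch: read a serialized dictionary object back from a .mem
            // resource on the classpath.
            import java.io.InputStream;
            import java.io.ObjectInputStream;

            public class MemFileSketch {
                public static void main(String[] args) throws Exception {
                    InputStream in = MemFileSketch.class.getResourceAsStream("coredict.mem");
                    ObjectInputStream ois = new ObjectInputStream(in);
                    Object dict = ois.readObject(); // in the analyzer these are arrays; exact type assumed
                    ois.close();
                    System.out.println("Loaded dictionary of type " + dict.getClass().getName());
                }
            }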

        Otis Gospodnetic added a comment -

        I just got to look at this code and I only scanned it quickly. Is all of the code really Chinese-specific?
        Would any of it be applicable to other languages, say Japanese or Korean (assuming we have dictionaries in a suitable format)?

        Xiaoping Gao added a comment -

        I think the Hidden Markov Model algorithm is applicable,
        but I suspect that Japanese and Korean may need some language-specific
        analysis, such as "片假名かたかな.平假名ひらがな" (katakana and hiragana) in Japanese; as far as I know, the same
        word can have different spellings. This may be a problem in the application?

        Otis Gospodnetic added a comment -

        Hm, my Japanese is a little weak, so I'm not sure what exactly that means and what exactly
        different spelling means in this context...

        Google says "片假名かたかな.平假名ひらがな" means "How do假名piece. Hiragana假名flat"

        Which word in the above Japanese text is the same as another word, yet with a different spelling? Is this a question of synonyms,
        such as "auto", "automobile", and "car", and even "sedan" in English?

        Robert Muir added a comment -

        Otis, if you are interested in Japanese/Korean you might find this link interesting:

        http://bugs.icu-project.org/trac/ticket/2229

        It is similar to the Thai approach (in contrib) but with log probabilities.

        Mingfai Ma added a comment -

        re. 平假名 and 片假名 in Japanese:
        http://en.wikipedia.org/wiki/Kana
        http://en.wikipedia.org/wiki/Hiragana
        http://en.wikipedia.org/wiki/Katakana
        Hide
        KuroSaka TeruHiko added a comment -

        WordTokenizer extends Tokenizer, but its constructor takes a TokenStream rather than a Reader.
        Shouldn't WordTokenizer rather extend TokenFilter, and if so, shouldn't it be named WordTokenFilter?
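
        For context, a minimal sketch of the distinction in the old analysis API (signatures simplified; the filter body is hypothetical): a Tokenizer's source is a Reader, while a TokenFilter wraps another TokenStream, so a class whose constructor takes a TokenStream belongs on the TokenFilter side.

            // Hypothetical sketch of the TokenFilter shape suggested above.
            import java.io.IOException;
            import org.apache.lucene.analysis.Token;
            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;

            public final class WordTokenFilterSketch extends TokenFilter {
                public WordTokenFilterSketch(TokenStream in) {
                    super(in); // TokenFilter keeps the wrapped stream in the protected field "input"
                }

                public Token next() throws IOException {
                    // pass-through placeholder; the real filter would regroup sentence tokens into words
                    return input.next();
                }
            }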

        Robert Muir added a comment -

        Shouldn't WordTokenizer rather extend TokenFilter, and if so, shouldn't it be named WordTokenFilter?

        Yes, you are correct. See LUCENE-1728, where we have proposed correcting this.


          People

          • Assignee: Michael McCandless
          • Reporter: Xiaoping Gao
          • Votes: 0
          • Watchers: 6
