SOLR-1336: Add support for lucene's SmartChineseAnalyzer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      SmartChineseAnalyzer was contributed to Lucene; it indexes simplified Chinese text as words.

      If the factories for the tokenizer and word token filter are added to Solr, it can be used, although there should be a sample config or wiki entry showing how to apply the built-in stopwords list.
      This is because the list doesn't contain actual stopwords, but must be used to prevent indexing punctuation...

      Note: we did some refactoring/cleanup on this analyzer recently, so it would be much easier to do this after the next Lucene update.
      It has also been moved out of -analyzers.jar due to size, and now builds into its own smartcn jar file, which would need to be added if this feature is desired.
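A minimal sketch of what such a sample config could look like. The factory class names and `fieldType` name here are assumptions based on the analysis-extras contrib this issue eventually produced; verify the exact names against your Solr release. The classpath-relative stopwords path is the one discussed in the comments below.

```xml
<!-- Hypothetical schema.xml fieldType sketch; assumes the lucene smartcn jar
     and the analysis-extras factories are on the classpath. -->
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <!-- The built-in list is mostly punctuation, not real stopwords, so it
         should be applied; it is loaded directly from the smartcn jar. -->
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/stopwords.txt"/>
  </analyzer>
</fieldType>
```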

      1. SOLR-1336.patch
        45 kB
        Robert Muir
      2. SOLR-1336.patch
        45 kB
        Robert Muir
      3. SOLR-1336.patch
        4 kB
        Robert Muir

        Activity

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release

        Robert Muir added a comment -

        I committed just the factories and a simple test to the analysis-extras contrib.

        Committed revision 1030073, 1030076 (3x)

        Robert Muir added a comment -

        I don't think jar file size should prevent us from adding support for all the analyzers we have.

        This comes with the territory for CJK. Individuals interested in "optimizing" size can help
        with LUCENE-2510, but I don't think that should block integrating all our analyzers, nor should
        they all have to wait till 4.0.

        Robert Muir added a comment -

        Yonik, maybe it would be better to wait until these things settle out first? (I glanced at the issues and saw -1, +1, and such.)

        I guess there is always the option for release 1.4 to do nothing, and instruct users who want to use this analyzer to put lucene-smartcn-2.9.jar in their lib and use analyzer= (they will be stuck with Porter stemming and such for now, though).

        Yonik Seeley added a comment -

        In theory perhaps, but one problem is that example/solr/lib isn't even in svn... nothing lives there; things are copied there (currently).
        There have been a lot of discussions on solr-dev lately about where the Tika libs should live, etc...
        http://search.lucidimagination.com/search/document/a9520632864db021/distinct_example_for_solr_cell
        And SOLR-1449 is also in the mix as a way to reference jars outside of the example lib.

        Robert Muir added a comment -

        Perhaps we could make them lazy load? Token streams are reused now, so a small reflection overhead is no longer an issue.

        If we do this, then we could avoid a contrib that is really just a jar file, and instead the jar file could just go in example/solr/lib?
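The lazy-load idea being discussed could be sketched as a factory that resolves the analyzer class by reflection only on first use, so merely shipping the factory costs nothing until a schema actually references it. This is a hypothetical, self-contained sketch; the `LazyFactory` name and the double-checked-locking shape are illustrative assumptions, not Solr's actual plugin-loading code.

```java
import java.util.function.Supplier;

// Hypothetical sketch: defer loading a heavyweight analyzer class until the
// first call. LazyFactory is an illustrative name, not a Solr class.
class LazyFactory<T> implements Supplier<T> {
    private final String className;
    private volatile T instance;

    LazyFactory(String className) {
        this.className = className;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    try {
                        // The reflection cost is paid exactly once; since token
                        // streams are reused, it is not a per-document cost.
                        local = (T) Class.forName(className)
                                .getDeclaredConstructor()
                                .newInstance();
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException(
                                "class not on classpath: " + className, e);
                    }
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

With a shape like this, the large smartcn jar only needs to be present when the dictionaries are actually used.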

        Yonik Seeley added a comment -

        I guess it should go into contrib for now...

        where should I put the factories?

        It would be nice if we could avoid another jar just for 2 small classes.
        Perhaps we could make them lazy load? Token streams are reused now, so a small reflection overhead is no longer an issue.

        Robert Muir added a comment -

        Thanks, so do we want a contrib (which would mostly just be the jar file + the 2 factories) or should it go in example/solr/lib?

        If we do the latter, where should I put the factories? These could be useful if someone wants the Chinese analysis to work a little differently;
        for example, SmartChineseAnalyzer does Porter stemming on English, but someone might not want that.

        Stanislaw Osinski added a comment -

        Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it, I'd need to make sure the clustering contrib doesn't fail when the JAR is not there and clustering in Chinese is requested (I think I'd simply log a WARN saying that the Chinese analyzer JAR is required for best clustering results).

        Yonik Seeley added a comment -

        I agree it would be an awkward thing to have inside solr.war.
        Should we copy it to example/solr/lib like the Tika libs (we already have 32MB of jars there)?

        Robert Muir added a comment -

        contrib?

        Sounds reasonable to me. In a few days I can upload a new patch.

        Hoss Man added a comment -

        The downside is a 3MB jar in solr/lib and in the solr.war

        contrib?

        Chinese isn't something everybody needs, and 3MB would almost double the size of the solr.war.

        Yonik Seeley added a comment -

        I was going to check this out, but Lucene 2.9_RC3 doesn't work with Solr - need to wait for RC4.

        Any objections to committing this for 1.4 and adding it to the example server, provided we can verify that there isn't a memory cost if it's not used? The downside is a 3MB jar in solr/lib and in the solr.war

        Robert Muir added a comment -

        We moved some parts of this analyzer around in LUCENE-1882.

        This syncs the patch up with Lucene trunk (not rc2, as rc2 does not reflect LUCENE-1882).

        Robert Muir added a comment -

        Kumar, by the way, I wanted to mention if by any chance you feel inclined to help us improve this analyzer, please don't hesitate!

        There is so much work to do: dictionary format, code refactoring, better unicode support, among other things.
        Even if you don't want to write any code but have good Chinese & English skills, there are still some javadocs in Chinese that haven't been translated.

        Robert Muir added a comment -

        Can this be customized to accommodate those languages?

        Maybe, but we have to do some work first. The dictionary is limited to GB2312 encoding, so we can't add support for new languages until this is fixed.

        Is there any wiki link or document to help us understand how this tool works? Sort of behind the scenes....

        There are some sparse javadocs and code comments. Also see the original jira ticket: LUCENE-1629

        What exactly does the dictionary contain? Is it an ordinary Chinese dictionary or some sort of customized/lemmatized dictionary?

        There are two dictionaries: a word dictionary and a bigram dictionary.
        These dictionaries contain words and bigrams respectively, along with frequencies, in a "trie"-like structure organized by Chinese character.

        Also, how can one add new words to the dictionary?

        This is currently really difficult. Please see LUCENE-1817 for some background information.
        For the moment you will have to recompile your own custom jar file, and be familiar with the file formats the analyzer uses.
        Note, we put strong warnings in place because we would like to change the file formats in an upcoming release to something based on Unicode.
        That way, we can support more languages, and perhaps also make it easier to customize the dictionary data.

        Kumar Raja added a comment -

        Since this feature works so well, I think it can easily be shipped along with Solr 1.4.
        When is this going to be committed to the Solr build?

        Kumar Raja added a comment -

        Hi Robert,
        Sorry... my bad. There was a mix-up of the Solr versions on my machine, which caused this error.

        This tool is great. It works wonderfully, and the test case pass rate is amazing! Is there a similar tool for other Asian languages, say Japanese and Korean? Can this be customized to accommodate those languages?

        Is there any wiki link or document to help us understand how this tool works? Sort of behind the scenes.... What exactly does the dictionary contain? Is it an ordinary Chinese dictionary or some sort of customized/lemmatized dictionary? Also, how can one add new words to the dictionary?

        Thanks,
        Kumar

        Robert Muir added a comment -

        Hi, thanks for testing!

        First, I am having trouble figuring out what is going on here, since it looks like the stack trace is unrelated to the smart chinese analyzer.
        It's a little more difficult since I am looking at the latest Solr code, and my TokenizerChain:64 is not tokenStream()!

        Due to the exception you are getting, I suspect something is out of date... maybe it's as simple as 'ant clean' and a recompile?

        Kumar Raja added a comment -

        I applied the patch with the latest Solr code and the lucene-rc2 jars and tried indexing some Chinese text. However, I got an AbstractMethodError during tokenization.
        What am I doing wrong here?

        The stack trace:

        SEVERE: java.lang.AbstractMethodError
                at org.apache.solr.analysis.TokenizerChain.tokenStream(TokenizerChain.java:64)
                at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.tokenStream(IndexSchema.java:360)
                at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:44)
                at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:123)
                at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
                at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
                at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
                at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:745)
                at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2199)
                at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2171)
                at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:218)
                at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
                at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
                at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        
        Robert Muir added a comment -

        Add a warning about large dictionaries, note that stopwords are being loaded from the jar file, and add an international.xml with examples for several languages.

        Robert Muir added a comment -

        Are the stopwords (words="org/apache/lucene/analysis/cn/stopwords.txt") being loaded directly from the jar? If so, a comment to that effect might prevent some confusion.

        Yes, good idea.

        Do you happen to know what the memory footprint of this analyzer is if it's used? I assume the dictionaries will get loaded on the first use.

        No, I am not sure of the footprint, but it is probably quite large (a few MB). They will be loaded on first use, correct. Also, the smartcn jar file itself is large due to the dictionaries in question. So, you may have noticed solr.war is much smaller after the last Lucene update, since this analyzer was factored out of analyzers.jar.

        Might be cool to add a chinese field to example/exampledocs/solr.xml... or maybe there should be an international.xml doc where we could add a few different languages?

        I figured this wasn't the best place to have an example... I like the idea of international.xml, with some examples for other languages too.

        If there is some concern about the size of this (monster) analyzer, one option is to put these factories/examples elsewhere, to keep the size of Solr smaller.

        Yonik Seeley added a comment -

        Thanks Robert!
        Are the stopwords (words="org/apache/lucene/analysis/cn/stopwords.txt") being loaded directly from the jar? If so, a comment to that effect might prevent some confusion.

        Do you happen to know what the memory footprint of this analyzer is if it's used? I assume the dictionaries will get loaded on the first use.

        Might be cool to add a chinese field to example/exampledocs/solr.xml... or maybe there should be an international.xml doc where we could add a few different languages?

        Robert Muir added a comment -

        Patch; needs lucene-smartcn-2.9-dev.jar added to lib to work (this analyzer is no longer in the -analyzers.jar).


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir
          • Votes: 2
          • Watchers: 2
