SOLR-319

changes SynonymFilterFactory to "Analyze" synonyms file

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      WHAT:
      SynonymFilterFactory currently works well with N-gram tokenizers (CJKTokenizer, for example),
      but the rules in synonyms.txt have to be written with the tokenizer's output in mind.
      For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for CJK characters) and want C1C2C3 to map to C4C5C6,
      I have to write the rule as follows:

      C1C2 C2C3 => C4C5 C5C6

      But I want to write it as "C1C2C3=>C4C5C6". This patch allows that. It also makes synonyms.txt easier to share.
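
The hand-written rule above is simply the bi-gram expansion of each side of the short rule. A minimal sketch of that expansion, assuming a plain overlapping bi-gram splitter as a stand-in for CJKTokenizer (the class and method names are hypothetical and not taken from the patch):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not code from SOLR-319.patch. */
final class BigramRuleSketch {

    /** Splits a term into overlapping pairs of code points, e.g. "ABC" -> [AB, BC]. */
    static List<String> bigrams(String term) {
        int[] cps = term.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            out.add(term); // a lone character is kept as a single token
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    /** Rewrites "lhs=>rhs" so that both sides become space-separated bi-grams. */
    static String expandRule(String rule) {
        String[] sides = rule.split("=>", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("expected 'lhs=>rhs': " + rule);
        }
        return String.join(" ", bigrams(sides[0].trim()))
                + " => "
                + String.join(" ", bigrams(sides[1].trim()));
    }
}
```

With "ABC" and "DEF" standing in for C1C2C3 and C4C5C6, BigramRuleSketch.expandRule("ABC=>DEF") yields "AB BC => DE EF", which is the form that currently has to be typed by hand.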

      HOW:
      A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>.
      If the attribute is specified, SynonymFilterFactory uses the named TokenizerFactory to create a Tokenizer,
      and then uses that Tokenizer to tokenize the rules in the synonyms.txt file.
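
The idea can be sketched without any Solr classes: the rule parser takes the tokenizer as a parameter and falls back to whitespace splitting when none is configured. This is a simplified illustration under stated assumptions, not the patch itself. The tokenizer is modelled as a plain function, all names are hypothetical, and only the "lhs => rhs" form is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative only: models a tokenizer as a function from text to tokens. */
final class SynonymRuleParserSketch {

    /** Fallback used when no tokenizer is configured: split on whitespace. */
    static final Function<String, List<String>> WHITESPACE =
            text -> Arrays.asList(text.trim().split("\\s+"));

    /**
     * Parses "lhs => rhs" into two token lists (index 0 = lhs, index 1 = rhs).
     * A null tokenizer means "not configured" and selects the whitespace fallback.
     */
    static List<List<String>> parseRule(String rule, Function<String, List<String>> tokenizer) {
        Function<String, List<String>> tok = (tokenizer != null) ? tokenizer : WHITESPACE;
        String[] sides = rule.split("=>", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("expected 'lhs => rhs': " + rule);
        }
        List<List<String>> parsed = new ArrayList<>();
        parsed.add(tok.apply(sides[0].trim()));
        parsed.add(tok.apply(sides[1].trim()));
        return parsed;
    }
}
```

Calling parseRule("a b => c", null) gives [[a, b], [c]], while passing a different tokenizer function changes how each side is split without touching the rule file.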

      sample-1: CJKTokenizer

      <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.CJKTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt"
                  ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.CJKTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldtype>

      sample-2: NGramTokenizer

      <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
          <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt"
                  ignoreCase="true" expand="true"
                  tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldtype>

      backward compatibility:
      Yes. If you omit the tokenFactory attribute from the <filter class="solr.SynonymFilterFactory"/> tag, the filter works as before.

      1. SOLR-319.patch
        15 kB
        Koji Sekiguchi
      2. SOLR-319.patch
        15 kB
        Koji Sekiguchi
      3. SOLR-319.patch
        17 kB
        Koji Sekiguchi

        Activity

        Otis Gospodnetic added a comment -

        Btw., I noticed this functionality is really pretty well hidden! It may be good to at least add it to:

        • one of the example schema.xml files (commented out)
        • http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

        Thoughts?

        Koji Sekiguchi added a comment -

        All tests pass.
        Committed revision 657829.

        Koji Sekiguchi added a comment -

        Thanks, Otis. I will commit this in a week if there is no objection.
        BTW, I cannot assign myself on JIRA... looks like I have no permission?

        Otis Gospodnetic added a comment - edited

        I think this patch is ripe for a commit. Koji, want to commit your own baby?

        Yonik Seeley added a comment - edited

        Perhaps it should be a full analyzer rather than just a tokenizer?
        In the query elevation (query boosting) component, a field type can be specified, and the analyzer from that is used.

        Oh, I just saw that Hoss already brought up that idea long ago...

        Koji Sekiguchi added a comment -

        updated for current trunk (SOLR-466).

        Koji Sekiguchi added a comment -

        Thank you for your comment, Hoss. The latest attached patch uses a tokenizer factory to get the tokenizer.
        There seem to be no objections. Any chance of getting this into svn soon? A few users in Japan use this feature (on Solr 1.2, though) and it works perfectly. We would like to share the feature with CJK users, hopefully out of the box.

        Hoss Man added a comment -

        FWIW: after rereading my earlier comments, i think the best thing to do (for now at least) is to go with the simplest approach that achieves the goal: do what was done in the original patch, and just refer to the tokenizer factory class directly (which can be instantiated using the ResourceLoader) instead of referring to a fieldType name like i suggested.

        (see also SOLR-414)
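
Referring to the factory class directly means the filter only needs a class name it can instantiate while the schema is still loading. The sketch below shows the general mechanism with plain Java reflection. It is an assumption-laden stand-in: it does not use Solr's ResourceLoader, and SimpleTokenizerFactory, WhitespaceFactory and newFactory are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: plain reflection standing in for a resource loader. */
final class FactoryLoaderSketch {

    /** Hypothetical minimal factory contract used by this sketch. */
    interface SimpleTokenizerFactory {
        List<String> tokenize(String text);
    }

    /** Hypothetical implementation that splits on whitespace. */
    static final class WhitespaceFactory implements SimpleTokenizerFactory {
        public WhitespaceFactory() {}

        @Override
        public List<String> tokenize(String text) {
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    /** Instantiates a factory from its class name via a no-argument constructor. */
    static SimpleTokenizerFactory newFactory(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.asSubclass(SimpleTokenizerFactory.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + className, e);
        }
    }
}
```

Because only a class name is needed, nothing here depends on field types having been registered, which is the property that makes this approach simpler than looking up a fieldType by name.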

        Koji Sekiguchi added a comment -

        updated for current trunk (r597847).

        Koji Sekiguchi added a comment -

        Hoss, Yonik, thank you very much for your opinions.

        For now, SynonymFilterFactory implicitly uses WhitespaceTokenizer when analyzing the synonyms.txt file.
        This works well for English and other European languages, which use spaces to separate words.
        But from the standpoint of CJK users, we would like to replace the implicit tokenizer with an arbitrary tokenizer.
        I thought that specifying a fieldtype to analyze synonyms.txt was a cool idea,
        but it was difficult because IndexSchema hasn't been initialized at that time.

        For CJK users, replacing the tokenizer is enough for their purpose, and a fieldtype is more than they need...

        Hoss Man added a comment -

        > I was trying to come up with realistic examples, but the only useful ones I could think of involve tokenization...

        maybe, but with things like the PatternReplaceTokenFilter "tokenization" sometimes happens after the Tokenizer is done.

        Yonik Seeley added a comment -

        > there may be other things happening in your analysis chain (besides just tokenization)

        I was trying to come up with realistic examples, but the only useful ones I could think of involve tokenization...
        Example: if you have a.b => c.d and switch from a whitespace tokenizer to a letter tokenizer.
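
The effect of that switch can be seen in a few lines. This is a rough sketch, assuming a whitespace splitter and a "runs of letters" splitter as approximations of the two tokenizers, with hypothetical names throughout:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: approximations of a whitespace and a letter tokenizer. */
final class TokenizerSwitchSketch {

    /** Splits on runs of whitespace. */
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Emits maximal runs of letters and drops everything else. */
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                run.append(c);
            } else if (run.length() > 0) {
                out.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }
}
```

A rule key parsed with whitespaceTokens("a.b") is the single token "a.b", but a field analyzed with letterTokens("a.b") produces "a" and "b", so the rule never matches unless the rule file is tokenized the same way as the field.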

        Hoss Man added a comment -

        > You can specify ignoreCase="true"

        doh! ... right, bad example ... but the spirit of my point is still true: there may be other things happening in your analysis chain (besides just tokenization) that you'd like to have happen to your synonyms as well, so that you:
        1) can reuse the synonyms file in multiple field types
        2) don't need to change your synonyms file just because you changed your analysis configuration.

        Yonik Seeley added a comment -

        > ie: if LowerCaseFilterFactory comes before SynonymFilterFactory, then all synonyms must be lowercased in your file.

        You can specify ignoreCase="true"

        Hoss Man added a comment -

        I haven't thought it out all the way, but it should be possible. we only have to remember the name of the fieldtype in SynonymFilterFactory.init ... then in the create method we can call schema.getFieldTypes().get(fieldtypename).

        Hmmm... except we probably don't have any access to the schema at that point do we?

        Hmmm.... i'm not sure what the best way to do this would be. we could just go get the schema from the SolrCore – except we're moving away from it being a singleton and we don't have direct access to it either.

        anyone have any other suggestions?

        Koji Sekiguchi added a comment -

        I think I cannot implement the "fieldtype version" of this issue, because when Solr is initializing SynonymFilterFactory, it is still in the IndexSchema initialization step. Am I wrong?

        Koji Sekiguchi added a comment -

        Absolutely. I'll try to change my patch to implement the fieldtype idea. Thank you.

        Hoss Man added a comment -

        I've revised the summary line of this bug because it was a little confusing to me ... the issue isn't really specific to n-gram based tokenizers, as you point out this is a general issue that currently when constructing the synonyms file you have to be very aware of the analysis chain of your fieldtype – ie: if LowerCaseFilterFactory comes before SynonymFilterFactory, then all synonyms must be lowercased in your file.

        The notion of specifying a TokenizerFactory as a property of the SynonymFilterFactory that tells it how to parse the synonyms.txt file is pretty clever, and would solve the CJKTokenizer problem you describe, but i don't think it really goes far enough – consider the lowercase example. it would be good if you could have a synonyms file that contained proper names, and have it do the right thing when used in lower cased fields as well as exact case fields.

        to extend the tokenizer idea – what if you could specify the name of a fieldtype, and the entire Analyzer for that fieldtype would be used to parse the individual synonym records? this should simplify the patch a bit (since you don't have to worry about initializing any factories, the schema will take care of it for you) and make it a lot more powerful.
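
The difference between "tokenizer only" and "entire Analyzer" can be sketched by running each synonym entry through a tokenizer followed by a list of per-token filters. This is a simplified model under stated assumptions: real analysis chains can also add, drop or split tokens, and every name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Illustrative only: an analysis chain modelled as tokenizer + per-token filters. */
final class AnalyzerChainSketch {

    static final Function<String, List<String>> WHITESPACE =
            text -> Arrays.asList(text.trim().split("\\s+"));

    static final UnaryOperator<String> LOWERCASE =
            token -> token.toLowerCase(Locale.ROOT);

    /** Tokenizes the text, then applies every filter to each token in order. */
    static List<String> analyze(String text,
                                Function<String, List<String>> tokenizer,
                                List<UnaryOperator<String>> filters) {
        List<String> out = new ArrayList<>();
        for (String token : tokenizer.apply(text)) {
            String current = token;
            for (UnaryOperator<String> filter : filters) {
                current = filter.apply(current);
            }
            out.add(current);
        }
        return out;
    }
}
```

With the same entry "New York", a lower-cased field would parse it through a chain containing LOWERCASE and get [new, york], while an exact-case field would use an empty filter list and get [New, York]. One synonyms file containing proper names could then serve both fields.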

        Koji Sekiguchi added a comment -

        1. updated for current trunk (r575145)
        2. eliminated the testCJKTokenizer() test method, which included Japanese chars to test CJK bi-gram

        Comments are welcome.

        Koji Sekiguchi added a comment -

        In addition, this is useful for non-N-gram tokenizers for CJK users. For example, we use SenTokenizer, which is a popular morphological analyzer in Japan. It uses a Japanese dictionary to determine morpheme boundaries.

        If I have the following definition in schema.xml:

        <tokenizer class="solr.SenTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        and I want a mapping rule "C1C2C3=>C4C5". I'm sure "C1C2C3" is a word, so I write the rule in synonyms.txt as follows:

        C1C2C3=>C4C5

        However, if "C1C2C3" isn't in SenTokenizer's dictionary but "C1C2" and "C3" are, SenTokenizer will output "C1C2" and "C3". In this case, the above rule doesn't work.

        The patch solves this problem. In addition, it encourages sharing a synonyms.txt file between N-gram and morphological tokenizers.
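
The dictionary effect described above can be reproduced with a toy segmenter. This sketch is only an illustration: it uses greedy longest-match against a small word set, which is an assumption made for brevity and not how SenTokenizer actually analyzes text, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: greedy longest-match segmentation over a word set. */
final class DictionarySegmenterSketch {

    /**
     * Cuts the text into the longest dictionary words it can find, left to right.
     * Characters not covered by any word are emitted one at a time.
     */
    static List<String> segment(String text, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos + 1; // fall back to a single character
            for (int candidate = text.length(); candidate > pos; candidate--) {
                if (dictionary.contains(text.substring(pos, candidate))) {
                    end = candidate;
                    break;
                }
            }
            out.add(text.substring(pos, end));
            pos = end;
        }
        return out;
    }
}
```

With "ABC" standing in for C1C2C3, a dictionary holding only "AB" and "C" segments the text as [AB, C], so a rule keyed on the single token "ABC" cannot match. Tokenizing the rule with the same segmenter produces the key [AB, C], which does.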

        Koji Sekiguchi added a comment -

        The patch includes TestSynonymMap. To test SynonymMap, I removed the "private" modifier from the parseRules() method.
        This patch includes CJKTokenizerFactory, too.


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi
          • Votes:
            2
            Watchers:
            2
