Solr / SOLR-813

Add new DoubleMetaphone Filter and Factory

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: search
    • Labels:
      None

      Description

      The existing PhoneticFilter allows for use of the DoubleMetaphone encoder. However, it doesn't expose the maxCodeLength() setting, and it ignores the alternate encodings that the encoder provides for some words. This new filter is not as generic as the PhoneticFilter, but allows more detailed control over the DoubleMetaphone encoder.

      1. SOLR-813.patch
        14 kB
        Ryan McKinley
      2. SOLR-813.patch
        14 kB
        Ryan McKinley
      3. SOLR-813.patch
        9 kB
        Todd Feak

        Activity

        Todd Feak added a comment -

        Added a patch containing the Filter, the Factory, and unit tests for both.

        Ryan McKinley added a comment -

        Rather than create a new Filter for DoubleMetaphone, why not just extend PhoneticFilter to support maxCodeLength?

        Here is a quick, untested bit that uses reflection to set the maxCodeLength. The advantage is that it would also work for 'Metaphone' (though I'm not sure anyone uses that).

        Since the reflection only happens once at startup, it is not a big deal.

        Index: src/java/org/apache/solr/analysis/PhoneticFilterFactory.java
        ===================================================================
        --- src/java/org/apache/solr/analysis/PhoneticFilterFactory.java        (revision 704289)
        +++ src/java/org/apache/solr/analysis/PhoneticFilterFactory.java        (working copy)
        @@ -17,10 +17,10 @@
         
         package org.apache.solr.analysis;
         
        +import java.lang.reflect.Method;
         import java.util.HashMap;
         import java.util.Map;
         
        -import org.apache.solr.core.SolrConfig;
         import org.apache.commons.codec.Encoder;
         import org.apache.commons.codec.language.DoubleMetaphone;
         import org.apache.commons.codec.language.Metaphone;
        @@ -80,6 +80,13 @@
             
             try {
               encoder = clazz.newInstance();
        +      
        +      // Try to set the maxCodeLength
        +      String v = args.get( "maxCodeLength" );
        +      if( v != null ) {
        +        Method setter = encoder.getClass().getMethod( "setMaxCodeLength", Integer.class );
        +        setter.invoke( encoder, Integer.parseInt( v ) );
        +      }
             } 
             catch (Exception e) {
               throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, "Error initializing: "+name + "/"+clazz, e );
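One detail worth checking with a reflective lookup of this kind: `Class.getMethod` matches parameter types exactly, so a setter declared with a primitive `int` is found with `int.class` and not with `Integer.class`. As far as I know, the Commons Codec encoders name this setter `setMaxCodeLen` and take a primitive `int`, so both the name and the parameter type should be verified against the codec version in use. The sketch below illustrates the lookup behaviour only; `StubEncoder` and `trySet` are hypothetical names made up for the example and are not part of Solr or Commons Codec.

```java
import java.lang.reflect.Method;

class ReflectiveSetterSketch {

    // Hypothetical stand-in for an encoder; NOT the Commons Codec class.
    public static class StubEncoder {
        private int maxCodeLen = 4;

        // Declared with a primitive int parameter, which is what matters below.
        public void setMaxCodeLen(int maxCodeLen) {
            this.maxCodeLen = maxCodeLen;
        }

        public int getMaxCodeLen() {
            return maxCodeLen;
        }
    }

    // Looks up a public one-argument setter by name and exact parameter type,
    // then invokes it. Returns false when no such method exists.
    static boolean trySet(Object target, String setterName, Class<?> paramType, int value) {
        try {
            Method setter = target.getClass().getMethod(setterName, paramType);
            // The int is boxed for the call; invoke() unboxes it for an int parameter.
            setter.invoke(target, value);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Setter failed: " + setterName, e);
        }
    }

    public static void main(String[] args) {
        StubEncoder encoder = new StubEncoder();
        // Integer.class does not match a setter declared as setMaxCodeLen(int).
        System.out.println(trySet(encoder, "setMaxCodeLen", Integer.class, 6)); // false
        // int.class matches, so the value is applied.
        System.out.println(trySet(encoder, "setMaxCodeLen", int.class, 6));     // true
        System.out.println(encoder.getMaxCodeLen());                            // 6
    }
}
```

Returning `false` on a missing method, rather than throwing, would also let a factory ignore the option for encoders that have no such setter; whether that is preferable to failing at startup is a design choice.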
        
        
        Todd Feak added a comment -

        Other DoubleMetaphone-specific calls are made in the filter. Specifically, it checks whether there is an alternate encoding for the token and, if that differs from the default encoding, adds it to the stream as well. This is part of the strength of the DoubleMetaphone implementation that the Phonetic filter doesn't leverage.

        An additional change I made, which could also be done in the Phonetic filter, is blocking the empty tokens that are generated for non-alpha strings. This patch just wraps all of that up.

        Ryan McKinley added a comment -

        Here is an update that addresses three concerns:
        1. Position increments: this keeps the tokens in sync with the input.
        2. The previous version would stop processing after a number. That is, "aaa 1234 bbb" would not process "bbb".
        3. Token types: this changes the type to "DoubleMetaphone" rather than "ALPHANUM".

        Here is the key part:

              boolean isPhonetic = false;
              String v = new String(t.termBuffer(), 0, t.termLength());
              String primaryPhoneticValue = encoder.doubleMetaphone(v);
              if (primaryPhoneticValue.length() > 0) {
                Token token = (Token) t.clone();
                if( inject ) {
                  token.setPositionIncrement( 0 );
                }
                token.setType( TOKEN_TYPE );
                token.setTermBuffer(primaryPhoneticValue);
                remainingTokens.addLast(token);
                isPhonetic = true;
              }
        
              String alternatePhoneticValue = encoder.doubleMetaphone(v, true);
              if (alternatePhoneticValue.length() > 0
                  && !primaryPhoneticValue.equals(alternatePhoneticValue)) {
                Token token = (Token) t.clone();
                token.setPositionIncrement( 0 );
                token.setType( TOKEN_TYPE );
                token.setTermBuffer(alternatePhoneticValue);
                remainingTokens.addLast(token);
                isPhonetic = true;
              }
              
              // If we did not add something, then go to the next one...
              if( !isPhonetic ) {
                t = next(in);
                t.setPositionIncrement( t.getPositionIncrement()+1 ); 
                return t;
              }
        
        Ryan McKinley added a comment -

        Oops, the last patch had a bug when the stream ended in a non-phonetic value (next(in) returns null there, so the position increment call needs a null check):

              if( !isPhonetic ) {
                t = next(in);
                if( t != null ) {
                  t.setPositionIncrement( t.getPositionIncrement()+1 ); 
                }
                return t;
              }
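To make the position-increment behaviour easier to follow, here is a simplified, self-contained model of it. It is a sketch under stated assumptions and not the committed filter: it covers only the case where the codes replace the original terms, `Tok`, `CODES`, and `filter` are hypothetical names, and the lookup table holds illustrative placeholder codes, not verified Double Metaphone output. In this model the first code emitted for a term carries the increment and any further code for the same term gets 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PositionIncrementSketch {

    // Minimal stand-in for a token: a term plus its position increment.
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // Hypothetical table standing in for the encoder: term -> {primary, alternate}.
    // The codes are illustrative placeholders only.
    static final Map<String, String[]> CODES = new HashMap<>();
    static {
        CODES.put("aaa", new String[] {"A", "A"});
        CODES.put("bbb", new String[] {"P", "B"});
    }

    static List<Tok> filter(List<String> terms) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // input positions dropped since the last emitted term
        for (String term : terms) {
            String[] codes = CODES.getOrDefault(term, new String[] {"", ""});
            List<String> emitted = new ArrayList<>();
            if (!codes[0].isEmpty()) {
                emitted.add(codes[0]);
            }
            // The alternate is added only if non-empty and different from the primary.
            if (!codes[1].isEmpty() && !codes[1].equals(codes[0])) {
                emitted.add(codes[1]);
            }
            if (emitted.isEmpty()) {
                skipped++; // non-phonetic term: drop it, but remember the gap
                continue;
            }
            // The first code accounts for the skipped positions; the rest stack at 0.
            out.add(new Tok(emitted.get(0), 1 + skipped));
            for (int i = 1; i < emitted.size(); i++) {
                out.add(new Tok(emitted.get(i), 0));
            }
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        // "1234" yields no code, so "bbb" is still processed and its first code gets +2.
        System.out.println(filter(Arrays.asList("aaa", "1234", "bbb"))); // [A(+1), P(+2), B(+0)]
        // A stream that ends in a non-phonetic term simply ends, with no null access.
        System.out.println(filter(Arrays.asList("aaa", "1234")));        // [A(+1)]
    }
}
```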
        
        Ryan McKinley added a comment -

        Added in rev: 705903
        Thanks Todd!

        Todd Feak added a comment -

        Good catch on that bug and enhancements. I put them in my current implementation. Thank you.

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee: Ryan McKinley
          • Reporter: Todd Feak
          • Votes: 1
          • Watchers: 0
