Lucene - Core
  1. Lucene - Core
  2. LUCENE-5437

ASCIIFoldingFilter that emits both unfolded and folded tokens

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.7, 6.0
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.

        Activity

        Hide
        Nik Everett added a comment -

        Sorry for moving so much code.

        Show
        Nik Everett added a comment - Sorry for moving so much code.
        Hide
        Simon Willnauer added a comment -

        nik, can't we just use the keyword attribute and support it in ASCIIFoldingFilter? this should be an easy one no?

        Show
        Simon Willnauer added a comment - nik, can't we just use the keyword attribute and support it in ASCIIFoldingFilter? this should be an easy one no?
        Hide
        Nik Everett added a comment -

        I thought about that but my instinct was that duplicating with the keyword attribute would add overhead in the case where there aren't characters to fold which is by far the more common case. I think you'd also have to make supporting the keyword attribute optional so it wouldn't break backwards compatibility. I figured optionally supporting the keyword attribute would be about the same amount of work/code as only duplicating when required so I went that way. I went with adding the extra class and moving the real implementation to an absract base class more out of desire to be minimally invasive to the original then anything technical.

        Show
        Nik Everett added a comment - I thought about that but my instinct was that duplicating with the keyword attribute would add overhead in the case where there aren't characters to fold which is by far the more common case. I think you'd also have to make supporting the keyword attribute optional so it wouldn't break backwards compatibility. I figured optionally supporting the keyword attribute would be about the same amount of work/code as only duplicating when required so I went that way. I went with adding the extra class and moving the real implementation to an absract base class more out of desire to be minimally invasive to the original then anything technical.
        Hide
        Simon Willnauer added a comment -

        well the keyword attribute is supported all over the place for that purpose. For BW compat we have the version attribute and to remove duplicates there is RemoveDuplicatesTokenFilter. We should really think about if we want to add an abstract class here. I don't think so to be honest. Another way is to make it an option or to use a delegator and store the state in the delegate TokenFilter?

        Show
        Simon Willnauer added a comment - well the keyword attribute is supported all over the place for that purpose. For BW compat we have the version attribute and to remove duplicates there is RemoveDuplicatesTokenFilter. We should really think about if we want to add an abstract class here. I don't think so to be honest. Another way is to make it an option or to use a delegator and store the state in the delegate TokenFilter?
        Hide
        Nik Everett added a comment -

        I suppose I'm just used to abstract classes but you are right, the delegate would work better here. I'll make that change. Before I do, though, does my argument (more instinct, really) about only cloning the token if there is anything to fold make sense? If not I'll just add support for the keyword attribute with a version check.

        Show
        Nik Everett added a comment - I suppose I'm just used to abstract classes but you are right, the delegate would work better here. I'll make that change. Before I do, though, does my argument (more instinct, really) about only cloning the token if there is anything to fold make sense? If not I'll just add support for the keyword attribute with a version check.
        Hide
        Simon Willnauer added a comment -

        well I mean actually doing the delegate in user land aka in your code. I could see this being useful but why makeing such an effort, can't you just add a boolean to the ctor and implement it in the filter?

        Show
        Simon Willnauer added a comment - well I mean actually doing the delegate in user land aka in your code. I could see this being useful but why makeing such an effort, can't you just add a boolean to the ctor and implement it in the filter?
        Hide
        Nik Everett added a comment -

        Patch that uses a simple boolean rather than crazy subclassing.

        Show
        Nik Everett added a comment - Patch that uses a simple boolean rather than crazy subclassing.
        Hide
        Nik Everett added a comment -

        Minor improvement in the names of things in the tests.

        Show
        Nik Everett added a comment - Minor improvement in the names of things in the tests.
        Hide
        Simon Willnauer added a comment -

        I like this patch much better. I wonder if we can come up with a better naming for the variables / options. In PatternCaptureGroupTokenFilter we use preserveOriginal which I kind of like better. For the tests I wonder if you can make assertNextTerms accept a ASCIIFoldingFilter instead of a TokenFilter and add a getter to it so we can randomly set if the orig should be preserved? That would also prevent the subclass?

        Show
        Simon Willnauer added a comment - I like this patch much better. I wonder if we can come up with a better naming for the variables / options. In PatternCaptureGroupTokenFilter we use preserveOriginal which I kind of like better. For the tests I wonder if you can make assertNextTerms accept a ASCIIFoldingFilter instead of a TokenFilter and add a getter to it so we can randomly set if the orig should be preserved? That would also prevent the subclass?
        Hide
        Nik Everett added a comment -

        Uploading new diff with changes Simon asked for.

        Show
        Nik Everett added a comment - Uploading new diff with changes Simon asked for.
        Hide
        Simon Willnauer added a comment -

        this looks good to me - I plan to commit this soon

        Show
        Simon Willnauer added a comment - this looks good to me - I plan to commit this soon
        Hide
        ASF subversion and git services added a comment -

        Commit 1567595 from Simon Willnauer in branch 'dev/trunk'
        [ https://svn.apache.org/r1567595 ]

        LUCENE-5437: ASCIIFoldingFilter now has an option to preserve the original token

        Show
        ASF subversion and git services added a comment - Commit 1567595 from Simon Willnauer in branch 'dev/trunk' [ https://svn.apache.org/r1567595 ] LUCENE-5437 : ASCIIFoldingFilter now has an option to preserve the original token
        Hide
        ASF subversion and git services added a comment -

        Commit 1567597 from Simon Willnauer in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1567597 ]

        LUCENE-5437: ASCIIFoldingFilter now has an option to preserve the original token

        Show
        ASF subversion and git services added a comment - Commit 1567597 from Simon Willnauer in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1567597 ] LUCENE-5437 : ASCIIFoldingFilter now has an option to preserve the original token
        Hide
        Simon Willnauer added a comment -

        thanks Nik

        Show
        Simon Willnauer added a comment - thanks Nik
        Hide
        Ahmet Arslan added a comment -

        ASCIIFoldingFilter's preserveOriginal=true breaks wildcard queries, since it is a MultiTermAwareComponent.

          <fieldType name="text_ascii_preserve" class="solr.TextField" positionIncrementGap="100">
              <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>            
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
              </analyzer>      
            </fieldType>
        

        With the above type, q=manu:Belkı* yields:

         "error": {
            "msg": "analyzer returned too many terms for multiTerm term: Belkı",
            "code": 400
          } 
        

        I think preserveOriginal is dangerous for token filters that implement MultiTermAwareComponent.
        What is the preferred action here? Document this limitation/behavior? Or consider this as a bug and open a jira?

        Show
        Ahmet Arslan added a comment - ASCIIFoldingFilter's preserveOriginal=true breaks wildcard queries, since it is a MultiTermAwareComponent. <fieldType name= "text_ascii_preserve" class= "solr.TextField" positionIncrementGap= "100" > <analyzer> <tokenizer class= "solr.StandardTokenizerFactory" /> <filter class= "solr.LowerCaseFilterFactory" /> <filter class= "solr.ASCIIFoldingFilterFactory" preserveOriginal= "true" /> </analyzer> </fieldType> With the above type, q=manu:Belkı* yields: "error" : { "msg" : "analyzer returned too many terms for multiTerm term: Belkı" , "code" : 400 } I think preserveOriginal is dangerous for token filters that implement MultiTermAwareComponent. What is the preferred action here? Document this limitation/behavior? Or consider this as a bug and open a jira?
        Hide
        Ahmet Arslan added a comment -

        How about this change in ASCIIFoldingFilterFactory?. Aways return preserveOriginal=false in getMultiTermComponent so we don't get exceptions.

        @Override
          public AbstractAnalysisFactory getMultiTermComponent() {
            if (preserveOriginal) {
              Map<String, String> args = new HashMap<>(1);
              args.put("preserveOriginal", "false");
              return new ASCIIFoldingFilterFactory(args);
            } else
              return this;
          }
        
        Show
        Ahmet Arslan added a comment - How about this change in ASCIIFoldingFilterFactory ?. Aways return preserveOriginal=false in getMultiTermComponent so we don't get exceptions. @Override public AbstractAnalysisFactory getMultiTermComponent() { if (preserveOriginal) { Map< String , String > args = new HashMap<>(1); args.put( "preserveOriginal" , " false " ); return new ASCIIFoldingFilterFactory(args); } else return this ; }

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Nik Everett
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development