Lucene - Core
  LUCENE-5559

Argument validation for TokenFilters having numeric constructor parameter(s)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.7
    • Fix Version/s: 4.8, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Some TokenFilters have numeric arguments in their constructors. They should throw IllegalArgumentException for negative or meaningless values.

      Here are some examples that demonstrate invalid/meaningless arguments:

       <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="-10" />
      
       <filter class="solr.LengthFilterFactory" min="-5" max="-1" />
      
       <filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="-3" />
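
      A check like the following in the filter constructor would reject such values up front (a minimal sketch modelled on LimitTokenCountFilter's three-argument constructor; the exact message text is illustrative):

        public LimitTokenCountFilter(TokenStream in, int maxTokenCount, boolean consumeAllTokens) {
          super(in);
          if (maxTokenCount < 1) {
            throw new IllegalArgumentException("maxTokenCount must be greater than zero");
          }
          this.maxTokenCount = maxTokenCount;
          this.consumeAllTokens = consumeAllTokens;
        }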
      
      1. LUCENE-5559.patch
        8 kB
        Ahmet Arslan
      2. LUCENE-5559.patch
        32 kB
        Ahmet Arslan
      3. LUCENE-5559.patch
        12 kB
        Ahmet Arslan
      4. LUCENE-5559.patch
        11 kB
        Ahmet Arslan
      5. LUCENE-5559.patch
        7 kB
        Ahmet Arslan

        Activity

        Ahmet Arslan added a comment -

        NGramTokenFilter, EdgeNGramTokenFilter, NGramTokenizer and EdgeNGramTokenizer have this check:

          
            if (minGram < 1) {
              throw new IllegalArgumentException("minGram must be greater than zero");
            }
            if (minGram > maxGram) {
              throw new IllegalArgumentException("minGram must not be greater than maxGram");
            }
        

        Is it necessary to add checks to Factory classes too?

        Robert Muir added a comment -

        I don't think the factories need anything if you will already hit the check from the Filter itself.

        But if the factory would crash in some way that is confusing on certain inputs, then it would need a check too.
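
        In the common case a factory only parses the attribute and delegates, so the filter's own constructor check is the one that fires. A hypothetical sketch, loosely following LimitTokenCountFilterFactory (the helper calls and names here are assumptions, not the committed code):

          public class LimitTokenCountFilterFactory extends TokenFilterFactory {
            private final int maxTokenCount;

            public LimitTokenCountFilterFactory(Map<String, String> args) {
              super(args);
              maxTokenCount = requireInt(args, "maxTokenCount"); // only parses, no range check here
              if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
              }
            }

            @Override
            public TokenStream create(TokenStream input) {
              // a negative maxTokenCount is rejected by the filter's own constructor
              return new LimitTokenCountFilter(input, maxTokenCount);
            }
          }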

        Ahmet Arslan added a comment -

        initial patch for LimitTokenCountFilter, LengthFilter and LimitTokenPositionFilter. This patch adds new TestLimitTokenCountFilter class. TokenFilters are responsible for argument validation.

        Ahmet Arslan added a comment -

        ShingleFilter is OK in that sense. The only difference is that the validity check is repeated in ShingleFilterFactory too.
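
        The kind of check meant here looks roughly like this (a sketch based on ShingleFilter's shingle-size setter; the exact messages are illustrative):

          public void setMinShingleSize(int minShingleSize) {
            if (minShingleSize < 2) {
              throw new IllegalArgumentException("Min shingle size must be >= 2");
            }
            if (minShingleSize > maxShingleSize) {
              throw new IllegalArgumentException("Min shingle size must be <= max shingle size");
            }
            this.minShingleSize = minShingleSize;
          }

        ShingleFilterFactory repeats essentially the same test before constructing the filter.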

        Ahmet Arslan added a comment -

        UAX29URLEmailTokenizer, StandardTokenizer and ClassicTokenizer added.
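
        These tokenizers take a maximum token length, so the added check is presumably along these lines (a sketch modelled on setMaxTokenLength; the message text is illustrative):

          public void setMaxTokenLength(int length) {
            if (length < 1) {
              throw new IllegalArgumentException("maxTokenLength must be greater than zero");
            }
            this.maxTokenLength = length;
          }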

        Ahmet Arslan added a comment -

        DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, PathHierarchyTokenizer and ReversePathHierarchyTokenizer are OK.

        TokenRangeSinkFilter has (int lower, int upper) arguments, but I am not sure about their valid values.
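
        Assuming lower must be positive and must not exceed upper, one plausible validation would be the following (a guess, since the valid range is exactly what is unclear here):

          TokenRangeSinkFilter(int lower, int upper) {
            if (lower < 1) {
              throw new IllegalArgumentException("lower must be greater than zero");
            }
            if (lower > upper) {
              throw new IllegalArgumentException("lower must not be greater than upper");
            }
            this.lower = lower;
            this.upper = upper;
          }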

        Ahmet Arslan added a comment -

        TokenRangeSinkFilter added.

        Ahmet Arslan added a comment -

        In SinkTokenTest files, the license header starts with this line:

        Copyright 2004 The Apache Software Foundation 
        Ahmet Arslan added a comment -

        TokenRangeSinkFilter added.

        Uwe Schindler added a comment - edited

        Hi Ahmet,

        +  public void test() throws Exception {
        +    MockTokenizer tokenizer = whitespaceMockTokenizer("A1 B2 C3 D4 E5 F6");
        +    // LimitTokenCountFilter doesn't consume the entire stream that it wraps
        +    tokenizer.setEnableChecks(false);
        +    TokenStream stream = new LimitTokenCountFilter(tokenizer, 3);
        +    assertTokenStreamContents(stream, new String[]{"A1", "B2", "C3"});
        +  }
        

        The LimitTokenCount filter (and others like LimitPosition*) has the option to consume all tokens. It might be better to also test this configuration, so the tokenizer can have checks enabled.

        Ahmet Arslan added a comment -

        Hi Uwe,

        I copied/applied the following 'for loop' logic from TestLimitTokenCountAnalyzer to TestLimitToken(Position|Count)Filter(Factory):

        for (boolean consumeAll : new boolean[] { true, false }) {
          MockAnalyzer mock = new MockAnalyzer(random());
          // if we are consuming all tokens, we can use the checks, otherwise we can't
          mock.setEnableChecks(consumeAll);
          new LimitTokenCountAnalyzer(mock, 2, consumeAll);
        }
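
        Applied to the filter itself, the reworked test then looks roughly like this (a sketch; it assumes LimitTokenCountFilter's three-argument constructor with a consumeAllTokens flag):

          for (boolean consumeAll : new boolean[] { true, false }) {
            MockTokenizer tokenizer = whitespaceMockTokenizer("A1 B2 C3 D4 E5 F6");
            // checks can stay enabled only when the filter consumes the whole stream
            tokenizer.setEnableChecks(consumeAll);
            TokenStream stream = new LimitTokenCountFilter(tokenizer, 3, consumeAll);
            assertTokenStreamContents(stream, new String[] { "A1", "B2", "C3" });
          }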
        
        ASF subversion and git services added a comment -

        Commit 1583530 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1583530 ]

        LUCENE-5559: Add missing checks to TokenFilters with numeric arguments

        ASF subversion and git services added a comment -

        Commit 1583531 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1583531 ]

        LUCENE-5559: Add missing checks to TokenFilters with numeric arguments

        Robert Muir added a comment -

        Thanks for cleaning this up Ahmet!

        Ahmet Arslan added a comment -

        This patch added checks for two overlooked TokenFilters: CapitalizationFilter and CodepointCountFilter.
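
        For CodepointCountFilter the added check is presumably the usual min/max pattern (a sketch with simplified constructor arguments; the exact messages are illustrative):

          public CodepointCountFilter(TokenStream in, int min, int max) {
            super(in);
            if (min < 0) {
              throw new IllegalArgumentException("minimum length must be greater than or equal to zero");
            }
            if (min > max) {
              throw new IllegalArgumentException("minimum length must not be greater than maximum length");
            }
            this.min = min;
            this.max = max;
          }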

        Ahmet Arslan added a comment -

        Pinging Robert Muir, in case there is interest in the last patch, which covers two overlooked TokenFilters: CapitalizationFilter and CodepointCountFilter.

        Robert Muir added a comment -

        Oops, looks like I missed this patch? Thanks Ahmet, I will take care of it.

        Ahmet Arslan added a comment -

        looks like i missed this patch?

        No, actually I found those two after your commit.

        ASF subversion and git services added a comment -

        Commit 1589919 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1589919 ]

        LUCENE-5559: additional argument validation for CapitalizationFilter and CodepointCountFilter

        ASF subversion and git services added a comment -

        Commit 1589922 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1589922 ]

        LUCENE-5559: additional argument validation for CapitalizationFilter and CodepointCountFilter

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        ASF subversion and git services added a comment -

        Commit 1592590 from Robert Muir in branch 'dev/branches/lucene_solr_4_8'
        [ https://svn.apache.org/r1592590 ]

        LUCENE-5559: additional argument validation for CapitalizationFilter and CodepointCountFilter


          People

          • Assignee: Unassigned
          • Reporter: Ahmet Arslan
          • Votes: 0
          • Watchers: 2
