Solr
  1. Solr
  2. SOLR-3017

Allow edismax stopword filter factory implementation to be specified

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.

      We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.

      Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

      1. SOLR-3017-without-guava-alternative.patch
        45 kB
        Michael Dodsworth
      2. SOLR-3017.patch
        45 kB
        Michael Dodsworth
      3. SOLR-3017.patch
        11 kB
        Erick Erickson
      4. edismax_stop_filter_factory.patch
        14 kB
        Michael Dodsworth

        Issue Links

          Activity

          Hide
          Michael Dodsworth added a comment -

          previous patch had issues applying

          Show
          Michael Dodsworth added a comment - previous patch had issues applying
          Hide
          Michael Dodsworth added a comment -

          all reviews welcome (sorry to pester)

          Show
          Michael Dodsworth added a comment - all reviews welcome (sorry to pester)
          Hide
          Erick Erickson added a comment -

          A couple of questions:
          1> I notice that guava is in here. The only other place I see imports for google.common is in the carrot code. Does anyone object to guava getting used in core? I only ask because it's used in so few places, do we prefer apache commons StringUtils for this kind of stuff or do we just not care?

          2> In ExtendedDismaxQParserPlugin, around lines 1140 (in noStopwordFilterAnalyzer) there are a couple of tests like:
          if (stopwordFilterFactoryClass.isInstance(tf)) {

          Scanning the code, it seems like stopwordFilterFactoryClass could be null, an NPE here seems questionable.

          Otherwise, this seems fine to me from a tactical perspective, anyone want to weigh in on whether this is a good thing overall?

          Show
          Erick Erickson added a comment - A couple of questions: 1> I notice that guava is in here. The only other place I see imports for google.common is in the carrot code. Does anyone object to guava getting used in core? I only ask because it's used in so few places, do we prefer apache commons StringUtils for this kind of stuff or do we just not care? 2> In ExtendedDismaxQParserPlugin, around lines 1140 (in noStopwordFilterAnalyzer) there are a couple of tests like: if (stopwordFilterFactoryClass.isInstance(tf)) { Scanning the code, it seems like stopwordFilterFactoryClass could be null, an NPE here seems questionable. Otherwise, this seems fine to me from a tactical perspective, anyone want to weigh in on whether this is a good thing overall?
          Hide
          Michael Dodsworth added a comment -

          Thanks for the review, Erick. Much appreciated.

          1 - I'll create an alternative patch with the guava stuff switched out for StringUtils. I'm personally a big fan of the guava lib but I'm not using it for anything too useful here.

          2 - I've actually provided ExtendedSolrQueryParser with a default value (around ExtendedDismaxQParserPlugin:851) for 'stopwordFilterFactoryClass'. It's possible that someone could call 'setStopwordFilterFactoryClass' with null, in which case we would have a NPE. I've no problem adding a defensive null check before the 'isInstance' call. The other option would be to add a @NonNull annotation to that argument...but I'm not sure if findbugs or similar is run as part of the solr build process.

          Show
          Michael Dodsworth added a comment - Thanks for the review, Erick. Much appreciated. 1 - I'll create an alternative patch with the guava stuff switched out for StringUtils. I'm personally a big fan of the guava lib but I'm not using it for anything too useful here. 2 - I've actually provided ExtendedSolrQueryParser with a default value (around ExtendedDismaxQParserPlugin:851) for 'stopwordFilterFactoryClass'. It's possible that someone could call 'setStopwordFilterFactoryClass' with null, in which case we would have a NPE. I've no problem adding a defensive null check before the 'isInstance' call. The other option would be to add a @NonNull annotation to that argument...but I'm not sure if findbugs or similar is run as part of the solr build process.
          Hide
          Michael Dodsworth added a comment -

          Alternative patch with guava calls replaced with StringUtils

          Show
          Michael Dodsworth added a comment - Alternative patch with guava calls replaced with StringUtils
          Hide
          Erick Erickson added a comment -

          new version that:
          1> removes the new schema file and just modifies schema12 instead. All tests pass with this change.

          2> Adds null check to setStopwordFilterFactoryClass rather than where it's called.

          I guess theoretically someone could override this class, override setStopwordFilterFactoryClass, call it with null and set the member var to null then encounter an NPE in noStopwordFilterAnalyzer which they couldn't fix due to scope issues. But that doesn't sound like something we need to guard against at this point.

          If nobody objects, I'll commit this over the weekend or early next week.

          Show
          Erick Erickson added a comment - new version that: 1> removes the new schema file and just modifies schema12 instead. All tests pass with this change. 2> Adds null check to setStopwordFilterFactoryClass rather than where it's called. I guess theoretically someone could override this class, override setStopwordFilterFactoryClass, call it with null and set the member var to null then encounter an NPE in noStopwordFilterAnalyzer which they couldn't fix due to scope issues. But that doesn't sound like something we need to guard against at this point. If nobody objects, I'll commit this over the weekend or early next week.
          Hide
          Michael Dodsworth added a comment -

          Cheers Erick.

          Regarding the null check, would it be better to throw an IllegalArgumentException if null is passed through? Either way, it might be a good idea to javadoc the behaviour when null is passed.

          Show
          Michael Dodsworth added a comment - Cheers Erick. Regarding the null check, would it be better to throw an IllegalArgumentException if null is passed through? Either way, it might be a good idea to javadoc the behaviour when null is passed.
          Hide
          Erick Erickson added a comment -

          Yeah, it's unclear to me what the "right thing" here is. I'm really a bit fuzzy about whether ever setting this value to null is permissible or whether preventing that is too draconian. I suppose that since we're going from hard checking for StopFilterFactory, we can at least argue that either way it's not likely to break existing code.

          Please feel free to update the patch as you see fit, then we'll wait a couple of days for anyone who looks to object and I'll check it in.

          Show
          Erick Erickson added a comment - Yeah, it's unclear to me what the "right thing" here is. I'm really a bit fuzzy about whether ever setting this value to null is permissible or whether preventing that is too draconian. I suppose that since we're going from hard checking for StopFilterFactory, we can at least argue that either way it's not likely to break existing code. Please feel free to update the patch as you see fit, then we'll wait a couple of days for anyone who looks to object and I'll check it in.
          Hide
          Yonik Seeley added a comment -

          Checking for the stop filter always felt a little hacky to me (I know... I'm the one who did it)... but specifying a classname as a query parameter feels like it goes down the wrong path (esp since seems like a very rare issue). "stopwordFilterClassName" isn't even accurate if it's supposed to be specifying the name of the factory.

          Going back to the original issue:

          Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

          The simplest solution here would seem to be to change StopFilterFactory.create() to return a TokenStream?

          Show
          Yonik Seeley added a comment - Checking for the stop filter always felt a little hacky to me (I know... I'm the one who did it)... but specifying a classname as a query parameter feels like it goes down the wrong path (esp since seems like a very rare issue). "stopwordFilterClassName" isn't even accurate if it's supposed to be specifying the name of the factory. Going back to the original issue: Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final. The simplest solution here would seem to be to change StopFilterFactory.create() to return a TokenStream?
          Hide
          Michael Dodsworth added a comment -

          That would certainly allow us to hack around the problem in a way that doesn't require a change to the query parser (i.e., extending StopFilterFactory and overriding its create method to return our own filter).

          Are we concerned about breaking code that may be calling StopFilterFactory.create() and is expecting a StopFilter (I wonder if there's a reason TokenStream wasn't used originally)?

          Agreed on the inaccurate param name. I'll fix that up in the next patch.

          Specifying the factory class name as a param is optional and, as you say, should be a rare case.
          If there's a better, more general fix for this, I'm happy to take that on.

          Show
          Michael Dodsworth added a comment - That would certainly allow us to hack around the problem in a way that doesn't require a change to the query parser (i.e., extending StopFilterFactory and overriding its create method to return our own filter). Are we concerned about breaking code that may be calling StopFilterFactory.create() and is expecting a StopFilter (I wonder if there's a reason TokenStream wasn't used originally)? Agreed on the inaccurate param name. I'll fix that up in the next patch. Specifying the factory class name as a param is optional and, as you say, should be a rare case. If there's a better, more general fix for this, I'm happy to take that on.
          Hide
          Yonik Seeley added a comment -

          Are we concerned about breaking code that may be calling StopFilterFactory.create()

          Nope. I've just committed this change in trunk. There wasn't a good reason to use a more specific type (and it was not used anywhere).

          Show
          Yonik Seeley added a comment - Are we concerned about breaking code that may be calling StopFilterFactory.create() Nope. I've just committed this change in trunk. There wasn't a good reason to use a more specific type (and it was not used anywhere).
          Hide
          Erick Erickson added a comment -

          One word fixes, I love it. Anyway, is there any good reason not to put this in 3.6? I'm running tests after that change now and I'll put it in if this seems reasonable.

          Show
          Erick Erickson added a comment - One word fixes, I love it. Anyway, is there any good reason not to put this in 3.6? I'm running tests after that change now and I'll put it in if this seems reasonable.
          Hide
          Michael Dodsworth added a comment -

          Great, thank you both.

          Show
          Michael Dodsworth added a comment - Great, thank you both.
          Hide
          Hoss Man added a comment -

          Nope. I've just committed this change in trunk. There wasn't a good reason to use a more specific type (and it was not used anywhere).

          FWIW: I'm pretty sure the only reason any of these factories are declared to return specific types (instead of just TokenStream) was SOLR-396 – which isn't really that important now that lucene & solr development is in a single repo and people can easily commit factories at the same time that new impls are added.

          Show
          Hoss Man added a comment - Nope. I've just committed this change in trunk. There wasn't a good reason to use a more specific type (and it was not used anywhere). FWIW: I'm pretty sure the only reason any of these factories are declared to return specific types (instead of just TokenStream) was SOLR-396 – which isn't really that important now that lucene & solr development is in a single repo and people can easily commit factories at the same time that new impls are added.
          Hide
          Michael Dodsworth added a comment -

          Yonik's fix resolves this. Much appreciated.

          Show
          Michael Dodsworth added a comment - Yonik's fix resolves this. Much appreciated.

            People

            • Assignee:
              Unassigned
              Reporter:
              Michael Dodsworth
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development