Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: None
    • Labels:
      None

      Description

      This just exposes LUCENE-2892, so you can easily configure things
      so that when users put terms in double quotes they get a more precise search.

      Attachments

    1. SOLR-2477.patch
        15 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        here's my example fieldtype from the test:

              <analyzer type="index">
                <!--  pretty standard, except stopwords are indexed, and WDF preserves -->
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"  preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
                <filter class="solr.PorterStemFilterFactory"/>
              </analyzer>
              <analyzer type="query">
                <!--  remove stopwords, expand synonyms, WDF, etc etc. -->
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
                <filter class="solr.PorterStemFilterFactory"/>
              </analyzer>
              <analyzer type="phrase">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <!--  in this case no synonyms are expanded, and the exact stopwords, punctuation, etc must be present  -->
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
                <filter class="solr.PorterStemFilterFactory"/>
              </analyzer>
            </fieldType>
        
        Yonik Seeley added a comment -

        Interesting idea having a separate analyzer to expose this.
        It's probably important to come up with a good example for the example schema, because I could see it being error-prone if people do it themselves. For example, if they tried your test example (which may look reasonable to someone at first blush),
        they wouldn't get any matches for anything that the WDF would normally split?

        Robert Muir added a comment -

        Well, we could maybe add something to the example; I thought of it as sort of an expert feature.

        In my example, they would get matches for things that WDF normally splits, but only if the punctuation is exactly as they entered it.
        Assume doc 3 is 'foo bar' and doc 4 is 'foo-bar':

          /** 
           * test punctuation, we preserve the original for this purpose
           */
          public void testPunctuation() {
            assertQ("normal query: ",
               req("fl", "id", "q", "foo-bar", "sort", "id asc" ),
                      "//*[@numFound='2']",
                      "//result/doc[1]/int[@name='id'][.=3]",
                      "//result/doc[2]/int[@name='id'][.=4]"
            );
            
            assertQ("phrase query: ",
                req("fl", "id", "q", "\"foo-bar\"", "sort", "id asc" ),
                       "//*[@numFound='1']",
                       "//result/doc[1]/int[@name='id'][.=4]"
             );
          }
        

        But this was just an example; you don't have to involve WDF to take advantage of this (stopwords/synonyms/decompounders are probably the simplest way). I was just coming up with an example to have some unit tests.
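
        For the stopword case, a minimal sketch (the field type name is made up; it reuses the stock stopwords.txt): stopwords stay in the index and in the phrase analyzer but are removed by the query analyzer, so only quoted queries require them to be present:

          <fieldType name="text_stop_phrase" class="solr.TextField" positionIncrementGap="100">
            <!-- index: stopwords are indexed, so exact phrases containing them can match -->
            <analyzer type="index">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <!-- query: stopwords removed for ordinary (unquoted) terms -->
            <analyzer type="query">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <!-- phrase: no stopword removal, so quoted queries must contain the stopwords verbatim -->
            <analyzer type="phrase">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
          </fieldType>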

        Yonik Seeley added a comment -

        Well in my example, they would get matches for things that WDF normally splits, but only if the punctuation is exactly as they entered it

        Ah, I had missed the "preserveOriginal" on the index analyzer.
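
        For reference, with preserveOriginal="1" the index-time WDF keeps the unsplit token alongside the parts it generates, so 'foo-bar' is indexed roughly as:

          foo-bar     (the preserved original)
          foo, bar    (generateWordParts)
          foobar      (catenateWords)

        which is why the strict phrase query "foo-bar" still finds doc 4.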

        Robert Muir added a comment -

        Yeah, still, even then, if we want something for the example, maybe it's enough to just exclude the SynonymFilter?
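
        In other words, the example's phrase analyzer could simply be the query analyzer minus the synonym filter. A sketch, copying the query analyzer from the test field type above and dropping solr.SynonymFilterFactory:

          <analyzer type="phrase">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <!-- identical to the query analyzer, just without the synonym expansion -->
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
            <filter class="solr.PorterStemFilterFactory"/>
          </analyzer>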

        Hoss Man added a comment -

        At first glance this looks great to me ... but we should seriously consider whether FieldQParser should also be using getPhraseAnalyzer. I think given the semantics the answer is "yes" – but either way it should be clearly documented.

        We should also make sure analysis.jsp and the Analysis RequestHandler(s?) have options for using this.

        Robert Muir added a comment -

        but we should seriously consider whether FieldQParser should also be using getPhraseAnalyzer.

        Looking at how this is described, it seems to me it should use the phrase analyzer... we can document that it does this, and of course the change is backwards compatible (because if you don't define a phrase analyzer, it's just your query analyzer).
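
        A minimal sketch of that fallback, assuming the patch exposes it as a getPhraseAnalyzer() accessor on FieldType (the name Hoss used above); the body here is illustrative, not the patch itself:

          // fall back to the query analyzer when no <analyzer type="phrase"> is configured,
          // so existing schemas keep behaving exactly as before
          public Analyzer getPhraseAnalyzer() {
            return phraseAnalyzer != null ? phraseAnalyzer : getQueryAnalyzer();
          }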

        we should also make sure analysis.jsp and the Analysis RequestHandler(s?) have options for using this.

        I agree... hopefully this isn't too bad.

        Hoss Man added a comment -

        Having just looked at this code in SOLR-2663, I'm realizing that as we add more types of analyzers, we should really clean up the semantics of how analyzers without a "type" attribute are treated, and how each of the analyzers defaults if it isn't specified.

        Consider the following (contrived) example...

        <fieldType name="hoss" class="solr.TextField" positionIncrementGap="100">
           <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           </analyzer>
           <analyzer type="index">
             <tokenizer class="solr.KeywordTokenizerFactory"/>
           </analyzer>
        </fieldType>
        

        Right now (on trunk and with this patch) that config will result in all of the analyzers (index/query[/phrase]) using KeywordTokenizerFactory, because the type-less analyzer is ignored if there is an analyzer with type="index". I don't think that makes much sense, and as we add more types of analyzers it makes even less sense – an analyzer without a type attribute should really be the "default" for each other type.

        I think we should change the overall flow to be (pseudo-code) ...

        
        // exactly what is in the config
        Analyzer defaultA = readAnalyzer(xpath("./analyzer[not(@type)]"));
        Analyzer indexA = readAnalyzer(xpath("./analyzer[@type='index']"));
        Analyzer queryA = readAnalyzer(xpath("./analyzer[@type='query']"));
        Analyzer phraseA = readAnalyzer(xpath("./analyzer[@type='phrase']"));
        
        if (null != defaultA) {
          // we have an explicit default
          if (null == indexA) indexA = defaultA;
          if (null == queryA) queryA = defaultA;
          if (null == phraseA) phraseA = defaultA;
        } else {
          // implicit defaults, either historical or common sense
          if (null == queryA) queryA = indexA;
          if (null == phraseA) phraseA = queryA;
        }
        
        Robert Muir added a comment -

        +1

        If we decide to implement this or SOLR-219 via 'types of analyzers', I don't want to think of all the combinations if we do it any other way.

        I would even go so far as to say: don't call it defaultA but globalA, and if you declare this thing and then also declare some specific analyzer,
        we throw an exception.
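
        A sketch of that stricter variant, reusing the names from the pseudo-code above (globalA instead of defaultA):

          Analyzer globalA = readAnalyzer(xpath("./analyzer[not(@type)]"));
          Analyzer indexA  = readAnalyzer(xpath("./analyzer[@type='index']"));
          Analyzer queryA  = readAnalyzer(xpath("./analyzer[@type='query']"));
          Analyzer phraseA = readAnalyzer(xpath("./analyzer[@type='phrase']"));

          if (null != globalA) {
            // a global analyzer may not be combined with per-type analyzers
            if (null != indexA || null != queryA || null != phraseA) {
              throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                  "fieldType declares both a global <analyzer> and a typed <analyzer>");
            }
            indexA = queryA = phraseA = globalA;
          } else {
            // implicit defaults, either historical or common sense
            if (null == queryA)  queryA  = indexA;
            if (null == phraseA) phraseA = queryA;
          }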

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Solr 4.9.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            2
