Solr / SOLR-5212

bad qs and mm when using edismax for field with CJKBigramFilter

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 4.4
    • Fix Version/s: None
    • Component/s: search
    • Labels: None

    Description

      When I have a field using CJKBigramFilter, a mysterious qs value (or what I take to be qs, because it shows as ~x after the first DisjunctionMaxQuery) appears in my parsed query. The value that appears is the minimum of the mm setting and the number of bigrams in the query string.

      This makes no sense from a retrieval standpoint. It could possibly make sense to adjust the ps value, but certainly not the qs. Moreover, changing the mm setting via an HTTP param affects the qs in the parsed query, while sending in a qs parameter has no effect on it.

      If I use a field in qf that has only bigrams, then qs is set to MIN(original mm setting, number of bigrams in query string)

      arg sent in: q={!qf=cjk_bi_search pf= pf2= pf3=}旧小说
      旧小说 is 3 chars, so 2 bigrams

      debugQuery
      <str name="rawquerystring">{!qf=cjk_bi_search pf= pf2= pf3=}旧小说</str>
      <str name="querystring">{!qf=cjk_bi_search pf= pf2= pf3=}旧小说</str>
      <str name="parsedquery">(+DisjunctionMaxQuery((((cjk_bi_search:旧小 cjk_bi_search:小说)~2))~0.01) ())/no_coord</str>
      <str name="parsedquery_toString">+(((cjk_bi_search:旧小 cjk_bi_search:小说)~2))~0.01 ()</str>
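
To compare the three cases in this report, it can help to pull the ~x value out of the parsedquery_toString output programmatically. The following is a minimal sketch; the helper name `group_suffix` and the regular expression are illustrative assumptions, not part of Solr. It reads the integer that follows the first parenthesized clause group and ignores the ~0.01 tie-breaker.

```python
import re

# Matches ")~<integer>)": a clause group followed by an integer suffix.
# The tie-breaker "~0.01" is skipped because a "." follows its first digit.
GROUP_SUFFIX = re.compile(r"\)~(\d+)\)")


def group_suffix(parsed_query: str):
    """Return the integer after the first clause group, or None if absent."""
    match = GROUP_SUFFIX.search(parsed_query)
    return int(match.group(1)) if match else None


parsed = "+(((cjk_bi_search:旧小 cjk_bi_search:小说)~2))~0.01 ()"
print(group_suffix(parsed))
```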

      If I use a field in qf that has only unigrams, then qs is set to MIN(original mm setting, number of unigrams in query string)

      arg sent in: q={!qf=cjk_uni_search pf= pf2= pf3=}旧小说
      旧小说 is 3 chars, so 3 unigrams

      debugQuery
      <str name="rawquerystring">{!qf=cjk_uni_search pf= pf2= pf3=}旧小说</str>
      <str name="querystring">{!qf=cjk_uni_search pf= pf2= pf3=}旧小说</str>
      <str name="parsedquery">(+DisjunctionMaxQuery((((cjk_uni_search:旧 cjk_uni_search:小 cjk_uni_search:说)~3))~0.01) ())/no_coord</str>
      <str name="parsedquery_toString">+(((cjk_uni_search:旧 cjk_uni_search:小 cjk_uni_search:说)~3))~0.01 ()</str>

      If I use a field in qf that has both bigrams and unigrams, then qs is set to MIN(original mm setting, number of bigrams + unigrams in query string).

      arg sent in: q={!qf=cjk_both_search pf= pf2= pf3=}旧小说
      旧小说 is 3 chars, so 3 unigrams + 2 bigrams = 5

      debugQuery
      <str name="rawquerystring">{!qf=cjk_both_pub_search pf= pf2= pf3=}旧小说</str>
      <str name="querystring">{!qf=cjk_both_pub_search pf= pf2= pf3=}旧小说</str>
      <str name="parsedquery">(+DisjunctionMaxQuery((((cjk_both_search:旧 cjk_both_search:旧小 cjk_both_search:小 cjk_both_search:小说 cjk_both_search:说)~5))~0.01) ())/no_coord</str>
      <str name="parsedquery_toString">+(((cjk_both_search:旧 cjk_both_search:旧小 cjk_both_search:小 cjk_both_search:小说 cjk_both_search:说)~5))~0.01 ()</str>
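
The clause counts in the three cases above (2, 3, and 5) can be reproduced with a small sketch. This only approximates what the analysis chain emits for a run of adjacent Han characters; it is not CJKBigramFilter itself, and the function name `cjk_tokens` and its flags are hypothetical.

```python
def cjk_tokens(text: str, bigrams: bool = True, unigrams: bool = False) -> list:
    """Approximate the query tokens for a run of adjacent CJK characters.

    bigrams=True, unigrams=False  ~ outputUnigrams="false" (bigram-only field)
    bigrams=True, unigrams=True   ~ outputUnigrams="true"  (both)
    bigrams=False, unigrams=True  ~ no bigram filter       (unigram-only field)
    """
    tokens = []
    for i, ch in enumerate(text):
        if unigrams:
            tokens.append(ch)
        # A bigram starts at every character that has a successor.
        if bigrams and i + 1 < len(text):
            tokens.append(text[i:i + 2])
    return tokens


query = "旧小说"
print(len(cjk_tokens(query)))                                # bigram-only field
print(len(cjk_tokens(query, bigrams=False, unigrams=True)))  # unigram-only field
print(len(cjk_tokens(query, unigrams=True)))                 # field with both
```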

      I am running Solr 4.4. I have fields defined like so:

      <fieldtype name="text_cjk_both" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
      <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory" />
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
      </analyzer>
      </fieldtype>
      <fieldtype name="text_cjk_bi" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
      <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory" />
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
      </analyzer>
      </fieldtype>
      <fieldtype name="text_cjk_uni" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
      <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory" />
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
      </fieldtype>

      The request handler uses edismax:

      <requestHandler name="search" class="solr.SearchHandler" default="true">
      <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="q.alt">*:*</str>
      <str name="mm">6&lt;-1 6&lt;90%</str>
      <int name="qs">1</int>
      <int name="ps">0</int>
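
For reference, the way an mm spec such as the one configured above resolves for a given number of optional clauses can be sketched as follows. This is an approximation written from the documented mm semantics, not Solr's own code, and the function name `min_should_match` is hypothetical. In the string form of a Lucene BooleanQuery, a ~N after a parenthesized clause group denotes the minimum number of optional clauses that must match, which would fit the MIN(mm setting, clause count) pattern reported here and the "Not A Problem" resolution.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Resolve an mm spec to a clause count (sketch of the documented rules)."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional parts "bound<value", evaluated left to right. A part
        # applies only when the clause count exceeds its bound.
        for part in spec.split():
            bound, value = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        # Percentage of the clauses, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # A negative value means "all clauses except this many".
    result = clause_count + calc if calc < 0 else calc
    return min(clause_count, max(result, 0))


spec = "6<-1 6<90%"
for n in (2, 3, 5):
    # With six or fewer clauses, every clause is required.
    print(n, min_should_match(n, spec))
```

Under this reading, a query producing 2, 3, or 5 clauses falls below the threshold of 6 in the spec, so every clause is required and the value equals the clause count in each case.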

            People

              Assignee: Unassigned
              Reporter: Naomi Dushay (ndushay)