Solr / SOLR-1196

Incorrect matches when using non alphanumeric search string !@#$%\^\&\*\(\)

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Solr 1.3/ Java 1.6/ Win XP/Eclipse 3.3

      Description

      When matching strings that do not include alphanumeric chars, all the data is returned as matches. (There is actually no match, so nothing should be returned.)

      When I run a query like - (activity_type:NAME) AND title:(!@#$%^&*()), all the documents are returned even though there is not a single match. No title matches the string (which has been escaped).
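      For reference, escaping a search string for the Lucene query parser can be sketched as below. This is only an illustration: the helper name `escape_query_chars` is hypothetical, and the character set is an assumption based on the special characters listed in the Lucene query syntax documentation (with `&&` and `||` handled one character at a time). Note that escaping only protects the string from the query parser; the field's analyzer still receives the raw characters afterwards.

```python
# Characters the Lucene query parser treats as syntax (assumed list,
# taken from the Lucene query syntax documentation).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_chars(text):
    """Backslash-escape every query-syntax character in `text`."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    # Characters such as @, #, $ and % are not query syntax and pass through.
    print(escape_query_chars("!@#$%^&*()"))
```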

      My document structure is as follows

      <doc>
      <str name="activity_type">NAME</str>
      <str name="title">Bathing</str>
      ....
      </doc>

      The title field is of type text_title which is described below.

      <fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
      </fieldType>

      -----------------------------------------------------
      Yonik's analysis is as follows.

      <str name="rawquerystring">-features:foo features:(!@#$%^&*())</str>
      <str name="querystring">-features:foo features:(!@#$%^&*())</str>
      <str name="parsedquery">-features:foo</str>
      <str name="parsedquery_toString">-features:foo</str>

      The text analysis is throwing away non alphanumeric chars (probably
      the WordDelimiterFilter). The Lucene (and Solr) query parser throws
      away term queries when the token is zero length (after analysis).
      Solr then interprets the left over "-features:foo" as "all documents
      not containing foo in the features field", so you get a bunch of
      matches.
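      The behaviour described above can be reproduced with a small simulation. This is a rough sketch, not Solr or Lucene code: `analyze`, `parse`, and `search` are hypothetical helpers, the regular expression is only a stand-in for the WordDelimiterFilter, the sample documents are invented, and non-prohibited clauses are treated as required to keep the matching logic short.

```python
import re


def analyze(text):
    """Whitespace-tokenize, strip non-alphanumerics, lowercase.

    A crude stand-in for the text_title analysis chain; a token made only
    of punctuation comes out as an empty string.
    """
    return [re.sub(r"[^0-9A-Za-z]", "", raw).lower() for raw in text.split()]


def parse(clauses, drop_empty=True):
    """Turn (prohibited, field, text) clauses into analyzed clauses.

    With drop_empty=True, a clause whose analysis leaves no non-empty
    token is silently discarded, as the analysis above describes.
    """
    parsed = []
    for prohibited, field, text in clauses:
        terms = [t for t in analyze(text) if t]
        if terms or not drop_empty:
            parsed.append((prohibited, field, terms))
    return parsed


def search(docs, clauses, drop_empty=True):
    """Return the documents that satisfy every remaining clause."""
    results = []
    for doc in docs:
        matches = True
        for prohibited, field, terms in parse(clauses, drop_empty):
            doc_terms = set(analyze(doc.get(field, "")))
            hit = any(term in doc_terms for term in terms)
            if hit == prohibited:  # prohibited hit, or required miss
                matches = False
        if matches:
            results.append(doc)
    return results


# Invented sample data for the illustration.
DOCS = [
    {"id": "1", "features": "foo bar"},
    {"id": "2", "features": "baz"},
    {"id": "3", "features": "qux"},
]
# Models: -features:foo plus a punctuation-only clause on features.
QUERY = [(True, "features", "foo"), (False, "features", "!@#$%^&*()")]

if __name__ == "__main__":
    print([d["id"] for d in search(DOCS, QUERY)])
    print([d["id"] for d in search(DOCS, QUERY, drop_empty=False)])
```

      With the empty clause dropped, only the negative clause is left and every document without "foo" is returned. Keeping the clause (`drop_empty=False`) yields no results, which is the outcome the reporter expected.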

      Per his suggestion, I have filed this bug.

        Activity

        Sam Michael created issue -
        Oystein Steimler added a comment -

        This looks similar to this scenario:

        <doc>
        <str name="id">1</str>
        <str name="phoneno">abc</str>
        </doc>

        The field 'phoneno' is among other steps analyzed like this:

        <filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9]" replacement="" replace="all" />

        When using a dismax handler containing the field phoneno, the document id=1
        will match on every query phrase. (I guess this is the same as matching any
        query on the field)
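
        A minimal sketch of why this could happen, assuming the empty token produced by the filter is kept at both index time and query time (the comment does not confirm this mechanism). The function `digits_only` is a hypothetical stand-in for the PatternReplaceFilterFactory configuration shown above.

```python
import re


def digits_only(token):
    """Mimic pattern="[^0-9]" replacement="" replace="all" on one token."""
    return re.sub(r"[^0-9]", "", token)


# The stored value "abc" contains no digits, so it is reduced to "".
indexed_term = digits_only("abc")

# Any digit-free query word is reduced to the same empty term, so under
# the assumption above it would compare equal to the indexed term.
query_term = digits_only("hello")

if __name__ == "__main__":
    print(repr(indexed_term), repr(query_term), indexed_term == query_term)
    print(digits_only("555-1234"))
```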

        Sami Siren made changes -
        Field Original Value New Value
        Component/s clients - java [ 12311580 ]
        Erick Erickson added a comment -

        2013 Old JIRA cleanup

        Erick Erickson made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Erick Erickson made changes -
        Resolution Won't Fix [ 2 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Transition            Time In Source Status   Execution Times   Last Executer    Last Execution Date
        Open -> Resolved      1642d 21h 57m           1                 Erick Erickson   30/Nov/13 13:24
        Resolved -> Reopened  46m 33s                 1                 Erick Erickson   30/Nov/13 14:10

          People

          • Assignee: Unassigned
          • Reporter: Sam Michael
          • Votes: 0
          • Watchers: 3
