SOLR-1982

Leading wildcard queries work for "all" fields if ReversedWildcardFilterFactory is used for "any" field

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4, 1.4.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      As noted on the mailing list...

      http://search.lucidimagination.com/search/document/8064e6877f49e4c4/leading_wildcard_query_strangeness

      ...SolrQueryParser supports leading wildcard queries for any field as long as at least one field type exists in the schema.xml which uses ReversedWildcardFilterFactory – even if that field type is never used.

      This is extremely confusing, and most likely indicates a bug in how SolrQueryParser deals with ReversedWildcardFilterFactory.

          Activity

          Hoss Man added a comment -

          The behavior comes from the fact that during initialization, SolrQueryParser.checkAllowLeadingWildcards calls setAllowLeadingWildcard(true) if any field type uses ReversedWildcardFilterFactory.

          Then in getWildcardQuery, it checks the specific field type before calling ReverseStringFilter.reverse, but then, regardless of field type, delegates to super.getWildcardQuery, which will allow the leading wildcard for all fields based on the previous call to setAllowLeadingWildcard(true).

          I'm not really sure what the intention was for fields that don't use ReversedWildcardFilterFactory, but the current behavior makes no sense at all. Either leading wildcards should only be allowed for field types that use ReversedWildcardFilterFactory, or the QParser should have a config option to control it for other fields – but as it stands it makes no sense whatsoever.
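
To make the interaction concrete, here is a minimal toy model of the two steps described above. It is not the Solr source: ToyQueryParser, its fields, and the string-based "queries" are hypothetical stand-ins, and the reversal is simplified to a plain string reverse.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the behavior described in this comment; NOT the Solr source.
// ToyQueryParser and all member names are hypothetical.
class ToyQueryParser {
    private final Map<String, String> fieldToType;  // field name -> field type name
    private final Set<String> reversedTypes;        // types that reverse their tokens
    private final boolean allowLeadingWildcard;     // parser-wide flag

    ToyQueryParser(Map<String, String> fieldToType, Set<String> reversedTypes) {
        this.fieldToType = new HashMap<>(fieldToType);
        this.reversedTypes = new HashSet<>(reversedTypes);
        // Step 1: the flag flips as soon as ANY declared type reverses tokens,
        // whether or not any field actually uses that type.
        this.allowLeadingWildcard = !this.reversedTypes.isEmpty();
    }

    private static boolean hasLeadingWildcard(String term) {
        return term.startsWith("*") || term.startsWith("?");
    }

    String getWildcardQuery(String field, String term) {
        String effectiveTerm = term;
        // Step 2a: the per-field check only decides whether to reverse the term.
        if (hasLeadingWildcard(term) && reversedTypes.contains(fieldToType.get(field))) {
            effectiveTerm = new StringBuilder(term).reverse().toString();
        }
        // Step 2b: every field, reversed or not, goes through the base check.
        return baseGetWildcardQuery(field, effectiveTerm);
    }

    // Stand-in for the superclass method, which only consults the parser-wide flag.
    private String baseGetWildcardQuery(String field, String term) {
        if (hasLeadingWildcard(term) && !allowLeadingWildcard) {
            throw new IllegalArgumentException("leading wildcard not allowed: " + term);
        }
        return field + ":" + term;
    }
}
```

In this model, a schema that declares a reversed type but never assigns it to a field still accepts title:*foo on an ordinary field, while removing the unused type makes the same query fail.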

          David Smiley added a comment -

          A perhaps unexpected side-effect of this bug is that people tell me fieldname:* works as expected for them (without knowing what ReversedWildcardFilterFactory is; it just happens to be defined in the schema somewhere). They didn't know it's actually not supposed to work, and that they should have done fieldname:[* TO *]. I can't blame them for thinking what they did should work; I agree with them. But it works for the wrong reasons, as explained in this bug report. I deal a lot with wildcards so I'm intimately familiar with the issues involved.

          I think Hoss is on to the right solution. Always enable leading wildcards for Lucene's query parser, and then (here's my suggestion) getWildCardQuery() can let a simple '*' through as equivalent to a [* TO *] range query. If it's not a simple '*' the existing logic is mostly fine, though it should throw an error to prevent a leading wildcard when ReversedWildcardFilterFactory isn't used.
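
The proposal in the previous paragraph could be sketched roughly as follows. This is only an illustration under stated assumptions: ProposedWildcardCheck and the fieldIsReversed parameter are hypothetical, queries are modeled as strings, and the real parser would build query objects instead.

```java
// Hypothetical sketch of the proposed per-field check; not actual Solr code.
class ProposedWildcardCheck {

    static String getWildcardQuery(String field, String term, boolean fieldIsReversed) {
        // A bare '*' is always let through, treated like an open-ended range.
        if (term.equals("*")) {
            return field + ":[* TO *]";
        }
        boolean leading = term.startsWith("*") || term.startsWith("?");
        if (leading) {
            if (!fieldIsReversed) {
                // The field type cannot serve a leading wildcard efficiently: reject it.
                throw new IllegalArgumentException(
                    "leading wildcard not supported on field '" + field + "'");
            }
            // Reversed field: turn the leading wildcard into a trailing one
            // (simplified here to a plain string reverse).
            return field + ":" + new StringBuilder(term).reverse();
        }
        // No leading wildcard: existing logic applies unchanged.
        return field + ":" + term;
    }
}
```

The point of the design is that the decision is made per field rather than through a single parser-wide flag.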

          Robert Muir added a comment -

          and then (here's my suggestion) getWildCardQuery() can let a simple '*' through as equivalent to a [* TO *] range query.

          Curious, what is the reasoning here? In trunk, a wildcard query already "rewrites" to just passing through the underlying TermsEnum in this case (as the DFA is total). So Solr doesn't need to do anything here.

          David Smiley added a comment - edited

          It's true that the existing code path already supports a plain '*'. What I meant to say was that when adding the code to throw an error for a leading wildcard on a field that doesn't use ReversedWildcardFilterFactory, a plain '*' should still be supported anyway. Sorry for any confusion.

          Yonik Seeley added a comment -

          I ran into this myself while looking at the query parser code again. We introspect all of the schema field types every time a query parser is created, which is really not ideal for performance. Perhaps this should be cached in the schema, or at

          Zero-length prefix queries have pretty much always been allowed in Solr, and leading wildcard queries have been effectively allowed with the default example schema since '09. Seems like when we get around to fixing this stuff, permissive should be the default, but should somehow be overridable (or maybe we can punt to a higher-level parser like edismax to handle per-field overrides).
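
One way to read the "cached in the schema" idea is to compute the answer once per schema and let each new parser reuse it. The sketch below assumes a hypothetical ToySchema that holds the names of the filter factories its field types use; none of these names come from Solr itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of caching the schema scan; not actual Solr code.
class ToySchema {
    private final List<String> filterFactoriesInUse;
    private volatile Boolean hasReversedType;  // null until first computed
    int scans = 0;                             // counts full scans, for illustration only

    ToySchema(List<String> filterFactoriesInUse) {
        this.filterFactoriesInUse = new ArrayList<>(filterFactoriesInUse);
    }

    // Scans the field types at most once in the single-threaded case. A race
    // between threads could scan twice, but both would compute the same value.
    boolean hasReversedWildcardType() {
        Boolean cached = hasReversedType;
        if (cached == null) {
            scans++;
            cached = filterFactoriesInUse.contains("ReversedWildcardFilterFactory");
            hasReversedType = cached;
        }
        return cached;
    }
}
```

A parser constructor would then call hasReversedWildcardType() instead of walking every field type itself.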

          David Smiley added a comment -

          It appears that this issue has been addressed, probably by inadvertent side-effect of something else. It doesn't matter any more if ReversedWildcardFilterFactory is in the schema for any field or not; field:* now works. I tried this on 4.1 just now. I know this used to be a problem back in Solr 3 when I wrote about it in my book.

          David Smiley added a comment -

          I should note that this syntax may seem more intuitive, but it is a different code-path that sidesteps the smarts that a field type might have. For example timestamp:* is much slower than timestamp:[* TO *] assuming a precision step was used. Arguably this is a bug from a user perspective, who doesn't know/care about an implementation detail like that.
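
Until the two syntaxes behave the same, a client could rewrite the bare form before sending the query. The helper below is a rough sketch of that idea: ExistenceQueryRewriter is a hypothetical name, and the regular expression is deliberately naive (it does not understand quoted phrases or escaped characters).

```java
import java.util.regex.Pattern;

// Hypothetical client-side helper; not part of Solr.
class ExistenceQueryRewriter {
    // Matches field:* only when the '*' is the entire term, i.e. it is
    // followed by whitespace, a closing parenthesis, or the end of the input.
    private static final Pattern BARE_STAR =
        Pattern.compile("(\\w+):\\*(?=\\s|\\)|$)");

    static String rewrite(String query) {
        // $1 is the captured field name; the rest of the replacement is literal.
        return BARE_STAR.matcher(query).replaceAll("$1:[* TO *]");
    }
}
```

Terms such as name:*foo are left alone because the '*' is followed by more term text.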


            People

            • Assignee: Unassigned
            • Reporter: Hoss Man
            • Votes: 2
            • Watchers: 2
