Solr
  1. Solr
  2. SOLR-1321

Support for efficient leading wildcards search

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      This patch is an implementation of the "reversed tokens" strategy for efficient leading wildcards queries.

      ReversedWildcardsTokenFilter reverses tokens and returns both the original token (optional) and the reversed token (with positionIncrement == 0). Reversed tokens are prepended with a marker character to avoid collisions between legitimate tokens and the reversed tokens - e.g. "DNA" would become "and", thus colliding with the regular term "and", but with the marker character it becomes "\u0001and".

      This TokenFilter can be added to the analyzer chain that it used during indexing.

      SolrQueryParser has been modified to detect the presence of such fields in the current schema, and treat them in a special way. First, SolrQueryParser examines the schema and collects a map of fields where these reversed tokens are indexed. If there is at least one such field, it also sets QueryParser.setAllowLeadingWildcards(true). When building a wildcard query (in getWildcardQuery) the term text may be optionally reversed to put wildcards further along the term text. This happens when the field uses the reversing filter during indexing (as detected above), AND if the wildcard characters are either at 0-th or 1-st position in the term. Otherwise the term text is processed as before, i.e. turned into a regular wildcard query.

      Unit tests are provided to test the TokenFilter and the query parsing.

      1. wildcards-3.patch
        20 kB
        Andrzej Bialecki
      2. wildcards-2.patch
        19 kB
        Andrzej Bialecki
      3. wildcards.patch
        15 kB
        Andrzej Bialecki
      4. SOLR-1321.patch
        22 kB
        Grant Ingersoll
      5. SOLR-1321.patch
        23 kB
        Grant Ingersoll
      6. SOLR-1321.patch
        23 kB
        Robert Muir

        Activity

        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Grant Ingersoll added a comment -

        Bulk close Solr 1.4 issues

        Show
        Grant Ingersoll added a comment - Bulk close Solr 1.4 issues
        Hide
        Ravi Kiran Bhaskar added a comment -

        Is there a way to get only the actual instead of both actual and reversed texts (I see that we can just get the reserved without the actual. Is the other way round possible ?).

        The reason being, I tried to facet the field and it returned be both as follows

        <int name="#1;ssergnoC">78</int>
        <int name="Congress">78</int>
        <int name="#1;.W egroeG ,hsuB">28</int>
        <int name="Bush, George W.">28</int>
        <int name="#1;detinU">263</int>
        <int name="United">263</int>

        Show
        Ravi Kiran Bhaskar added a comment - Is there a way to get only the actual instead of both actual and reversed texts (I see that we can just get the reserved without the actual. Is the other way round possible ?). The reason being, I tried to facet the field and it returned be both as follows <int name="#1;ssergnoC">78</int> <int name="Congress">78</int> <int name="#1;.W egroeG ,hsuB">28</int> <int name="Bush, George W.">28</int> <int name="#1;detinU">263</int> <int name="United">263</int>
        Hide
        Ravi Kiran Bhaskar added a comment -

        Thanks Yonik for the very prompt reply. Yes, you were right. The lucene related jars had a 'dev' suffix previously and now the 'dev' was removed and hence my eclipse had jar resolution problems while building. I tested it and works great as always. BTW you folks did a wonderful job with solr

        Show
        Ravi Kiran Bhaskar added a comment - Thanks Yonik for the very prompt reply. Yes, you were right. The lucene related jars had a 'dev' suffix previously and now the 'dev' was removed and hence my eclipse had jar resolution problems while building. I tested it and works great as always. BTW you folks did a wonderful job with solr
        Hide
        Yonik Seeley added a comment -

        Works for me in the latest trunk - perhaps you have some old class files laying around somewhere? That's normally the case with something like AbstractMethodError.

        Show
        Yonik Seeley added a comment - Works for me in the latest trunk - perhaps you have some old class files laying around somewhere? That's normally the case with something like AbstractMethodError.
        Hide
        Ravi Kiran Bhaskar added a comment -

        While using ReversedWildcardFilterFactory with KeywordTokenizerFactory I get the following error for the fieldType

        <fieldType name="keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
        <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        </fieldType>

        ERROR
        --------
        HTTP Status 500 - org.apache.solr.analysis.WhitespaceTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; java.lang.AbstractMethodError: org.apache.solr.analysis.WhitespaceTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; at org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69) at org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74) at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:364) at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:543) at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:153) at org.apache.solr.util.SolrPluginUtils$DisjunctionMaxQueryParser.getFieldQuery(SolrPluginUtils.java:807) at org.apache.solr.util.SolrPluginUtils$DisjunctionMaxQueryParser.getFieldQuery(SolrPluginUtils.java:794) at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1425) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1241) at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1230) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:176) at org.apache.solr.search.DisMaxQParser.getUserQuery(DisMaxQParser.java:195) at org.apache.solr.search.DisMaxQParser.addMainQuery(DisMaxQParser.java:158) at org.apache.solr.search.DisMaxQParser.parse(DisMaxQParser.java:74) at org.apache.solr.search.QParser.getQuery(QParser.java:131) at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:198) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:297) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:271) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:202) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:632) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:577) at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:94) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:206) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:632) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:577) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:571) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:1080) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:150) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:632) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:577) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:571) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:1080) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:272) at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.invokeAdapter(DefaultProcessorTask.java:637) at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.doProcess(DefaultProcessorTask.java:568) at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.process(DefaultProcessorTask.java:813) at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.executeProcessorTask(DefaultReadTask.java:341) at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.doTask(DefaultReadTask.java:263) at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.doTask(DefaultReadTask.java:214) at com.sun.enterprise.web.connector.grizzly.TaskBase.run(TaskBase.java:265) at com.sun.enterprise.web.connector.grizzly.ssl.SSLWorkerThread.run(SSLWorkerThread.java:106)

        Show
        Ravi Kiran Bhaskar added a comment - While using ReversedWildcardFilterFactory with KeywordTokenizerFactory I get the following error for the fieldType <fieldType name="keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.TrimFilterFactory" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" /> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.TrimFilterFactory" /> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" /> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType> ERROR -------- HTTP Status 500 - org.apache.solr.analysis.WhitespaceTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; java.lang.AbstractMethodError: org.apache.solr.analysis.WhitespaceTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; at org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69) at org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74) at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:364) at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:543) at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:153) at org.apache.solr.util.SolrPluginUtils$DisjunctionMaxQueryParser.getFieldQuery(SolrPluginUtils.java:807) at org.apache.solr.util.SolrPluginUtils$DisjunctionMaxQueryParser.getFieldQuery(SolrPluginUtils.java:794) at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1425) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1241) at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1230) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:176) at org.apache.solr.search.DisMaxQParser.getUserQuery(DisMaxQParser.java:195) at org.apache.solr.search.DisMaxQParser.addMainQuery(DisMaxQParser.java:158) at org.apache.solr.search.DisMaxQParser.parse(DisMaxQParser.java:74) at org.apache.solr.search.QParser.getQuery(QParser.java:131) at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:198) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:297) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:271) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:202) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:632) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:577) at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:94) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:206) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:632) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:577) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:571) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:1080) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:150) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:632) at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:577) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:571) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:1080) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:272) at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.invokeAdapter(DefaultProcessorTask.java:637) at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.doProcess(DefaultProcessorTask.java:568) at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.process(DefaultProcessorTask.java:813) at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.executeProcessorTask(DefaultReadTask.java:341) at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.doTask(DefaultReadTask.java:263) at com.sun.enterprise.web.connector.grizzly.DefaultReadTask.doTask(DefaultReadTask.java:214) at com.sun.enterprise.web.connector.grizzly.TaskBase.run(TaskBase.java:265) at com.sun.enterprise.web.connector.grizzly.ssl.SSLWorkerThread.run(SSLWorkerThread.java:106)
        Grant Ingersoll made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Grant Ingersoll added a comment -

        Committed revision 813830.

        Show
        Grant Ingersoll added a comment - Committed revision 813830.
        Hide
        Grant Ingersoll added a comment -

        Only annoying thing about this solution is now that * check gets done twice, once in the SolrQueryParser method and once in the QueryParser method.

        Also, why not just && the two clauses (I realize it is a cut and paste from the parent).

        I'll fix and commit.

        Show
        Grant Ingersoll added a comment - Only annoying thing about this solution is now that * check gets done twice, once in the SolrQueryParser method and once in the QueryParser method. Also, why not just && the two clauses (I realize it is a cut and paste from the parent). I'll fix and commit.
        Robert Muir made changes -
        Attachment SOLR-1321.patch [ 12419264 ]
        Hide
        Robert Muir added a comment -

        grant's patch, but with the logic to handle matchalldocs

        Show
        Robert Muir added a comment - grant's patch, but with the logic to handle matchalldocs
        Hide
        Yonik Seeley added a comment -

        btw shouldnt the qp be looking at .getQueryAnalyzer() versus .getAnalyzer()?

        The QueryParser should be using the query analyzer in general, but in this case, it's looking at the index analyzer to see if it indexed reverse tokens.

        Show
        Yonik Seeley added a comment - btw shouldnt the qp be looking at .getQueryAnalyzer() versus .getAnalyzer()? The QueryParser should be using the query analyzer in general, but in this case, it's looking at the index analyzer to see if it indexed reverse tokens.
        Hide
        Grant Ingersoll added a comment -

        Problem seems to stem from : queries.

        Show
        Grant Ingersoll added a comment - Problem seems to stem from : queries.
        Hide
        Robert Muir added a comment - - edited

        Hmm, anyone else having troubles with the other tests?

        yes, I can confirm this... btw shouldnt the qp be looking at .getQueryAnalyzer() versus .getAnalyzer()?

        nevermind, since you dont want to reverse at query time...

        Show
        Robert Muir added a comment - - edited Hmm, anyone else having troubles with the other tests? yes, I can confirm this... btw shouldnt the qp be looking at .getQueryAnalyzer() versus .getAnalyzer()? nevermind, since you dont want to reverse at query time...
        Grant Ingersoll made changes -
        Attachment SOLR-1321.patch [ 12419230 ]
        Hide
        Grant Ingersoll added a comment -

        Notwithstanding the fact that other tests that use the QP fail with this patch (and the old one), here's a patch that uses char[] instead of Strings.

        Show
        Grant Ingersoll added a comment - Notwithstanding the fact that other tests that use the QP fail with this patch (and the old one), here's a patch that uses char[] instead of Strings.
        Hide
        Grant Ingersoll added a comment -

        Hmm, anyone else having troubles with the other tests? I don't think this is handling dynamic fields properly.

        Show
        Grant Ingersoll added a comment - Hmm, anyone else having troubles with the other tests? I don't think this is handling dynamic fields properly.
        Hide
        Robert Muir added a comment -

        one comment, i think you should be able to use char[] versus String here pretty easily. it might have some impact on indexing speed.

        but this could be a later improvement.

        Show
        Robert Muir added a comment - one comment, i think you should be able to use char[] versus String here pretty easily. it might have some impact on indexing speed. but this could be a later improvement.
        Hide
        Grant Ingersoll added a comment -

        OK, Andrzej. That makes sense. We may want to push down into Lucene eventually. The only outcome of this is that people who are expecting "efficient" wildcard searches on fields that don't use the Reverse stuff would see slower perf.

        I feel comfortable committing this patch now.

        Show
        Grant Ingersoll added a comment - OK, Andrzej. That makes sense. We may want to push down into Lucene eventually. The only outcome of this is that people who are expecting "efficient" wildcard searches on fields that don't use the Reverse stuff would see slower perf. I feel comfortable committing this patch now.
        Hide
        Yonik Seeley added a comment -

        I've always been in favor of just allowing leading wildcards for all fields, but I think others disagreed on that point.

        Show
        Yonik Seeley added a comment - I've always been in favor of just allowing leading wildcards for all fields, but I think others disagreed on that point.
        Hide
        Andrzej Bialecki added a comment -

        This comment refers to the limitation of Lucene's QueryParser - there is only a single flag there to decide whether it accepts leading wildcards or not, regardless of field. Consequently, after checking the schema in SolrQueryParser we turn on this flag if any field type supports leading wildcards. The end effect of this is that parsers for any field, which are created with IndexSchema.getSolrQueryParser(), will return true if any field type supports leading wildcards, not neccessarily the one for which the parser was created..

        I don't see a way to fix this. I can clarify the comment, though, so that it's clear that this is a limitation in Lucene QueryParser.

        Show
        Andrzej Bialecki added a comment - This comment refers to the limitation of Lucene's QueryParser - there is only a single flag there to decide whether it accepts leading wildcards or not, regardless of field. Consequently, after checking the schema in SolrQueryParser we turn on this flag if any field type supports leading wildcards. The end effect of this is that parsers for any field, which are created with IndexSchema.getSolrQueryParser(), will return true if any field type supports leading wildcards, not neccessarily the one for which the parser was created.. I don't see a way to fix this. I can clarify the comment, though, so that it's clear that this is a limitation in Lucene QueryParser.
        Grant Ingersoll made changes -
        Attachment SOLR-1321.patch [ 12419178 ]
        Hide
        Grant Ingersoll added a comment -

        Added ASL headers.

        I don't understand, in the Test, the comment:

        // XXX note: this should be false, but for now we return true for any field,
        // XXX if at least one field uses the reversing
        assertTrue(parserThree.getAllowLeadingWildcard());

        Seems like this needs to be fixed before committing.

        Show
        Grant Ingersoll added a comment - Added ASL headers. I don't understand, in the Test, the comment: // XXX note: this should be false, but for now we return true for any field, // XXX if at least one field uses the reversing assertTrue(parserThree.getAllowLeadingWildcard()); Seems like this needs to be fixed before committing.
        Andrzej Bialecki made changes -
        Attachment wildcards-3.patch [ 12419095 ]
        Hide
        Andrzej Bialecki added a comment -

        Updated patch that uses new TokenAttribute API and uses (as much as possible) the new ReverseStringFilter.

        Show
        Andrzej Bialecki added a comment - Updated patch that uses new TokenAttribute API and uses (as much as possible) the new ReverseStringFilter.
        Hide
        Grant Ingersoll added a comment -

        I can take care of the libs.

        Show
        Grant Ingersoll added a comment - I can take care of the libs.
        Hide
        Andrzej Bialecki added a comment -

        I'll update the patch, assuming the presence of the updated filter in Lucene, but I'd rather leave updating the libs to someone more intimate with Solr internals ...

        Show
        Andrzej Bialecki added a comment - I'll update the patch, assuming the presence of the updated filter in Lucene, but I'd rather leave updating the libs to someone more intimate with Solr internals ...
        Hide
        Grant Ingersoll added a comment -

        So, now that this is in Lucene, we likely just need to update the Factory in Solr, right? As well as the Lucene libs, if they haven't been already.

        Show
        Grant Ingersoll added a comment - So, now that this is in Lucene, we likely just need to update the Factory in Solr, right? As well as the Lucene libs, if they haven't been already.
        Hide
        Robert Muir added a comment -

        btw, i found apache harmony has a nice impl of in-place reversing that is unicode safe (AbstractStringBuilder.reverse0)
        it will treat surrogate pairs as one character for the op just like java 5 StringBuilder.reverse()

        since lucene wildcard does not properly handle these, it would probably be bad to be "unicode correct" at this point.
        but perhaps a TODO or comment is desirable, as I hope to improve this situation in the future!

        Show
        Robert Muir added a comment - btw, i found apache harmony has a nice impl of in-place reversing that is unicode safe (AbstractStringBuilder.reverse0) it will treat surrogate pairs as one character for the op just like java 5 StringBuilder.reverse() since lucene wildcard does not properly handle these, it would probably be bad to be "unicode correct" at this point. but perhaps a TODO or comment is desirable, as I hope to improve this situation in the future!
        Hide
        Yonik Seeley added a comment -

        Use resizeTermBuffer() first... and then do it in place. The buffer may already be big enough and so resizeTermBuffer would just be a check.

        Show
        Yonik Seeley added a comment - Use resizeTermBuffer() first... and then do it in place. The buffer may already be big enough and so resizeTermBuffer would just be a check.
        Hide
        Andrzej Bialecki added a comment -

        FWIW, it also seemed like the reverse code in ReverseStringFilter was faster than the patch, but I didn't compare quantitatively.

        It better be - it can reverse in-place, while we have to allocate a new buffer because of the marker char in front. That's what I meant by messy code - we would need both the in-place and the out-of-place method depending on an option.

        Show
        Andrzej Bialecki added a comment - FWIW, it also seemed like the reverse code in ReverseStringFilter was faster than the patch, but I didn't compare quantitatively. It better be - it can reverse in-place, while we have to allocate a new buffer because of the marker char in front. That's what I meant by messy code - we would need both the in-place and the out-of-place method depending on an option.
        Hide
        Grant Ingersoll added a comment -

        After adding the marker-related stuff the code in ReverseStringFilter won't be so nice as it is now

        Of course, it would be optional.

        Yeah, I can return char[] and convert to String only in QP

        Right, no need to materialize a String if we don't need to. FWIW, it also seemed like the reverse code in ReverseStringFilter was faster than the patch, but I didn't compare quantitatively.

        Show
        Grant Ingersoll added a comment - After adding the marker-related stuff the code in ReverseStringFilter won't be so nice as it is now Of course, it would be optional. Yeah, I can return char[] and convert to String only in QP Right, no need to materialize a String if we don't need to. FWIW, it also seemed like the reverse code in ReverseStringFilter was faster than the patch, but I didn't compare quantitatively.
        Hide
        Andrzej Bialecki added a comment -

        Since this is a new filter, we might as well use the new incrementToken capability and reusable stuff as well as avoiding other deprecated analysis calls.

        Indeed, I'll fix this.

        Also no need to do the string round trip in the reverse method, right? See the ReverseStringFilter in Lucene contrib/analysis.

        Roundtrip ... you mean the allocation of new char[] buffer, or conversion to String? I assume the latter - the former is needed because we add the marker char in front. Yeah, I can return char[] and convert to String only in QP.

        Perhaps we should just patch that and add some config options to it? Then all Solr would need is the QP change and the FilterFactory change, no?

        Hmm. After adding the marker-related stuff the code in ReverseStringFilter won't be so nice as it is now. I'd keep in mind the specific use case of this filter ...

        Show
        Andrzej Bialecki added a comment - Since this is a new filter, we might as well use the new incrementToken capability and reusable stuff as well as avoiding other deprecated analysis calls. Indeed, I'll fix this. Also no need to do the string round trip in the reverse method, right? See the ReverseStringFilter in Lucene contrib/analysis. Roundtrip ... you mean the allocation of new char[] buffer, or conversion to String? I assume the latter - the former is needed because we add the marker char in front. Yeah, I can return char[] and convert to String only in QP. Perhaps we should just patch that and add some config options to it? Then all Solr would need is the QP change and the FilterFactory change, no? Hmm. After adding the marker-related stuff the code in ReverseStringFilter won't be so nice as it is now. I'd keep in mind the specific use case of this filter ...
        Hide
        Robert Muir added a comment -

        i do have one comment on the reverse() present here: it is not unicode-safe (it will create unpaired surrogates).
        We need to think thru the implications of this, fyi i have a patch for lucene's ReverseStringFilter (similar problem) sitting in LUCENE-1689

        Show
        Robert Muir added a comment - i do have one comment on the reverse() present here: it is not unicode-safe (it will create unpaired surrogates). We need to think thru the implications of this, fyi i have a patch for lucene's ReverseStringFilter (similar problem) sitting in LUCENE-1689
        Hide
        Grant Ingersoll added a comment -

        Also no need to do the string round trip in the reverse method, right? See the ReverseStringFilter in Lucene contrib/analysis. Perhaps we should just patch that and add some config options to it? Then all Solr would need is the QP change and the FilterFactory change, no?

        Show
        Grant Ingersoll added a comment - Also no need to do the string round trip in the reverse method, right? See the ReverseStringFilter in Lucene contrib/analysis. Perhaps we should just patch that and add some config options to it? Then all Solr would need is the QP change and the FilterFactory change, no?
        Hide
        Grant Ingersoll added a comment -

        Since this is a new filter, we might as well use the new incrementToken capability and reusable stuff as well as avoiding other deprecated analysis calls.

        Show
        Grant Ingersoll added a comment - Since this is a new filter, we might as well use the new incrementToken capability and reusable stuff as well as avoiding other deprecated analysis calls.
        Grant Ingersoll made changes -
        Assignee Grant Ingersoll [ gsingers ]
        Hide
        Robert Muir added a comment -

        andrzej, thanks, I like this design.

        Show
        Robert Muir added a comment - andrzej, thanks, I like this design.
        Andrzej Bialecki made changes -
        Attachment wildcards-2.patch [ 12415416 ]
        Hide
        Andrzej Bialecki added a comment -

        Previous patch mistakenly included other stuff instead of ReversedWildcardFilterFactory.

        Show
        Andrzej Bialecki added a comment - Previous patch mistakenly included other stuff instead of ReversedWildcardFilterFactory.
        Andrzej Bialecki made changes -
        Attachment wildcards-2.patch [ 12415404 ]
        Hide
        Robert Muir added a comment -

        Andrzej, did you accidentally leave out ReversedWildcardsFilterFactory.java in the patch?

        Show
        Robert Muir added a comment - Andrzej, did you accidentally leave out ReversedWildcardsFilterFactory.java in the patch?
        Andrzej Bialecki made changes -
        Attachment wildcards-2.patch [ 12415404 ]
        Hide
        Andrzej Bialecki added a comment -

        Updated patch with more configurable knobs. See javadoc of ReversedWildcardsFilterFactory and unit tests.

        Show
        Andrzej Bialecki added a comment - Updated patch with more configurable knobs. See javadoc of ReversedWildcardsFilterFactory and unit tests.
        Hide
        Robert Muir added a comment -

        sounds perfect, great idea. Thanks!

        Show
        Robert Muir added a comment - sounds perfect, great idea. Thanks!
        Hide
        Andrzej Bialecki added a comment -

        Exactly, that's the reason to put this logic in an isolated well-defined place, with some configurable knobs. One parameter would be the max. position of the leading wildcard, another would be relative cost of ? and *, or whether we allow wildcards at any positions except the 0-th (pure suffix search).

        Show
        Andrzej Bialecki added a comment - Exactly, that's the reason to put this logic in an isolated well-defined place, with some configurable knobs. One parameter would be the max. position of the leading wildcard, another would be relative cost of ? and *, or whether we allow wildcards at any positions except the 0-th (pure suffix search).
        Hide
        Robert Muir added a comment -

        Andrzej, with the costs logic, you wouldn't have to limit the reversing to just 0th or 1st position wildcards either, right?

        for example if a term was 10 characters long with a wildcard in the 3rd position, reversing it is helpful.

        Show
        Robert Muir added a comment - Andrzej, with the costs logic, you wouldn't have to limit the reversing to just 0th or 1st position wildcards either, right? for example if a term was 10 characters long with a wildcard in the 3rd position, reversing it is helpful.
        Hide
        Andrzej Bialecki added a comment -

        I think your example of g?abcde* could be handled if we assigned different "costs" of expanding ? and *, the latter being more costly. There could be also a rule that prevents the reversing if a trailing "costly" wildcard is used.

        This quickly gets more and more complicated, so ultimately we may want to put this logic elsewhere, in a class that knows the best how to make such decisions (ReversedWildcardFilter ?). I'll try to modify the patch in this direction.

        Show
        Andrzej Bialecki added a comment - I think your example of g?abcde* could be handled if we assigned different "costs" of expanding ? and *, the latter being more costly. There could be also a rule that prevents the reversing if a trailing "costly" wildcard is used. This quickly gets more and more complicated, so ultimately we may want to put this logic elsewhere, in a class that knows the best how to make such decisions (ReversedWildcardFilter ?). I'll try to modify the patch in this direction.
        Hide
        Robert Muir added a comment - - edited

        andrzej i see what you are saying. I think its a great feature the way it is!

        In the future I will take a look at finding a way to do both, this way complex cases like *abcde?f get reversed by this feature into \u0001f?edcba*, 
        but then implemented with automaton so that it doesn't have to enumerate all tokens that start with \u0001f. 
        

        this is bad example hope you see what i mean. the biggest challenge would be preventing suboptimal cases, like reversing g?abcde* into \u2001*edcba?g, (at least I think).
        the first is actually more efficient, I think regardless of the wildcard impl.

        I wonder if in your patch you could have an additional check, if something is in the 1st position but the last character is also a wildcard, not to reverse it?
        in the example above even with the default lucene wildcard query, at least it would only enumerate the tokens starting with g, so its better not to reverse it.

        if its in the 0th position it doesnt matter if you reverse it or not but I think that one case can be optimized.

        Thanks,
        Robert

        Show
        Robert Muir added a comment - - edited andrzej i see what you are saying. I think its a great feature the way it is! In the future I will take a look at finding a way to do both, this way complex cases like *abcde?f get reversed by this feature into \u0001f?edcba*, but then implemented with automaton so that it doesn't have to enumerate all tokens that start with \u0001f. this is bad example hope you see what i mean. the biggest challenge would be preventing suboptimal cases, like reversing g?abcde* into \u2001*edcba?g, (at least I think). the first is actually more efficient, I think regardless of the wildcard impl. I wonder if in your patch you could have an additional check, if something is in the 1st position but the last character is also a wildcard, not to reverse it? in the example above even with the default lucene wildcard query, at least it would only enumerate the tokens starting with g, so its better not to reverse it. if its in the 0th position it doesnt matter if you reverse it or not but I think that one case can be optimized. Thanks, Robert
        Hide
        Andrzej Bialecki added a comment -

        If you follow the logic in getWildcardQuery, a field has to meet specific requirements for this reversal to occur - namely, it needs to declare in its indexing analysis chain that it uses ReversedWildcardFilter. This filter does very special kind of reversal (prepending the marker) so it's unlikely that anyone would use it for other purpose than to explicitly support leading wildcards. So for now I'd say that users should consciously choose between this method of supporting leading wildcards and the automaton wildcard query.

        Show
        Andrzej Bialecki added a comment - If you follow the logic in getWildcardQuery, a field has to meet specific requirements for this reversal to occur - namely, it needs to declare in its indexing analysis chain that it uses ReversedWildcardFilter. This filter does very special kind of reversal (prepending the marker) so it's unlikely that anyone would use it for other purpose than to explicitly support leading wildcards. So for now I'd say that users should consciously choose between this method of supporting leading wildcards and the automaton wildcard query.
        Hide
        Robert Muir added a comment -

        andrej, i really like this feature.

        one question though, the way you have overridden getWildcardQuery() i am not certain how I can have both this feature and say, override getWildCardQuery to produce an AutomatonWildcardQuery: LUCENE-1606
        what I want is both , because AutomatonWildCardQuery cannot help at all with leading *, (it can with leading ? for example).

        but don't let this really get in your way, just a side note.

        Show
        Robert Muir added a comment - andrej, i really like this feature. one question though, the way you have overridden getWildcardQuery() i am not certain how I can have both this feature and say, override getWildCardQuery to produce an AutomatonWildcardQuery: LUCENE-1606 what I want is both , because AutomatonWildCardQuery cannot help at all with leading *, (it can with leading ? for example). but don't let this really get in your way, just a side note.
        Andrzej Bialecki made changes -
        Field Original Value New Value
        Attachment wildcards.patch [ 12415125 ]
        Hide
        Andrzej Bialecki added a comment -

        Patch containing the new filter, example schema and unit tests.

        Show
        Andrzej Bialecki added a comment - Patch containing the new filter, example schema and unit tests.
        Andrzej Bialecki created issue -

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Andrzej Bialecki
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development