Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.9.4, 3.0.3, 3.1, 3.2, 3.3, 3.4, 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If 2-gram is used and the length of query string is 4, for example q="ABCD", QueryParser generates (when autoGeneratePhraseQueries is true) PhraseQuery("AB BC CD") with slop 0. But it can be optimized PhraseQuery("AB CD") with appropriate positions.

      The idea came from the Japanese paper "N.M-gram: Implementation of Inverted Index Using N-gram with Hash Values" by Mikio Hirabayashi, et al. (The main theme of the paper is different from the idea that I'm using here, though)

      1. LUCENE-3426.patch
        2 kB
        Koji Sekiguchi
      2. LUCENE-3426.patch
        4 kB
        Koji Sekiguchi
      3. PerfTest.java
        5 kB
        Koji Sekiguchi
      4. LUCENE-3426.patch
        7 kB
        Koji Sekiguchi
      5. LUCENE-3426.patch
        7 kB
        Koji Sekiguchi
      6. PerfTest.java
        5 kB
        Koji Sekiguchi
      7. LUCENE-3426.patch
        8 kB
        Koji Sekiguchi
      8. LUCENE-3426.patch
        8 kB
        Koji Sekiguchi

        Activity

        Hide
        Koji Sekiguchi added a comment -

        First draft. The patch doesn't include test yet.

        I added optimizeForNgram() methods in the patch that will remove "redundant" terms and positions.

        Optimizing PhraseQuery will change score.

        Show
        Koji Sekiguchi added a comment - First draft. The patch doesn't include test yet. I added optimizeForNgram() methods in the patch that will remove "redundant" terms and positions. Optimizing PhraseQuery will change score.
        Hide
        Koji Sekiguchi added a comment -

        I added test code.

        Show
        Koji Sekiguchi added a comment - I added test code.
        Hide
        Koji Sekiguchi added a comment -

        The result of speed up is:

        n-gram/query length normal(ms) optimized(ms) speed up
        bi-gram/4 21,766 16,641 30.8%
        bi-gram/6 29,865 21,518 38.8%
        tri-gram/5 8,188 6,140 33.4%
        tri-gram/6 9,001 5,925 51.9%
        Show
        Koji Sekiguchi added a comment - The result of speed up is: n-gram/query length normal(ms) optimized(ms) speed up bi-gram/4 21,766 16,641 30.8% bi-gram/6 29,865 21,518 38.8% tri-gram/5 8,188 6,140 33.4% tri-gram/6 9,001 5,925 51.9%
        Hide
        Robert Muir added a comment -

        Hi Koji, I wonder if instead it would be cleaner as a subclass of PhraseQuery (NGramPhraseQuery or similar),
        that rewrites to the (possibly optimized) PhraseQuery in rewrite(). For example, it would build an optimized
        PhraseQuery when slop = 0, and there are enough terms to optimize, otherwise it would build a "normal" phrasequery.

        Then the optimization would be easy to apply, the user just uses NGramPhraseQuery instead of PhraseQuery.
        for example, from QueryParser:

          @Override
          protected PhraseQuery newPhraseQuery() {
            return new NGramPhraseQuery();
          }
        
        Show
        Robert Muir added a comment - Hi Koji, I wonder if instead it would be cleaner as a subclass of PhraseQuery (NGramPhraseQuery or similar), that rewrites to the (possibly optimized) PhraseQuery in rewrite(). For example, it would build an optimized PhraseQuery when slop = 0, and there are enough terms to optimize, otherwise it would build a "normal" phrasequery. Then the optimization would be easy to apply, the user just uses NGramPhraseQuery instead of PhraseQuery. for example, from QueryParser: @Override protected PhraseQuery newPhraseQuery() { return new NGramPhraseQuery(); }
        Hide
        Koji Sekiguchi added a comment -

        I like the idea of introducing the newly created class! Here is the new patch.

        Show
        Koji Sekiguchi added a comment - I like the idea of introducing the newly created class! Here is the new patch.
        Hide
        Robert Muir added a comment -

        I think I like it better too... though I wonder if its possible to keep the original NGramPhraseQuery unmodified?
        this way its not changed by Query.rewrite(), and if a user reuses the query (which we document they can do), they could then call add() again and everything works.

        Also, somewhat related to the issue might be SOLR-2660. We don't have to commit that patch, but we could separate
        out the queryparser refactoring to make it easier for such an optimization to be "automatic" in solr, because it allows
        SolrQueryParser to delegate creation of Phrase/MultiPhraseQuery to the FieldType.

        Show
        Robert Muir added a comment - I think I like it better too... though I wonder if its possible to keep the original NGramPhraseQuery unmodified? this way its not changed by Query.rewrite(), and if a user reuses the query (which we document they can do), they could then call add() again and everything works. Also, somewhat related to the issue might be SOLR-2660 . We don't have to commit that patch, but we could separate out the queryparser refactoring to make it easier for such an optimization to be "automatic" in solr, because it allows SolrQueryParser to delegate creation of Phrase/MultiPhraseQuery to the FieldType.
        Hide
        Koji Sekiguchi added a comment -

        I think I like it better too... though I wonder if its possible to keep the original NGramPhraseQuery unmodified?
        this way its not changed by Query.rewrite(), and if a user reuses the query (which we document they can do), they could then call add() again and everything works.

        I wonder it that too. Here is the new patch. This time I added assertSame()/NotSame() to check the rewritten Query to test code.

        Show
        Koji Sekiguchi added a comment - I think I like it better too... though I wonder if its possible to keep the original NGramPhraseQuery unmodified? this way its not changed by Query.rewrite(), and if a user reuses the query (which we document they can do), they could then call add() again and everything works. I wonder it that too. Here is the new patch. This time I added assertSame()/NotSame() to check the rewritten Query to test code.
        Hide
        Koji Sekiguchi added a comment - - edited

        For automatic in Solr, I wonder if we could move the feature to n-gram tokenizers, and we could have something like:

        <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"
                   autoGeneratePhraseQueries="true">
          <analyzer type="index">
            <tokenizer class="solr.CJKTokenizerFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.CJKTokenizerFactory" optimizePhraseQuery="true"/>
          </analyzer>
        </fieldType>
        
        Show
        Koji Sekiguchi added a comment - - edited For automatic in Solr, I wonder if we could move the feature to n-gram tokenizers, and we could have something like: <fieldType name= "text_cjk" class= "solr.TextField" positionIncrementGap= "100" autoGeneratePhraseQueries= " true " > <analyzer type= "index" > <tokenizer class= "solr.CJKTokenizerFactory" /> </analyzer> <analyzer type= "query" > <tokenizer class= "solr.CJKTokenizerFactory" optimizePhraseQuery= " true " /> </analyzer> </fieldType>
        Hide
        Robert Muir added a comment -

        Well if we apply the refactoring part of SOLR-2660 (we can split out into a separate issue), we could add such a thing as an attribute to the fieldType?

        I like the way your patch looks now! A couple more questions:

        • doesn't the optimization also apply to MultiPhraseQuery? If so, NGramPhraseQuery could extend MultiPhraseQuery and just rewrite to the correct one (MultiPhrase or Phrase depending upon the situation after optimization)
        • what about hashCode/equals? Although the same results will be returned, scoring will differ, maybe it NGramPhraseQuery should implement these?
        Show
        Robert Muir added a comment - Well if we apply the refactoring part of SOLR-2660 (we can split out into a separate issue), we could add such a thing as an attribute to the fieldType? I like the way your patch looks now! A couple more questions: doesn't the optimization also apply to MultiPhraseQuery? If so, NGramPhraseQuery could extend MultiPhraseQuery and just rewrite to the correct one (MultiPhrase or Phrase depending upon the situation after optimization) what about hashCode/equals? Although the same results will be returned, scoring will differ, maybe it NGramPhraseQuery should implement these?
        Hide
        Koji Sekiguchi added a comment -

        I'm not sure it could apply MutiPhraseQuery. Let me take more time.

        Considering hashCode/equals is good point. I'll see.

        Show
        Koji Sekiguchi added a comment - I'm not sure it could apply MutiPhraseQuery. Let me take more time. Considering hashCode/equals is good point. I'll see.
        Hide
        Koji Sekiguchi added a comment -

        New patch. I added equals/hashCode in the patch.

        I think it is too complex to apply optimization to MultiPhraseQuery, so I'd like to stick with PhraseQuery in the patch.

        Show
        Koji Sekiguchi added a comment - New patch. I added equals/hashCode in the patch. I think it is too complex to apply optimization to MultiPhraseQuery, so I'd like to stick with PhraseQuery in the patch.
        Hide
        Robert Muir added a comment -

        I think I agree Koji, the patch looks good.

        Though we should be able to keep PhraseQuery's internal members 'private' since NGramPhraseQuery now uses getter methods to access positions, terms, slop ?

        Show
        Robert Muir added a comment - I think I agree Koji, the patch looks good. Though we should be able to keep PhraseQuery's internal members 'private' since NGramPhraseQuery now uses getter methods to access positions, terms, slop ?
        Hide
        Michael McCandless added a comment -

        Patch looks great! What a nice opto

        Show
        Michael McCandless added a comment - Patch looks great! What a nice opto
        Hide
        Koji Sekiguchi added a comment -

        Thank you for your continuous review the patch, Robert! Here is the new patch. Now I don't touch the existing PhraseQuery as I use getter methods.

        I think this is ready to commit.

        Show
        Koji Sekiguchi added a comment - Thank you for your continuous review the patch, Robert! Here is the new patch. Now I don't touch the existing PhraseQuery as I use getter methods. I think this is ready to commit.
        Hide
        Koji Sekiguchi added a comment -

        trunk: Committed revision 1170586.
        3x: Committed revision 1170593.

        Show
        Koji Sekiguchi added a comment - trunk: Committed revision 1170586. 3x: Committed revision 1170593.
        Hide
        Uwe Schindler added a comment -

        Bulk close after release of 3.5

        Show
        Uwe Schindler added a comment - Bulk close after release of 3.5
        Hide
        Bill Bell added a comment -

        Is this automatic in SOLR? Or do we need to add a feature to support his in SOLR?

        Show
        Bill Bell added a comment - Is this automatic in SOLR? Or do we need to add a feature to support his in SOLR?
        Hide
        Koji Sekiguchi added a comment -

        Is this automatic in SOLR?

        No. I've opened SOLR-3055.

        Show
        Koji Sekiguchi added a comment - Is this automatic in SOLR? No. I've opened SOLR-3055 .

          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development