Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis, search
    • Labels: None

      Description

      Solr should use NGramPhraseQuery when searching with default slop on n-gram field.
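For background, the reduction that NGramPhraseQuery performs can be sketched in plain Java (an illustration only, not Lucene code): with a fixed gram size n, consecutive n-grams overlap by n-1 characters, so a phrase query over them only needs to match every n-th gram plus the final one, at the right positions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates the term reduction behind NGramPhraseQuery (not Lucene code).
// For a fixed gram size, a phrase of consecutive n-grams is equivalent to a
// sparser phrase that keeps only every n-th gram plus the last one, because
// adjacent grams overlap by (gramSize - 1) characters.
public class NGramPhraseSketch {
    // Returns the positions (indexes into the gram sequence) that the
    // optimized phrase query still needs to match.
    public static List<Integer> optimizedPositions(int gramCount, int gramSize) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < gramCount; i += gramSize) {
            kept.add(i);
        }
        int last = gramCount - 1;
        if (kept.get(kept.size() - 1) != last) {
            kept.add(last); // always keep the final gram to pin the phrase end
        }
        return kept;
    }

    public static void main(String[] args) {
        // "search" as bigrams: se(0) ea(1) ar(2) rc(3) ch(4)
        System.out.println(optimizedPositions(5, 2)); // [0, 2, 4]
    }
}
```

For the bigrams of "search" (se, ea, ar, rc, ch), the optimized query touches only se(0), ar(2), ch(4): three postings lists instead of five, which is where the speedup comes from.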

      Attachments

      1. schema.xml
        50 kB
        Tomoko Uchida
      2. SOLR-3055.patch
        5 kB
        Koji Sekiguchi
      3. SOLR-3055-1.patch
        7 kB
        Tomoko Uchida
      4. SOLR-3055-2.patch
        6 kB
        Tomoko Uchida
      5. solrconfig.xml
        72 kB
        Tomoko Uchida

        Activity

        Koji Sekiguchi added a comment -

        How about introducing something like GramSizeAttribute?

        I've attached a draft-level patch just to sketch the idea.

        Robert Muir added a comment -

        Hi Koji: as for the attribute+QP approach, I think it might not be the best way to go.

        For example, another way (a customization of phrase query) is in SOLR-2660.
        In that patch I added factory methods to QueryParser so you can override this,
        then hooks into Solr's field type.

        But with the attribute approach, what happens if I omit positions AND use n-grams?
        This is a totally reasonable thing to do: since positions are redundantly encoded
        in the n-gram term text, it makes sense that I might not index any positions at all
        and approximate my phrase queries with boolean AND.

        I think subclassing is a better approach, because otherwise how would we
        determine which runs first in the case of multiple conflicting attributes?

        In this case the consumer (e.g. Solr) is forced to decide, and it's more consistent
        with the way other queries are generated: getXXXQuery() etc.
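The omit-positions scenario raised here can be sketched in plain Java (an illustration only, not Lucene code): when each gram already encodes local character order, a phrase over n-grams can be approximated by a boolean AND over the grams, at the cost of occasional false positives.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrates approximating an n-gram phrase query with boolean AND when
// positions are not indexed (illustration only, not Lucene code). Each bigram
// encodes two characters of local order, so requiring all query grams to be
// present is a close, but not exact, substitute for a positional phrase match.
public class GramAndApprox {
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    // AND semantics: the document must contain every gram of the query.
    static boolean andMatch(String doc, String query) {
        return bigrams(doc).containsAll(bigrams(query));
    }

    public static void main(String[] args) {
        System.out.println(andMatch("elasticsearch", "search")); // true: real match
        System.out.println(andMatch("abcd", "xy"));              // false: gram missing
        System.out.println(andMatch("abba", "aba"));             // true: false positive
    }
}
```

The last call shows the trade-off: "abba" contains the grams "ab" and "ba" but not the substring "aba", so AND matching admits documents a true phrase query would reject.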

        Robert Muir added a comment -

        But an advantage of this patch's approach is that it would work when not all text is n-grammed, right?
        E.g. the case of CJKAnalyzer, where English does not form n-grams. I think this is important.

        Maybe there is some way to have the best of both...

        Tomoko Uchida added a comment -

        Hi,
        This issue seems to have been inactive for 3 years... I would like to rework it.

        I think this should be integrated into Lucene because:

        • NGramPhraseQuery is tightly related to NGramTokenizer, so it seems natural to couple them at the Lucene layer.
        • It would also let all Lucene-based search engines gain the performance improvement.
          https://issues.apache.org/jira/browse/LUCENE-3426

        So the patch (adding a new attribute to Lucene) looks good to me at first glance... more discussion is needed, of course.

        Would anyone approve? If so, I'd like to move the discussion to a (new) LUCENE issue.
        Any suggestions are appreciated.

        Thanks.

        Koji Sekiguchi added a comment -

        Thank you for paying attention to this ticket! It's fine with me if you start this in Lucene.

        Tomoko Uchida added a comment -

        Again, I think there are three strategies for implementation.

        1. embed gram size information in the TokenStream by adding a new attribute (the approach taken by the first patch)

        • Pros: fully integrated with Lucene, so applications do not have to write any additional code to optimize n-gram based phrase queries
        • Pros: no configuration is needed, because the query parser creates NGramPhraseQuery automatically
        • Pros: probably the simplest to implement
        • Cons: there might be conflicts with other attributes

        2. NGramTokenizer exposes "gramSize" for later use, and Solr's QueryParser creates NGramPhraseQuery

        • Pros: no effect on Lucene's default behavior
        • Pros: no configuration is needed, because the query parser creates NGramPhraseQuery automatically
        • Cons: extra code is needed in each query parser to use NGramPhraseQuery

        3. add a "gramSize" (or similar) attribute to schema.xml, and have Solr's query parser create NGramPhraseQuery using the gramSize given by the user

        • Pros: no effect on Lucene's or Solr's default behavior
        • Cons: a new configuration attribute is introduced
        • Cons: what happens if the user gives a gramSize value inconsistent with the minGramSize or maxGramSize given to NGramTokenizer? That could be problematic.

        I attach two patches: SOLR-3055-1.patch for strategy 1 and SOLR-3055-2.patch for strategy 2.
        Reviews and suggestions are appreciated.
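The shape of strategy 2 can be sketched in plain Java (all names here are hypothetical illustrations, not the actual patch): the tokenizer exposes its gram size, and the query parser consults it when deciding which phrase query to build, leaving Lucene's defaults untouched.

```java
// A shape sketch of strategy 2 (names are hypothetical, not the actual patch):
// the tokenizer advertises a fixed gram size, and the query parser checks for
// that capability when choosing the phrase query implementation.
public class Strategy2Sketch {
    interface GramSizeAware {
        // Fixed gram size; a real implementation might return -1
        // when minGramSize != maxGramSize.
        int getGramSize();
    }

    // Stand-in for the parser's phrase-query factory method; returns a label
    // instead of a real PhraseQuery / NGramPhraseQuery object.
    static String choosePhraseQuery(Object tokenizer) {
        if (tokenizer instanceof GramSizeAware) {
            int n = ((GramSizeAware) tokenizer).getGramSize();
            if (n > 1) return "NGramPhraseQuery(n=" + n + ")";
        }
        return "PhraseQuery";
    }

    public static void main(String[] args) {
        GramSizeAware bigramTokenizer = () -> 2;
        System.out.println(choosePhraseQuery(bigramTokenizer)); // NGramPhraseQuery(n=2)
        System.out.println(choosePhraseQuery(new Object()));    // PhraseQuery
    }
}
```

The design point is that only the consumer (the query parser) changes; tokenizers merely expose information, which is why this strategy has no effect on Lucene's default behavior.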

        Tomoko Uchida added a comment -

        I ran a brief JMeter benchmark against Solr 5.0.0 trunk, with and without the strategy 1 patch.
        There seems to be a significant performance gain for n-gram based phrase queries.

        • Hardware : MacBook Pro, 2.8GHz Intel Core i5
        • Java version : 1.7.0_71
        • Solr version : 5.0.0 SNAPSHOT / 5.0.0 SNAPSHOT with SOLR-3055-1.patch
        • Java heap : 500MB
        • Documents : Wikipedia (Japanese) 100000 docs
        • Solr config : attached solrconfig.xml (query result cache disabled)
        • Schema : attached schema.xml (NGramTokenizer's maxGramSize=3, minGramSize=2)
        • Queries : "python", "javascript", "windows", "プログラミング", "インターネット", "スマートフォン" (the last three are Japanese: "programming", "internet", "smartphone")
        • JMeter scenario : execute each of the 6 queries above 1000 times (i.e. 6000 queries in total)
        • JMeter Threads : 1

        To warm up, I ran the JMeter scenario twice for each setting.
        Second-round results:

        Solr                          Avg. response time   Throughput
        5.0.0-SNAPSHOT                7 msec               137.8/sec
        5.0.0-SNAPSHOT with patch-1   4 msec               201.3/sec
        Koji Sekiguchi added a comment -

        Hi Uchida-san, thank you for your effort in reworking this issue!

        Based on your observations (pros and cons), I like the 1st strategy. If you agree, why don't you add test cases for it? Also, don't we need to consider other n-gram-type Tokenizers and even TokenFilters, such as NGramTokenFilter and CJKBigramFilter?

        Also, I think there is a restriction when minGramSize != maxGramSize. If it's not significant, we can examine that restriction separately from this issue, because we rarely set different values for those when searching CJK words. We use NGramTokenizer with a fixed gram size a lot for searching CJK words, and we could get a nice performance gain from the patch, as you've shown us.

        Tomoko Uchida added a comment -

        Thank you for your response.

        I will add test code and an updated patch that considers other Tokenizers / TokenFilters.

        My patch seems to work well for both cases, minGramSize == maxGramSize and minGramSize != maxGramSize, but it is not optimized for maxGramSize.
        When minGramSize != maxGramSize, using maxGramSize for the optimization would give the best performance improvement; we can examine that separately (it may need another issue). In practice, we often set a fixed gram size for CJK words as you pointed out, so I think the patch is beneficial even without the maxGramSize optimization.
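Why maxGramSize gives the better reduction can be shown with a quick count (an illustration of the skip arithmetic only, not the patch itself): skipping with stride n over g consecutive grams keeps roughly g/n terms plus the final gram, so a larger stride leaves fewer terms for the phrase query to match.

```java
// Quick count (illustration only): how many terms a phrase query must still
// match when skipping with a given stride over g consecutive grams, as in the
// NGramPhraseQuery reduction. A larger stride (maxGramSize) keeps fewer terms.
public class StrideCount {
    static int keptTerms(int gramCount, int stride) {
        int kept = 0;
        for (int i = 0; i < gramCount; i += stride) kept++;
        // the final gram is always kept to pin the phrase end
        if ((gramCount - 1) % stride != 0) kept++;
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keptTerms(12, 2)); // stride = minGramSize: 7 terms
        System.out.println(keptTerms(12, 3)); // stride = maxGramSize: 5 terms
    }
}
```

Over 12 grams, a stride of 2 keeps 7 terms while a stride of 3 keeps 5, so optimizing by maxGramSize prunes more postings lists per query.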

        Tomoko Uchida added a comment -

        I've created LUCENE-6163 and added a patch there (this patch affects lucene-core and lucene-analysis, not Solr).
        I'd like to keep working on it there.


          People

          • Assignee:
            Unassigned
            Reporter:
            Koji Sekiguchi
          • Votes:
            1
            Watchers:
            5
