Lucene - Core
LUCENE-1370

Add ShingleFilter option to output unigrams if no shingles can be generated

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.3, 3.0.2, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.

      My use case here is speeding up phrase queries. The technique is as follows:

      First, do index-time analysis using ShingleFilter (with outputUnigrams==true), which expands the text as follows:

      "please divide this sentence into shingles" ->
      "please", "please divide"
      "divide", "divide this"
      "this", "this sentence"
      "sentence", "sentence into"
      "into", "into shingles"
      "shingles"

      Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the following manner:

      "please divide this sentence into shingles" ->
      "please divide"
      "divide this"
      "this sentence"
      "sentence into"
      "into shingles"

      By running phrase queries against bigrams like this, I get a considerable speedup. Without the outputUnigramIfNoNgrams option, a single-word query would tokenize like this:

      "please" ->
      [no tokens]

      But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:

      "please" ->
      "please"

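      The tokenization described above can be sketched in plain Java with no Lucene dependency. This is only a model of the behavior, limited to bigrams over a pre-split token list; the class name ShingleSketch and the method shingle are hypothetical names chosen for illustration and are not the ShingleFilter API, and the actual patch may handle details differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of bigram shingling; not the Lucene ShingleFilter API.
class ShingleSketch {

    // Returns the tokens a bigram shingler would emit for the given input tokens.
    static List<String> shingle(List<String> tokens,
                                boolean outputUnigrams,
                                boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            // A bigram exists only when there is a following token.
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        // Fallback: a one-token stream yields no bigrams, so emit the lone token.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence =
            Arrays.asList("please divide this sentence into shingles".split(" "));
        List<String> single = Arrays.asList("please");

        // Index time: unigrams and bigrams.
        System.out.println(shingle(sentence, true, false));
        // Query time: bigrams only.
        System.out.println(shingle(sentence, false, true));
        // Single-word query without the new option: prints []
        System.out.println(shingle(single, false, false));
        // Single-word query with the new option: prints [please]
        System.out.println(shingle(single, false, true));
    }
}
```

      In this model the fallback only fires when nothing else was emitted, so a multi-word query still produces bigrams only and continues to match the bigrams written at index time.
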
      ****

      The patch also adds a few tests for the behavior that predates the outputUnigramIfNoNgrams option.

      ****

      I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.

      1. LUCENE-1370.patch
        12 kB
        Steve Rowe
      2. LUCENE-1370.patch
        9 kB
        Steve Rowe
      3. LUCENE-1370.patch
        8 kB
        Steve Rowe
      4. LUCENE-1370.patch
        8 kB
        Chris Harris
      5. LUCENE-1370.patch
        8 kB
        Chris Harris
      6. LUCENE-1370.patch
        7 kB
        Chris Harris
      7. LUCENE-1370.patch
        7 kB
        Chris Harris
      8. ShingleFilter.patch
        6 kB
        Chris Harris

        Issue Links

          Activity

          Chris Harris created issue -
          Chris Harris made changes -
          Field Original Value New Value
          Attachment ShingleFilter.patch [ 12389206 ]
          Chris Harris made changes -
          Attachment LUCENE-1370.patch [ 12389313 ]
          Chris Harris made changes -
          Attachment LUCENE-1370.patch [ 12389314 ]
          Karl Wettin made changes -
          Assignee Karl Wettin [ karl.wettin ]
          Chris Harris made changes -
          Link This issue is related to SOLR-744 [ SOLR-744 ]
          Chris Harris made changes -
          Attachment LUCENE-1370.patch [ 12419077 ]
          Chris Harris made changes -
          Attachment LUCENE-1370.patch [ 12419233 ]
          Michael McCandless made changes -
          Fix Version/s 3.0 [ 12312889 ]
          Uwe Schindler made changes -
          Fix Version/s 3.1 [ 12314025 ]
          Fix Version/s 3.0 [ 12312889 ]
          Steve Rowe made changes -
          Assignee Karl Wettin [ karl.wettin ] Steven Rowe [ steve_rowe ]
          Steve Rowe made changes -
          Fix Version/s 3.1 [ 12314822 ]
          Affects Version/s 3.0.2 [ 12314798 ]
          Affects Version/s 2.9.3 [ 12314799 ]
          Affects Version/s 3.1 [ 12314822 ]
          Affects Version/s 4.0 [ 12314025 ]
          Steve Rowe made changes -
          Attachment LUCENE-1370.patch [ 12456455 ]
          Steve Rowe made changes -
          Summary "Patch to make ShingleFilter output a unigram if no ngrams can be generated" -> "Add ShingleFilter option to output unigrams if no shingles can be generated"
          Steve Rowe made changes -
          Attachment LUCENE-1370.patch [ 12456456 ]
          Steve Rowe made changes -
          Attachment LUCENE-1370.patch [ 12456477 ]
          Steve Rowe made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Mark Thomas made changes -
          Workflow jira [ 12441109 ] Default workflow, editable Closed status [ 12562626 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12562626 ] jira [ 12584797 ]
          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Shai Erera made changes -
          Component/s modules/analysis [ 12310230 ]
          Component/s contrib/analyzers [ 12312333 ]

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Chris Harris
            • Votes:
              3
              Watchers:
              4
