  1. Lucene - Core
  2. LUCENE-1370

Add ShingleFilter option to output unigrams if no shingles can be generated

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.3, 3.0.2, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently, if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, ShingleFilter.next() returns no tokens. This patch adds a new option, outputUnigramIfNoNgrams: if it is set and the underlying stream is only one token long, ShingleFilter returns that token regardless of the outputUnigrams setting.

      My use case here is speeding up phrase queries. The technique is as follows:

      First, do index-time analysis with ShingleFilter (outputUnigrams==true), which expands the text as follows:

      "please divide this sentence into shingles" ->
      "please", "please divide"
      "divide", "divide this"
      "this", "this sentence"
      "sentence", "sentence into"
      "into", "into shingles"
      "shingles"

      Second, do query-time analysis with ShingleFilter (outputUnigrams==false and outputUnigramIfNoNgrams==true). A phrase query entered by the user is then tokenized as follows:

      "please divide this sentence into shingles" ->
      "please divide"
      "divide this"
      "this sentence"
      "sentence into"
      "into shingles"

      Running phrase queries over bigrams like this gives me a considerable speedup. Without the outputUnigramIfNoNgrams option, a single-word query would tokenize like this:

      "please" ->
      [no tokens]

      But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:

      "please" ->
      "please"
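
      The three cases above can be reproduced with a small plain-Java model. This is only a sketch of the behavior described in this issue, not Lucene's ShingleFilter: the class ShingleSketch and the method shingle are hypothetical names, shingles are fixed at bigrams, and tokens are plain strings rather than a TokenStream.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the bigram shingling described in this issue.
// NOT Lucene code: ShingleSketch and shingle(...) are hypothetical names.
class ShingleSketch {

    public static List<String> shingle(List<String> tokens,
                                       boolean outputUnigrams,
                                       boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));                            // the unigram itself
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));  // the bigram
            }
        }
        // Proposed fallback: a one-token stream produces no bigrams, so
        // emit the lone token instead of returning nothing at all.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> phrase =
            List.of("please", "divide", "this", "sentence", "into", "shingles");
        System.out.println(shingle(phrase, true, false));            // index time
        System.out.println(shingle(phrase, false, true));            // query time
        System.out.println(shingle(List.of("please"), false, false)); // no tokens
        System.out.println(shingle(List.of("please"), false, true));  // fallback
    }
}
```

      In this model the fallback only fires when nothing else was emitted, so a single-token stream with outputUnigrams==true still yields that token exactly once.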

      ****

      The patch also adds a few test cases for the behavior that predates the outputUnigramIfNoNgrams option.

      ****

      I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.

        Attachments

        1. LUCENE-1370.patch
          12 kB
          Steve Rowe
        2. LUCENE-1370.patch
          9 kB
          Steve Rowe
        3. LUCENE-1370.patch
          8 kB
          Steve Rowe
        4. LUCENE-1370.patch
          8 kB
          Chris Harris
        5. LUCENE-1370.patch
          8 kB
          Chris Harris
        6. LUCENE-1370.patch
          7 kB
          Chris Harris
        7. LUCENE-1370.patch
          7 kB
          Chris Harris
        8. ShingleFilter.patch
          6 kB
          Chris Harris

              People

              • Assignee:
                steve_rowe Steve Rowe
                Reporter:
                ryguasu Chris Harris
              • Votes:
                3
                Watchers:
                3
