Lucene - Core
  1. Lucene - Core
  2. LUCENE-1370

Add ShingleFilter option to output unigrams if no shingles can be generated

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.3, 3.0.2, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.

      My use case here is speeding up phrase queries. The technique is as follows:

      First, doing index-time analysis using ShingleFilter (using outputUnigrams==true), thereby expanding things as follows:

      "please divide this sentence into shingles" ->
      "please", "please divide"
      "divide", "divide this"
      "this", "this sentence"
      "sentence", "sentence into"
      "into", "into shingles"
      "shingles"

      Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the following manner:

      "please divide this sentence into shingles" ->
      "please divide"
      "divide this"
      "this sentence"
      "sentence into"
      "into shingles"

      By doing phrase queries with bigrams like this, I can gain a very considerable speedup. Without the outputUnigramIfNoNgrams option, then a single word query would tokenize like this:

      "please" ->
      [no tokens]

      But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:

      "please" ->
      "please"

      ****

      The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.

      ****

      I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.

      1. LUCENE-1370.patch
        12 kB
        Steve Rowe
      2. LUCENE-1370.patch
        9 kB
        Steve Rowe
      3. LUCENE-1370.patch
        8 kB
        Steve Rowe
      4. LUCENE-1370.patch
        8 kB
        Chris Harris
      5. LUCENE-1370.patch
        8 kB
        Chris Harris
      6. LUCENE-1370.patch
        7 kB
        Chris Harris
      7. LUCENE-1370.patch
        7 kB
        Chris Harris
      8. ShingleFilter.patch
        6 kB
        Chris Harris

        Issue Links

          Activity

          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1
          Hide
          Steve Rowe added a comment -

          Committed: trunk revision 1006187, branch_3x revision 1006195

          Show
          Steve Rowe added a comment - Committed: trunk revision 1006187, branch_3x revision 1006195
          Hide
          Steve Rowe added a comment -

          Looks good to me, I tested it and had no problems.

          Thanks for testing, Robert.

          Maybe we should add the option to ShingleAnalyzerWrapper too?

          Yes, I missed that one - I've added support for it in this version of the patch, along with one test in ShingleAnalyzerWrapperTest (for the single-token case).

          Show
          Steve Rowe added a comment - Looks good to me, I tested it and had no problems. Thanks for testing, Robert. Maybe we should add the option to ShingleAnalyzerWrapper too? Yes, I missed that one - I've added support for it in this version of the patch, along with one test in ShingleAnalyzerWrapperTest (for the single-token case).
          Hide
          Robert Muir added a comment -

          Looks good to me, I tested it and had no problems.

          Maybe we should add the option to ShingleAnalyzerWrapper too?
          This would be very convenient for Lucene users.

          Show
          Robert Muir added a comment - Looks good to me, I tested it and had no problems. Maybe we should add the option to ShingleAnalyzerWrapper too? This would be very convenient for Lucene users.
          Hide
          Steve Rowe added a comment - - edited

          Added lucene/CHANGES.txt entry, and slightly modified reset() code switching back the unigram output option.

          All tests pass.

          If there are no objections, I'll commit this in a couple of days.

          edit: I added an entry to *modules/analysis/*CHANGES.txt, not *lucene/*CHANGES.txt

          Show
          Steve Rowe added a comment - - edited Added lucene/CHANGES.txt entry, and slightly modified reset() code switching back the unigram output option. All tests pass. If there are no objections, I'll commit this in a couple of days. edit : I added an entry to *modules/analysis/*CHANGES.txt, not *lucene/*CHANGES.txt
          Hide
          Steve Rowe added a comment -

          The previous patch predates the rewrite ShingleFilter was subjected to as a result of LUCENE-2218, so it needed to be rejiggered somewhat.

          Changes from the previous patch:

          1. The new patch simply enables unigram output if the number of input tokens is less than minShingleSize. The existing code then handles the situation appropriately, and reset() restores the original unigram output option.
          2. I renamed the option "outputUnigramIfNoNgrams" to "outputUnigramsIfNoShingles", because:
            • Unigram -> Unigrams: the output could result in more than one unigram if minShingleSize is greater than the default 2; and
            • Ngrams -> Shingles: for consistency with the class's name.
          3. I renamed "returnedAnyTokensYet" to "noShingleOutput", and reversed its (boolean) sense, because:
            • unigrams should be output only if no shingles can be output, rather than no tokens; and
            • reversing the sense allowed the test using it to avoid negation, and allowed the name to be shorter.
          4. I added a note to the setOutputUnigramsIfNoShingles() method javadoc to the effect that if outputUnigram == true, unigrams will always be output regardless of the setting of outputUnigramsIfNoShingles.
          5. I added a test that makes sure that when minShingleSize > 2 and the number of input tokens is less than minShingleSize, (multiple) unigrams are output
          Show
          Steve Rowe added a comment - The previous patch predates the rewrite ShingleFilter was subjected to as a result of LUCENE-2218 , so it needed to be rejiggered somewhat. Changes from the previous patch: The new patch simply enables unigram output if the number of input tokens is less than minShingleSize. The existing code then handles the situation appropriately, and reset() restores the original unigram output option. I renamed the option "outputUnigramIfNoNgrams" to "outputUnigramsIfNoShingles", because: Unigram -> Unigrams: the output could result in more than one unigram if minShingleSize is greater than the default 2; and Ngrams -> Shingles: for consistency with the class's name. I renamed "returnedAnyTokensYet" to "noShingleOutput", and reversed its (boolean) sense, because: unigrams should be output only if no shingles can be output, rather than no tokens ; and reversing the sense allowed the test using it to avoid negation, and allowed the name to be shorter. I added a note to the setOutputUnigramsIfNoShingles() method javadoc to the effect that if outputUnigram == true, unigrams will always be output regardless of the setting of outputUnigramsIfNoShingles. I added a test that makes sure that when minShingleSize > 2 and the number of input tokens is less than minShingleSize, (multiple) unigrams are output
          Hide
          Robert Muir added a comment -

          I think this patch is really important to get into 3.x and trunk...

          its a dead simple way (analysis configuration only) for folks to get a massive performance improvement for phrasequeries.

          Show
          Robert Muir added a comment - I think this patch is really important to get into 3.x and trunk... its a dead simple way (analysis configuration only) for folks to get a massive performance improvement for phrasequeries.
          Hide
          Uwe Schindler added a comment -

          I move this to 3.1 as it is a new feature.

          Show
          Uwe Schindler added a comment - I move this to 3.1 as it is a new feature.
          Hide
          Uwe Schindler added a comment -

          This patch is very straightforward, I think we could add it to 3.0, too. But 3.0 should not conatin new features, so move to 3.1?

          Show
          Uwe Schindler added a comment - This patch is very straightforward, I think we could add it to 3.0, too. But 3.0 should not conatin new features, so move to 3.1?
          Hide
          Simon Willnauer added a comment -

          Karl, are you working on this? Seems like this should not block 3.0 - feel free to move it to 3.1

          simon

          Show
          Simon Willnauer added a comment - Karl, are you working on this? Seems like this should not block 3.0 - feel free to move it to 3.1 simon
          Hide
          Karl Wettin added a comment -

          Oups, I seem to have assigned this to me and then forgotten about it. Sorry!
          I'll check it out this weekend!

          Show
          Karl Wettin added a comment - Oups, I seem to have assigned this to me and then forgotten about it. Sorry! I'll check it out this weekend!
          Hide
          Chris Harris added a comment -

          At the risk of being annoying, is there any chance this patch (perhaps slightly refined, if you guys want) could make it into Lucene 3.0? I think I'm not the only person who wants to use the ShingleFilter in this slightly way. For example, I just noticed that the new Solr 1.4 Enterprise Search Server book (on p. 288) makes brief mention of SOLR-744, which depends on this patch. I'm fine keeping this out of the Lucene distribution if people think ShingleFilter.outputUnigrams is a silly hack and there's a better way to get the job done. But if this captures a legitimate use case, are there compelling reasons to keep it out?

          Thanks.

          Show
          Chris Harris added a comment - At the risk of being annoying, is there any chance this patch (perhaps slightly refined, if you guys want) could make it into Lucene 3.0? I think I'm not the only person who wants to use the ShingleFilter in this slightly way. For example, I just noticed that the new Solr 1.4 Enterprise Search Server book (on p. 288) makes brief mention of SOLR-744 , which depends on this patch. I'm fine keeping this out of the Lucene distribution if people think ShingleFilter.outputUnigrams is a silly hack and there's a better way to get the job done. But if this captures a legitimate use case, are there compelling reasons to keep it out? Thanks.
          Hide
          Chris Harris added a comment -

          Update patch to avoid an unnecessary State.clone(), as suggested by Robert.

          Show
          Chris Harris added a comment - Update patch to avoid an unnecessary State.clone(), as suggested by Robert.
          Hide
          Uwe Schindler added a comment -

          AttributeSource.State objects are unmodifiable (as the contents are private and so you have no access to the wrapped attributes).

          Show
          Uwe Schindler added a comment - AttributeSource.State objects are unmodifiable (as the contents are private and so you have no access to the wrapped attributes).
          Hide
          Robert Muir added a comment -

          Chris, yeah I was thinking something like your first method...
          i'm not concerned with any micro-optimization just reducing a clone if we can )

          Show
          Robert Muir added a comment - Chris, yeah I was thinking something like your first method... i'm not concerned with any micro-optimization just reducing a clone if we can )
          Hide
          Chris Harris added a comment -

          here i think you could save a clone() by not calling captureState twice?
          even though it doesnt have to recompute the state, captureState does have to clone it.

          So is the idea to replace

                if (getNextToken()) {
                  if (outputUnigramIfNoNgrams && firstToken == null) {
                    firstToken = captureState();
                  }
                  shingleBuf.add(captureState());
          

          with

                if (getNextToken()) {
                  State curState = captureState();
                  if (outputUnigramIfNoNgrams && firstToken == null) {
                    firstToken = curState;
                  }
                  shingleBuf.add(curState);
          

          That seems fine, unless there's some hidden reason why you can't share State objects.

          I'd guess you could optimize more than that, but I think you run into diminishing returns, making the code harder to read more than you're making it faster. For example:

               if (getNextToken()) {
                  if (outputUnigramIfNoNgrams && firstToken == null) {
                    firstToken = captureState();
                    shingleBuf.add(firstToken);
                  }
                  else {
                    shingleBuf.add(captureState());
                  }
          
          Show
          Chris Harris added a comment - here i think you could save a clone() by not calling captureState twice? even though it doesnt have to recompute the state, captureState does have to clone it. So is the idea to replace if (getNextToken()) { if (outputUnigramIfNoNgrams && firstToken == null ) { firstToken = captureState(); } shingleBuf.add(captureState()); with if (getNextToken()) { State curState = captureState(); if (outputUnigramIfNoNgrams && firstToken == null ) { firstToken = curState; } shingleBuf.add(curState); That seems fine, unless there's some hidden reason why you can't share State objects. I'd guess you could optimize more than that, but I think you run into diminishing returns, making the code harder to read more than you're making it faster. For example: if (getNextToken()) { if (outputUnigramIfNoNgrams && firstToken == null ) { firstToken = captureState(); shingleBuf.add(firstToken); } else { shingleBuf.add(captureState()); }
          Hide
          Robert Muir added a comment -

          Uwe, I see... i had not looked far enough yet to see how this was done, thanks.

          but i think my comment still holds true, we can save a clone() here on the first token.
          for long fields does not matter so much, but for short fields a small savings.

          Show
          Robert Muir added a comment - Uwe, I see... i had not looked far enough yet to see how this was done, thanks. but i think my comment still holds true, we can save a clone() here on the first token. for long fields does not matter so much, but for short fields a small savings.
          Hide
          Uwe Schindler added a comment -

          even though it doesnt have to recompute the state, captureState does have to clone it.

          The recomputation is only done once for the lifetime of the TokenStream (as far as there are no modifications in the number of attributes). The State linked list is just another representation of the instances LinkedHashMap for faster iteration during cloning (no new Iterators to create and so on).

          Show
          Uwe Schindler added a comment - even though it doesnt have to recompute the state, captureState does have to clone it. The recomputation is only done once for the lifetime of the TokenStream (as far as there are no modifications in the number of attributes). The State linked list is just another representation of the instances LinkedHashMap for faster iteration during cloning (no new Iterators to create and so on).
          Hide
          Robert Muir added a comment -

          Chris, just a small comment:

            if (outputUnigramIfNoNgrams && firstToken == null) {
                    firstToken = captureState();
            }
            shingleBuf.add(captureState());
          

          here i think you could save a clone() by not calling captureState twice?
          even though it doesnt have to recompute the state, captureState does have to clone it.

          (I am thinking about trying out your patch where i have some very short fields, this is why i mentioned it)

          Show
          Robert Muir added a comment - Chris, just a small comment: if (outputUnigramIfNoNgrams && firstToken == null ) { firstToken = captureState(); } shingleBuf.add(captureState()); here i think you could save a clone() by not calling captureState twice? even though it doesnt have to recompute the state, captureState does have to clone it. (I am thinking about trying out your patch where i have some very short fields, this is why i mentioned it)
          Hide
          Chris Harris added a comment -

          Any ideas on whether it will ever make sense for this patch to make it into the trunk? Some random thoughts:

          • The latest patch does make ShingleFilter marginally less clean. (Maybe my least favorite part is how fillShingleBuffer() is now responsible for setting firstToken; that it should do so is not obvious from the method name.)
          • This patch's functionality could potentially be implemented without modifying ShingleFilter itself. For example, maybe instead of patching ShingleFilter we could have a ShingleFilterUnigramWrapper class, that would delegate to ShingleFilter, except when ShingleFilter failed to produce ngrams. I'm a little worried that this would require using a CachingTokenFilter, and that might not be ideal from an efficiency perspective.
          • It might be good to rename outputUnigramIfNoNgrams to something like forceUnigramIfNoNgrams. With the current naming scheme, you end up setting some contradictory-sounding options, e.g. setting outputUnigrams==false and outputUnigramIfNoNgrams==true. If you look at the code this might not be confusing, but it'd be nice if it were more straightforward without making you look at the code.
          • I gather that some people have an interest in making a minShingleSize option. (See http://www.nabble.com/shingle-filter-td25125367.html) I'm not sure how best to modify this patch should that get implemented. It might depend on the typical use cases for minShingleSize, and if there's any overlap with my use case here.
          Show
          Chris Harris added a comment - Any ideas on whether it will ever make sense for this patch to make it into the trunk? Some random thoughts: The latest patch does make ShingleFilter marginally less clean. (Maybe my least favorite part is how fillShingleBuffer() is now responsible for setting firstToken; that it should do so is not obvious from the method name.) This patch's functionality could potentially be implemented without modifying ShingleFilter itself. For example, maybe instead of patching ShingleFilter we could have a ShingleFilterUnigramWrapper class, that would delegate to ShingleFilter, except when ShingleFilter failed to produce ngrams. I'm a little worried that this would require using a CachingTokenFilter, and that might not be ideal from an efficiency perspective. It might be good to rename outputUnigramIfNoNgrams to something like forceUnigramIfNoNgrams. With the current naming scheme, you end up setting some contradictory-sounding options, e.g. setting outputUnigrams==false and outputUnigramIfNoNgrams==true. If you look at the code this might not be confusing, but it'd be nice if it were more straightforward without making you look at the code. I gather that some people have an interest in making a minShingleSize option. (See http://www.nabble.com/shingle-filter-td25125367.html ) I'm not sure how best to modify this patch should that get implemented. It might depend on the typical use cases for minShingleSize, and if there's any overlap with my use case here.
          Hide
          Chris Harris added a comment -

          Updating the patch to apply successfully against r813015, and hopefully thereby against the upcoming Lucene 2.9 release.

          Show
          Chris Harris added a comment - Updating the patch to apply successfully against r813015, and hopefully thereby against the upcoming Lucene 2.9 release.
          Hide
          Chris Harris added a comment -

          Getting rid of Windows-style newlines

          Show
          Chris Harris added a comment - Getting rid of Windows-style newlines
          Hide
          Chris Harris added a comment - - edited

          Fixing to merge cleanly against changes made in r687357. The patch file will also now have a proper name, LUCENE-1370.patch.

          Show
          Chris Harris added a comment - - edited Fixing to merge cleanly against changes made in r687357. The patch file will also now have a proper name, LUCENE-1370 .patch.
          Hide
          Chris Harris added a comment -

          : Do you say it is 50x faster with shingle queries that only contains bigram
          : compared to shingle queries that contains uni- and bigrams? Or is it 50x
          : faster using shingles compared to phrase queries? (I've myself seen
          : performance gains similar to the latter.)

          I'm not sure I totally understand you, so let me try rephrasing things. What I mean is that phrase queries that have only bigrams, like this:

          PhraseQuery

          { "please divide" "divide this" "this sentence" "sentence into" "into shingles" }

          run maybe 50x as fast as phrase queries that have both bigrams and unigrams, like this:

          PhraseQuery

          { "please", "please divide" "divide", "divide this" "this", "this sentence" "sentence", "sentence into" "into", "into shingles" "shingles" }

          If it clarifies things any further, let me say that I'm handling all quoted phrase queries with the normal PhraseQuery class; I'm not, for instance, turning quoted phrases into some kind of BooleanQuery. (Technically it's not me that's making the PhraseQuery object but Solr and its query parser. But Solr is indeed turning my quoted phrase queries into normal PhraseQuery objects.)

          Show
          Chris Harris added a comment - : Do you say it is 50x faster with shingle queries that only contains bigram : compared to shingle queries that contains uni- and bigrams? Or is it 50x : faster using shingles compared to phrase queries? (I've myself seen : performance gains similar to the latter.) I'm not sure I totally understand you, so let me try rephrasing things. What I mean is that phrase queries that have only bigrams, like this: PhraseQuery { "please divide" "divide this" "this sentence" "sentence into" "into shingles" } run maybe 50x as fast as phrase queries that have both bigrams and unigrams, like this: PhraseQuery { "please", "please divide" "divide", "divide this" "this", "this sentence" "sentence", "sentence into" "into", "into shingles" "shingles" } If it clarifies things any further, let me say that I'm handling all quoted phrase queries with the normal PhraseQuery class; I'm not, for instance, turning quoted phrases into some kind of BooleanQuery. (Technically it's not me that's making the PhraseQuery object but Solr and its query parser. But Solr is indeed turning my quoted phrase queries into normal PhraseQuery objects.)
          Hide
          Karl Wettin added a comment -

          Do you say it is 50x faster with shingle queries that only contains bigram compared to shingle queries that contains uni- and bigrams? Or is it 50x faster using shingles compared to phrase queries? (I've myself seen performance gains similar to the latter.)

          Show
          Karl Wettin added a comment - Do you say it is 50x faster with shingle queries that only contains bigram compared to shingle queries that contains uni- and bigrams? Or is it 50x faster using shingles compared to phrase queries? (I've myself seen performance gains similar to the latter.)
          Hide
          Chris Harris added a comment -

          Karl, that's a good point about my setup being incompatible with non-0 slop. However, the performance gains I'm seeing with this patch on my data are substantial. When I last tested on the same index, same machine, same # of threads in my testing process, etc., and went from analyzing my queries with

          outputUnigrams==true

          to analyzing with

          using outputUnigrams==false and outputUnigramIfNoNgrams==true

          phrase queries ended up performing something like 50x as fast. Which is good, because the initial performance wasn't acceptable.

          The performance gains from outputUnigramIfNoNgrams were greater than those from when I instead tried moving the index to a solid state drive. (It was a a fairly entry-level SSD drive, but still.) It would be interesting to compare to moving to a machine with an obscene amount of RAM. (Not quite sure what would count as "obscene", but my index is 90+GB. Maybe half of that is taken up by stored fields.)

          Show
          Chris Harris added a comment - Karl, that's a good point about my setup being incompatible with non-0 slop. However, the performance gains I'm seeing with this patch on my data are substantial. When I last tested on the same index, same machine, same # of threads in my testing process, etc., and went from analyzing my queries with outputUnigrams==true to analyzing with using outputUnigrams==false and outputUnigramIfNoNgrams==true phrase queries ended up performing something like 50x as fast. Which is good, because the initial performance wasn't acceptable. The performance gains from outputUnigramIfNoNgrams were greater than those from when I instead tried moving the index to a solid state drive. (It was a a fairly entry-level SSD drive, but still.) It would be interesting to compare to moving to a machine with an obscene amount of RAM. (Not quite sure what would count as "obscene", but my index is 90+GB. Maybe half of that is taken up by stored fields.)
          Hide
          Karl Wettin added a comment -

          It's an OK filter setting if you ask me.

          However I'm curious to why you don't query for unigrams unless the input is a single token? That means you always require a 0-slop between any two tokens of the input. I know nothing about your needs, but it could be dangerous. You can always boost the bigrams a bit more than the unigrams if they cause a problem. I think you should benchmark the cost. I'm sure it's rather small and that you'll get better quality results by doing that. Users tend to never enter a query the way I want them to.

          Show
          Karl Wettin added a comment - It's an OK filter setting if you ask me. However I'm curious to why you don't query for unigrams unless the input is a single token? That means you always require a 0-slop between any two tokens of the input. I know nothing about your needs, but it could be dangerous. You can always boost the bigrams a bit more than the unigrams if they cause a problem. I think you should benchmark the cost. I'm sure it's rather small and that you'll get better quality results by doing that. Users tend to never enter a query the way I want them to.

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Chris Harris
            • Votes:
              3 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development