Lucene - Core
LUCENE-1224

NGramTokenFilter creates bad TokenStream

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      With the current trunk NGramTokenFilter(min=2, max=4), I index the string "abcdef", but I can't query it with "abc". If I query with "ab", I get a hit.

      The reason is that NGramTokenFilter generates a badly ordered TokenStream. Queries depend on the token order in the TokenStream: how stemming or phrase matching is analyzed is based on that order (Token.positionIncrement).

      With the current filter, the query string "abc" is tokenized to: ab bc abc,
      meaning "query a string that has ab bc abc in this order".
      The expected filter would generate: ab abc(positionIncrement=0) bc,
      meaning "query a string that has (ab|abc) bc in this order".

      I'd like to submit a patch for this issue.
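The difference between the two orderings can be sketched as follows (Python used purely for illustration; `current_order` and `expected_order` are hypothetical names, not Lucene code):

```python
def current_order(text, min_gram, max_gram):
    """Current trunk behavior: all grams of one size, then the next size.
    Every token gets positionIncrement=1, so 'abc' analyzes as the
    phrase 'ab bc abc'."""
    out = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            out.append((text[i:i + n], 1))  # (term, positionIncrement)
    return out

def expected_order(text, min_gram, max_gram):
    """Proposed behavior: grams grouped by start offset; grams that start
    at the same character stack at one position (positionIncrement=0),
    so 'abc' analyzes as '(ab|abc) bc'."""
    out = []
    for i in range(len(text)):
        first = True
        for n in range(min_gram, max_gram + 1):
            if i + n > len(text):
                break
            out.append((text[i:i + n], 1 if first else 0))
            first = False
    return out

print(current_order("abc", 2, 4))   # [('ab', 1), ('bc', 1), ('abc', 1)]
print(expected_order("abc", 2, 4))  # [('ab', 1), ('abc', 0), ('bc', 1)]
```

With the expected ordering, a phrase query built from the "abc" tokens matches a 2..4-gram index of "abcdef", because "abc" sits at the same position as "ab".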

      1. LUCENE-1224.patch
        9 kB
        Hiroaki Kawai
      2. NGramTokenFilter.patch
        1 kB
        Hiroaki Kawai
      3. NGramTokenFilter.patch
        1 kB
        Hiroaki Kawai

        Issue Links

          Activity

          Hiroaki Kawai added a comment -

          Modified to set the right start/end offset values in the Token properties.

          Grant Ingersoll added a comment -

          Please add unit tests to the patch demonstrating the issue.

          Hiroaki Kawai added a comment -

          Patch updated with unit test.

          LUCENE-1225 makes this problem easier to understand. This patch also includes the token filter issues, which are more complicated.

          Grant Ingersoll added a comment -

          Hi Hiroaki,

          I have been reviewing the tests for this and have a couple of comments. First, I don't see why you need to bring indexing into the equation. Second, the changes to testNGrams still don't test the issue, namely they don't examine that the output ngrams are actually in the correct position. I think you deduce this later in testIndexAndQuery, but it is never explicitly stated. I'd drop testIndexAndQuery and just fix testNGrams such that it checks the positions appropriately.

          On a more philosophical level, it is a bit curious to me that if we have the strings "abcde fghi", we are fine with "b" being at position 1, and not at position 0, but "ab" needs to be at position 0. I wonder whether there are any thoughts on what the relative positions of ngrams should be. Should they all occur at the same position? It seems to me that it doesn't make sense that the "f" ngrams don't start until some position other than 1. This would currently prevent doing phrase queries such as "ab fg" with no slop.

          I'm assuming this applies to LUCENE-1225 as well.

          I will link 1225 to this issue, and you can attach a single patch.

          Hiroaki Kawai added a comment -

          Q: Why is it necessary to index?
          A: Because it was necessary to show how the query is performed.
          That is the point I wanted to address.

          Q: testNGrams doesn't test the issue
          A: Exactly, it doesn't test the issue.
          I modified the test because it failed with my patch, since
          Token.toString() now prints additional positionIncrement information.
          I read the existing test program and found that the current test
          depends on the Token.toString() method.
          I thought we'd better test it without Token.toString().
          The current test program asserts that the tokens have NO positionIncrement.

          testIndexAndQuery is the very test that addresses the issue.
          Please don't drop it. Consider the case where we want to search for a word
          that contains "abcd" in a 2-gram index.
          The test searches for "abcd" with 2,3-grams.

          The 2-grams of abcde are 'ab', 'bc', 'cd', 'de'.
          Referring to the current Lucene implementation, the position gap
          between 'ab' and 'bc' must be 1.

          Grant Ingersoll added a comment -

          OK, let me change the comment. You can test this problem without indexing and querying. All of the information is available on the token. I would suggest you revert the test to its original form and then modify testNGrams() by adding asserts that check that the positionIncrement value is set properly. By going the indexing/querying route, you are not only testing the token filters but pretty much all of Lucene, and are thus subject to any problems there. In other words, it ain't a unit test. If you set the positionIncrement properly and test for it, it will work in Lucene for the queries, etc. If it doesn't, we have much bigger problems than ngrams. That being said, if you want to fix testNgrams and leave the query case in, that is fine by me.

          Grant Ingersoll added a comment -

          FWIW, I also think we should address the more philosophical question of what the intermediate positions should be of the tokens. The more I think about it, the more I think all "grams" of a given word should be at the same position, but I would like to hear from others on this before deciding.

          Hiroaki Kawai added a comment -

          Umm..., if you don't like indexing and querying in the unit test, where should I place the joint test that uses NGramTokenizer? It might be nice if we could put that joint test in a proper place.

          I placed testIndexAndQuery in the code because other tests, like the KeywordAnalyzer test in the core, include index-and-query code in their unit tests.

          I'm fine with separating the codes into different files.

          Hiroaki Kawai added a comment -

          In my understanding,
          ----------
          The sequence we have "This is an example"

          If we want to tokenize with white space tokenizer, the tokens are
          "This", "is", "an", "example"
          positions are 0,1,2,3

          If we want to tokenize with 2-gram, the tokens are
          "Th" "hi" "is" "s " " i" "is" "s " " a" "an" "n " " e" "ex" "xa" "am" "mp" "pl" "le"
          positions are 0,1,2,3,4,...
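The two tokenizations above can be reproduced with a short sketch (Python, purely illustrative; consecutive positions model every token having positionIncrement=1):

```python
text = "This is an example"

# Whitespace tokenization: one position per word.
ws_tokens = text.split()
ws_positions = list(range(len(ws_tokens)))

# 2-gram tokenization over the raw text (spaces included): one position per gram.
bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
bigram_positions = list(range(len(bigrams)))

print(ws_tokens, ws_positions)   # ['This', 'is', 'an', 'example'] [0, 1, 2, 3]
print(bigrams[:4])               # ['Th', 'hi', 'is', 's ']
```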

          Grant Ingersoll added a comment -

          Umm..., if you don't like indexing and querying in the unit test, where should I place the join test that use NGramTokenizer? It might be nice if we could place that join test in a proper place.

          My point is, I don't think the test needs to do any indexing/querying at all to satisfy the change. It adds absolutely nothing to the test and only complicates the matter.

          I placed the testIndexAndQuery in the code because the other code like KeywordAnalyzer (in the core) test code has index&query test code in its unit tests.

          Just because another does it doesn't make it right.

          If we want to tokenize with white space tokenizer, the tokens are
          "This", "is", "an", "example"
          positions are 0,1,2,3

          If we want to tokenize with 2-gram, the tokens are
          "Th" "hi" "is" "s " " i" "is" "s " " a" "an" "n " " e" "ex" "xa" "am" "mp" "pl" "le"
          positions are 0,1,2,3,4,...

          Yes, I understand how it currently works. My question is more along the lines of: is this the right way of doing it? I don't know that it is, but it is a bigger question than you and me. I mean, if we are willing to accept that this issue is a bug, then it presents plenty of other problems in terms of position-related queries. For example, I think it makes sense to search for "th ex" as a phrase query, but that is not possible due to the positions (at least not without a lot of slop).

          DM Smith added a comment -

          My take as a user:

          Maybe, I don't understand the application of an n-gram filter, but my expectation is that words from the input that are indexed are positioned. Isn't that required to be able to do "near" searches?

          It would not matter to me if the n-grams had sub-positions to distinguish them (e.g. 1.a, 1.b, 1.c, 1.d, 2.a, ... for the example above; note: not implying any particular representation by this notation)

          Hiroaki Kawai added a comment -

          About the test code: I'm not going to say "I'm right". I just wanted to address the issue and share what we should solve. If you don't like the code, please just tell me the better way to do it. I initially put the code there because I thought it was reasonable and proper, but I'm fine with changing it.

          For example, I think it makes sense to search for "th ex" as a phrase query

          For example, I think it makes sense to search for "example" as a phrase query instead.

          I want to point out that NGramTokenizer is very useful for non-white-space-separated languages, for example Japanese. In that case, we won't search for "th ex", because that assumes sentences are separated by white space. I want to search by a fragment of a text sequence.

          I agree that this might be a big problem. IMHO, the issue comes from a concept mismatch between TokenFilter and TermPosition. Should the discussion be moved to the mailing list?

          Grant Ingersoll added a comment -

          I think the right way is simply to change the existing test to check
          that the term positions are correct per the changes. Right now, it
          doesn't check the position increment and it should. This can be done
          by looking at the positionIncrement on the Token that is produced by
          the TokenStream and doesn't require indexing.

          Otis Gospodnetic added a comment - edited

          Hiroaki:
          I agree with Grant about unit tests. I looked at the unit tests and thought the same thing as Grant - why is Hiroaki adding indexing/searching into the mix? Your change is about modifying the positions of n-grams, and you don't need to index or search for that. The test will be a lot simpler if you just test for positions, like Grant suggested.

          Also, once you change the unit test this way, it will be a lot easier to play with positions and figure out what the "right" way to handle positions is.

          Finally, it might turn out that people have different needs or different expectations for n-gram positions. Thus, when making changes, perhaps you can think of a mechanism that allows the caller to instruct the n-gram tokenizer which token positioning approach to take (e.g. the "incremental" one, or the one based on the position of the originating token, or...)
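Otis's suggestion of a caller-selectable positioning mode could look roughly like this (an illustrative Python sketch; `GramPositioning` and `ngrams_with_positions` are hypothetical names, not Lucene API):

```python
from enum import Enum

class GramPositioning(Enum):
    INCREMENTAL = "incremental"  # every gram advances the position by 1
    BY_ORIGIN = "by_origin"      # grams inherit the originating token's position

def ngrams_with_positions(tokens, min_gram, max_gram, mode):
    """tokens: list of (term, position) pairs from an upstream tokenizer."""
    out = []
    next_pos = 0
    for term, pos in tokens:
        for i in range(len(term)):
            for n in range(min_gram, max_gram + 1):
                if i + n > len(term):
                    break
                gram = term[i:i + n]
                if mode is GramPositioning.BY_ORIGIN:
                    out.append((gram, pos))
                else:
                    out.append((gram, next_pos))
                    next_pos += 1
    return out

# The two modes diverge for min=1, max=2 over a single token "ab":
print(ngrams_with_positions([("ab", 0)], 1, 2, GramPositioning.BY_ORIGIN))
print(ngrams_with_positions([("ab", 0)], 1, 2, GramPositioning.INCREMENTAL))
```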

          Hiroaki Kawai added a comment -

          After all, where should I place the testIndexAndQuery? Does anybody have a suggestion?

          Todd Feak added a comment -

          This bug caused me major headaches trying to figure out why substring matching with an NGramTokenFilter wasn't working for anything other than when setting min and max to the same values.

          The patch seems to fix the issue when applied locally; however, it also has a bug in it. It will stop parsing a token stream if a token comes through that is shorter than minGramSize, even if there are tokens later in the stream that are longer than minGramSize.

          Mark Miller added a comment -

          Sounds like this should really be addressed ... along with LUCENE-1225

          Robert Muir added a comment -

          I too think it's really important we fix this. I have to agree with Hiroaki's analysis of the situation, and the problems can be seen by looking at the code in both the filter/tokenizers and the tests themselves.

          Currently the tokenizers are limited to 1024 characters (LUCENE-1227), which is closely related to this issue.
          Look at the test for 1,3 ngrams of "abcde":

          public void testNgrams() throws Exception {
            NGramTokenizer tokenizer = new NGramTokenizer(input, 1, 3);
            assertTokenStreamContents(tokenizer,
              new String[]{"a","b","c","d","e", "ab","bc","cd","de", "abc","bcd","cde"},
              ...

          in my opinion the output should instead be: a, ab, ...
          Otherwise the tokenizer will either always be limited to 1024 chars or must read the entire document into RAM.
          This same problem exists for the EdgeNgram variants.
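Robert's offset-ordered output ("a", "ab", ...) is what makes streaming possible: the tokenizer only ever needs the next max_gram characters in memory. A sketch of the idea (Python, hypothetical; the real tokenizer reads from a Java Reader):

```python
import io

def stream_ngrams(reader, min_gram, max_gram, chunk_size=1024):
    """Emit n-grams grouped by start offset ('a', 'ab', 'abc', 'b', 'bc', ...),
    keeping at most max_gram + chunk_size characters buffered instead of
    reading the whole input into RAM."""
    buf = ""
    eof = False
    while True:
        # Top up the buffer so every gram starting at buf[0] is available.
        while not eof and len(buf) < max_gram:
            chunk = reader.read(chunk_size)
            if not chunk:
                eof = True
            else:
                buf += chunk
        if len(buf) < min_gram:
            return
        for n in range(min_gram, min(max_gram, len(buf)) + 1):
            yield buf[:n]
        buf = buf[1:]  # slide the window by one start offset

print(list(stream_ngrams(io.StringIO("abcde"), 1, 3)))
# ['a', 'ab', 'abc', 'b', 'bc', 'bcd', 'c', 'cd', 'cde', 'd', 'de', 'e']
```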

          I agree with Grant's comment about the philosophical discussion on token positions; perhaps we need an option for this (where they all get posInc=1, or posInc=0 is generated based on whitespace). I guess we could accommodate both needs by having tokenizer/filter variants too, but I'm not sure.

          The general problem I have with trying to determine a fix is that it will break backwards compatibility, and I also know that EdgeNGram is being used for purposes such as "auto-suggest". So I don't really have any ideas beyond making new filters/tokenizers, as I think there is another use case where the old behavior fits.

          Adrien Grand added a comment -

          All n-grams now have the same position and offsets as the original token (LUCENE-4955).
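The resolved behavior can be sketched as follows (Python, illustrative only; the actual change is in the Java filter, see LUCENE-4955):

```python
def resolved_ngrams(token, position, start_offset, min_gram, max_gram):
    """Post-LUCENE-4955 behavior (sketch): every gram of a token keeps the
    originating token's position and offsets."""
    grams = []
    for i in range(len(token)):
        for n in range(min_gram, max_gram + 1):
            if i + n > len(token):
                break
            grams.append({
                "term": token[i:i + n],
                "position": position,               # unchanged for all grams
                "startOffset": start_offset,        # original token's offsets
                "endOffset": start_offset + len(token),
            })
    return grams

for g in resolved_ngrams("abc", 0, 5, 2, 3):
    print(g)  # every gram: position 0, offsets [5, 8)
```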

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Unassigned
              Reporter:
              Hiroaki Kawai
            • Votes:
               2
              Watchers:
               6
