First of all, my comment No.3 was not wrong, sorry. We don't have to insert a $^ token in the n-gram stream.
I don't want separate fields for the prefix, inner and suffix grams; I want to use the same single filter at query time.
I agree with that.
Next, let's consider the phrase query.
1. At store time, we want to store a sentence "This is a pen"
2. At query time, we want to query with "This is"
At store time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$ ^a a$ ^p pe en n$
At query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$
We can find the stored sequence because it contains the query sequence as a contiguous run of grams.
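To make the comparison above easy to reproduce, here is a minimal Python sketch. It is not the Lucene filter itself: combined_ngrams and contains_phrase are hypothetical helpers that only emulate what I expect WhitespaceTokenizer+CombinedNGramTokenFilter(2,2) to emit, with ^ and $ added around every whitespace-separated word, and a phrase match modelled as a contiguous run of grams.

```python
def word_grams(word, n=2):
    # Wrap a single word in the prefix/suffix markers, then slide a window of size n.
    marked = "^" + word + "$"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def combined_ngrams(text, n=2):
    # Assumption: splitting on whitespace stands in for WhitespaceTokenizer,
    # and each word is marked and grammed independently.
    grams = []
    for word in text.split():
        grams.extend(word_grams(word, n))
    return grams


def contains_phrase(stored, query):
    # A phrase query matches when the query grams appear as a contiguous run.
    m = len(query)
    return any(stored[i:i + m] == query for i in range(len(stored) - m + 1))


stored = combined_ngrams("This is a pen")
query = combined_ngrams("This is")
print(" ".join(stored))
print(" ".join(query))
print(contains_phrase(stored, query))
```

With per-word markers the query grams are a prefix of the stored grams, so the phrase check succeeds.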
If you are creating ngrams over multiple words, say a sentence, then I state that there should only be a prefix at the start of the sentence and a suffix at the end of the sentence, and that grams will contain whitespace.
If so, at query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
"^T","Th","hi","is","s "," i","is","s$"
We can't find the stored sequence because it does not contain the query sequence. An n-gram query is always a phrase query at the micro level.
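The failure can be checked the same way. The sketch below is again only an emulation in Python with hypothetical helper names (sentence_ngrams, per_word_ngrams, contains_phrase), assuming a single ^ at the start and a single $ at the end of the whole input, with whitespace kept inside the grams. It compares the query against stored grams built with either marker scheme.

```python
def sentence_ngrams(text, n=2):
    # Assumption: one prefix marker and one suffix marker around the whole
    # input; whitespace stays inside the grams.
    marked = "^" + text + "$"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def per_word_ngrams(text, n=2):
    # Per-word markers, emulating whitespace tokenization before gramming.
    grams = []
    for word in text.split():
        marked = "^" + word + "$"
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def contains_phrase(stored, query):
    # A phrase query matches when the query grams appear as a contiguous run.
    m = len(query)
    return any(stored[i:i + m] == query for i in range(len(stored) - m + 1))


query = sentence_ngrams("This is")
print(query)

# Stored with per-word markers: there is no "s " or " i" gram to match.
print(contains_phrase(per_word_ngrams("This is a pen"), query))

# Stored with sentence-level markers: the query ends in "s$",
# but the stored stream has "s " at that position.
print(contains_phrase(sentence_ngrams("This is a pen"), query))
```

Both checks fail in this emulation, because the suffix marker on the query falls in the middle of the stored sentence.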
+1 for prefix and suffix markers in the token.
Note, also, that one could use the "flags" to indicate what the token is. I know that's a little up in the air just yet, but it does exist.
Yes, there are flags, and of course we can use them. But right now I can't find a way to use them efficiently in THIS CASE.
This would mean that no stripping of special chars is required.
Unfortunately, stripping is done outside of the ngram filter by WhitespaceTokenizer.