Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      An alternative NGram filter that produces tokens with composite prefix and suffix markers.

      ts = new WhitespaceTokenizer(new StringReader("hello"));
      ts = new CombinedNGramTokenFilter(ts, 2, 2);
      assertNext(ts, "^h");
      assertNext(ts, "he");
      assertNext(ts, "el");
      assertNext(ts, "ll");
      assertNext(ts, "lo");
      assertNext(ts, "o$");
      assertNull(ts.next());
      
      1. LUCENE-1306.txt
        15 kB
        Karl Wettin
      2. LUCENE-1306.txt
        9 kB
        Karl Wettin

        Activity

        Hiroaki Kawai added a comment -

        I'm sorry, I could not see from the code above what "combined" + "ngram" means. Can I ask you to let me know the intention?

        Karl Wettin added a comment -

        The current NGram analysis in trunk is split in two: one filter for edge grams and one for inner grams.

        This patch combines them both in a single filter that uses ^prefix and suffix$ tokens if they are some sort of edge gram, or both markers around the complete token if n is great enough. There is also a method to extend if you want to add a payload (more boost to edge grams or something) or do something to the gram tokens depending on what part of the original token they contain.
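
        The behavior described above can be sketched as follows (my own illustration, not the attached patch; the class and method names are invented): wrap the token in ^ and $ and slide an n-sized window over the wrapped string, so the markers count toward n and the bigrams of "hello" come out as "^h" through "o$", matching the description.

        ```java
        import java.util.ArrayList;
        import java.util.List;

        public class CombinedNGramSketch {

            // Wrap the token in boundary markers, then emit every n-gram
            // of the wrapped string. The markers count toward n, so the
            // first bigram of "hello" is "^h" and the last is "o$".
            public static List<String> grams(String token, int n) {
                String wrapped = "^" + token + "$";
                List<String> out = new ArrayList<>();
                for (int i = 0; i + n <= wrapped.length(); i++) {
                    out.add(wrapped.substring(i, i + n));
                }
                return out;
            }

            public static void main(String[] args) {
                // Mirrors the assertions in the issue description.
                System.out.println(grams("hello", 2));
            }
        }
        ```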

        Otis Gospodnetic added a comment -

        Should there be a way for the client of this class to specify the prefix and suffix char?

        Is having, for example, "^h" as the first bi-gram token really the right thing to do? Would "^he" make more sense? I know that makes it 3 characters long, but it's 2 chars from the input string. Not sure, so I'm asking.

        Is this primarily to distinguish between the edge and inner n-grams? If so, would it make more sense to just make use of the Token type variable instead?

        Hiroaki Kawai added a comment -

        After thinking for a week, I think this idea is nice.

        IMHO, this might simply be renamed to NGramTokenizer. A general n-gram tokenizer accepts a sequence that has no gap in it. By that concept, a TokenFilter accepts a token stream (a gapped sequence), and the current NGramTokenFilter does not work well in that sense. CombinedNGramTokenFilter fills the gap with the prefix (^) and suffix ($), the token stream virtually becomes a simple stream again, and n-gram works nicely again.

        Comments:
        1. The prefix and suffix chars should be configurable, because the user must choose a char that is not used in the terms.
        2. The prefix and suffix might be a whitespace, because most users are not interested in the whitespace itself.
        3. If you want to do a phrase query (for example, "This is"), we have to generate a $^ token in the gap to make the positions valid.
        4. The n-gram algorithm should be rewritten to make the positions valid. Please see LUCENE-1225.

        I think "^h" is OK, because the prefix and suffix are chars that were introduced as a workaround.

        Grant Ingersoll added a comment -

        Note, also, that one could use the "flags" to indicate what the token is. I know that's a little up in the air just yet, but it does exist. This would mean that no stripping of special chars is required.

        Karl Wettin added a comment -

        I'll refine and document this patch soon. Terribly busy though. Hasty responses:

        Should there be a way for the client of this class to specify the prefix and suffix char?

        1. prefix and suffix chars should be configurable. Because user must choose a char that is not used in the terms.

        There are getters and setters, but nothing in the constructor.

        Is having, for example, "^h" as the first bi-gram token really the right thing to do? Would "^he" make more sense? I know that makes it 3 characters long, but it's 2 chars from the input string. Not sure, so I'm asking.

        I always considered 'start of word' and 'end of word' to be a single character and a part of n. I might be wrong though; I'll have to take a look at what other people did. It would not be very hard to include a setting for that.

        Is this primarily to distinguish between the edge and inner n-grams? If so, would it make more sense to just make use of Token type variable instead?

        one could use the "flags" to indicate what the token is.

        I might be missing something in your line of questioning. I don't understand how it would help to have the flag or token type, as they are not stored in the index.

        I don't want separate fields for the prefix, inner and suffix grams; I want to use the same single filter at query time. I typically pass down the gram boost in the payload, evaluated on gram size, how far it is from the prefix and suffix, etc.

        3. If you want to do a phrase query (for example, "This is"), we have to generate $^ token in the gap to make the positions valid.

        If you are creating ngrams over multiple words, say a sentence, then I state that there should only be a prefix at the start of the sentence and a suffix at the end of the sentence, and that grams will contain whitespace. I never did phrase queries using grams, but I'd probably want a prefix and suffix around each token. This is another good reason to keep them in the same field with prefix and suffix markers in the token, or?

        Hiroaki Kawai added a comment -

        First of all, my comment No. 3 was wrong, sorry. We don't have to insert a $^ token in the ngram stream.

        I don't want separate fields for the prefix, inner and suffix grams, I want to use the same single filter at query time.

        I agree with that.

        Then, let's consider the phrase query.
        1. At store time, we want to store a sentence "This is a pen"
        2. At query time, we want to query with "This is"

        At store time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
        ^T Th hi is s$ ^i is s$ ^a a$ ^p pe en n$

        At query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
        ^T Th hi is s$ ^i is s$

        We can find the stored sequence because it contains the query sequence.

        If you are creating ngrams over multiple words, say a sentence, then I state that there should only be a prefix in the start of the senstance and a suffix in the end of the sentance and that grams will contain whitespace.

        If so, at query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
        "^T","Th","hi","is","s "," i","is","s$"

        We can't find the stored sequence because it does not contain the query sequence. An n-gram query is always a phrase query at the micro scale.
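
        The containment argument above can be checked mechanically. A sketch (my own illustration; the class and method names are invented), assuming per-word grams with the markers counting toward n:

        ```java
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;

        public class PhraseGramCheck {

            // Per-word combined n-grams: each whitespace-separated word
            // is wrapped in ^ and $ before the n-sized window slides
            // over it, as in the WhitespaceTokenizer+filter(2,2) example.
            public static List<String> perWordGrams(String text, int n) {
                List<String> out = new ArrayList<>();
                for (String word : text.split("\\s+")) {
                    String wrapped = "^" + word + "$";
                    for (int i = 0; i + n <= wrapped.length(); i++) {
                        out.add(wrapped.substring(i, i + n));
                    }
                }
                return out;
            }

            public static void main(String[] args) {
                List<String> stored = perWordGrams("This is a pen", 2);
                List<String> query = perWordGrams("This is", 2);
                // The query grams form a contiguous sub-list of the
                // stored grams, so the micro-scale phrase query can match.
                System.out.println(Collections.indexOfSubList(stored, query) >= 0);
            }
        }
        ```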

        +1 for prefix and suffix markers in the token.

        Note, also, that one could use the "flags" to indicate what the token is. I know that's a little up in the air just yet, but it does exist.

        Yes, there is a flags field. Of course, we can use it. But I can't find a way to use it efficiently in this case right now.

        This would mean that no stripping of special chars is required.

        Unfortunately, stripping is done outside of the ngram filter by WhitespaceTokenizer.

        Karl Wettin added a comment -

        New in this patch:

        • offsets as in NGramTokenFilter
        • token type "^gram", "gram$", "^gram$" and "gram"
        • a bit of javadocs

        There is also a todo I'll have to look into some other day.

        //  todo
        //  /**
        //   * if true, the prefix and suffix do not count as a part of the ngram size.
        //   * E.g. '^he' has an n of 2 if true and 3 if false
        //   */
        //  private boolean usingBoundaryCharsPartOfN = true;
        

        This was not quite as simple to add as I hoped it would be, and I will try to find some time to fix that before I commit it.
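
        One plausible reading of the todo above (my guess at the intended semantics, not code from the patch; the class name is invented): when the boundary chars are not part of n, grams are cut from the bare token and edge grams gain a marker on top of their n content characters, so "^he" replaces "^h".

        ```java
        import java.util.ArrayList;
        import java.util.List;

        public class BoundaryGramSketch {

            // markersPartOfN == true reproduces the current behavior:
            // the window slides over "^token$" and the markers consume
            // gram positions. markersPartOfN == false is one guess at
            // the todo: grams are cut from the bare token, and edge
            // grams get a marker in addition to their n content chars.
            public static List<String> grams(String token, int n, boolean markersPartOfN) {
                String source = markersPartOfN ? "^" + token + "$" : token;
                List<String> out = new ArrayList<>();
                for (int i = 0; i + n <= source.length(); i++) {
                    String gram = source.substring(i, i + n);
                    if (!markersPartOfN) {
                        if (i == 0) gram = "^" + gram;
                        if (i + n == source.length()) gram = gram + "$";
                    }
                    out.add(gram);
                }
                return out;
            }

            public static void main(String[] args) {
                System.out.println(grams("hello", 2, true));
                System.out.println(grams("hello", 2, false));
            }
        }
        ```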

        Hiroaki Kawai added a comment -

        The files look good to me.

        Otis Gospodnetic added a comment -

        Could/should this not be folded into the existing Ngram code in contrib?

        Erick Erickson added a comment -

        SPRING_CLEANING_2013 We can reopen if necessary.


          People

          • Assignee: Karl Wettin
          • Reporter: Karl Wettin
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
            • Updated:
            • Resolved: