
LUCENE-4766: Pattern token filter which emits a token for every capturing group

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      The PatternTokenizer either splits on matches or allows you to specify a single capture group. This is insufficient for my needs: quite often I want to capture multiple overlapping tokens in the same position.

      I've written a pattern token filter which accepts multiple patterns and emits tokens for every capturing group that is matched in any pattern.
      Patterns are not anchored to the beginning and end of the string, so each pattern can produce multiple matches.

      For instance, a pattern like:

          "(([a-z]+)(\d*))"
      

      when matched against:

          "abc123def456"
      

      would produce the tokens:

          abc123, abc, 123, def456, def, 456
      

      Multiple patterns can be applied, e.g. these patterns could be used for camelCase analysis:

          "([A-Z]{2,})",
          "(?<![A-Z])([A-Z][a-z]+)",
          "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
          "([0-9]+)"
      

      When matched against the string "letsPartyLIKEits1999_dude", they would produce the tokens:

          lets, Party, LIKE, its, 1999, dude
      

      If no token is emitted, the original token is preserved.
      If the preserveOriginal flag is true, it will output the full original token (i.e. "letsPartyLIKEits1999_dude") in addition to any matching tokens (but in this case, if a matching token is identical to the original, only one copy of the full token is emitted).
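
      As a rough illustration, here is a minimal sketch of wiring this filter into an analysis chain. It assumes the PatternCaptureGroupTokenFilter constructor that was eventually committed for this issue (a TokenStream, a preserveOriginal flag, and a varargs list of Patterns); tokenizer construction differs between Lucene versions, so the no-arg WhitespaceTokenizer plus setReader used here is just one possibility:

          import java.io.StringReader;
          import java.util.regex.Pattern;

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.core.WhitespaceTokenizer;
          import org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          public class CaptureGroupDemo {
            public static void main(String[] args) throws Exception {
              // Tokenize as usual, then let the capture-group filter stack
              // extra tokens on top of each incoming token.
              WhitespaceTokenizer source = new WhitespaceTokenizer();
              source.setReader(new StringReader("letsPartyLIKEits1999_dude"));

              TokenStream stream = new PatternCaptureGroupTokenFilter(
                  source,
                  true, // preserveOriginal: also emit the full original token
                  Pattern.compile("([A-Z]{2,})"),
                  Pattern.compile("(?<![A-Z])([A-Z][a-z]+)"),
                  Pattern.compile("(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)"),
                  Pattern.compile("([0-9]+)"));

              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              stream.reset();
              while (stream.incrementToken()) {
                // Expected (per the example above): the full original token plus
                // lets, Party, LIKE, its, 1999, dude
                System.out.println(term.toString());
              }
              stream.end();
              stream.close();
            }
          }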

      Multiple patterns are required to allow overlapping captures, but this also means that individual patterns are less dense and easier to understand.

      This is my first Java code, so apologies if I'm doing something stupid.

      1. LUCENE-4766.patch
        31 kB
        Simon Willnauer
      2. LUCENE-4766.patch
        29 kB
        Clinton Gormley
      3. LUCENE-4766.patch
        29 kB
        Clinton Gormley
      4. LUCENE-4766.patch
        29 kB
        Simon Willnauer
      5. LUCENE-4766.patch
        23 kB
        Clinton Gormley

        Activity

        Clinton Gormley added a comment -

        Patch implementing org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter and org.apache.lucene.analysis.pattern.TestPatternCaptureGroupTokenFilter

        Simon Willnauer added a comment -

        Hey Clinton, this looks very interesting and, given that this is your first Java experience, pretty impressive too. I am not sure how expensive this filter will be in practice, but given that you can do stuff you can't do with any of the other filters, I think folks just have to pay the price here. I like that all patterns operate on the same CharSequence and that you are setting offsets right. Cool stuff!

        Adrien Grand added a comment -

        I like that all patterns operate on the same CharSequence and that you are setting offsets right.

        Does this filter need to set offsets? I'm worried that under certain circumstances filters that modify offsets might create inconsistent offset graphs (because they don't know what filters have been applied before; there is an exclusion list for filters that modify offsets in TestRandomChains).

        Simon Willnauer added a comment -

        Does this filter need to set offsets? I'm worried that under certain circumstances filters that modify offsets might create inconsistent offset graphs (because they don't know what filters have been applied before, there is an exclusion list for filters that modify offsets in TestRandomChains).

        Yeah, I agree offsets are tricky here. I just wonder if we really should restrict our TF to not fix offsets? Kind of an odd thing, though. What should a token filter like this do instead?

        Simon Willnauer added a comment -

        I updated the patch and added a FilterFactory etc. to make tests pass. It still contains the offset fix, and I personally don't really know what the TF should do in those cases.

        Adrien Grand added a comment -

        I just wonder if we really should restrict our TF to not fix offsets? Kind of an odd thing though. What should a tokenfilter like this do instead?

        I think that for some examples, it makes sense not to fix offsets? In the case of the URL example ((https?://([a-zA-Z\-_0-9.]+))), I think it makes sense to highlight the whole URL (including the leading http(s)://) even if the query term is just www.mysite.com. On the other hand, it could be weird if the goal was to split a long CamelCase token (letsPartyLIKEits1999_dude), but maybe this should be done by a Tokenizer rather than a TokenFilter?

        (No strong feeling here, I'd just like to see if we can find a way to commit this patch without having to grow our TokenFilter exclusion list.)

        Robert Muir added a comment -

        (No strong feeling here, I'd just like to see if we can find a way to commit this patch without having to grow our TokenFilter exclusion list.)

        I don't think tokenfilters should change offsets in general. This is not possible to do correctly. In general, if you are splitting and creating tokens like this... it's a tokenizer, not a tokenfilter. And only a tokenizer can set offsets, because it's the only one that has access to the charfilter correction data.

        Besides, trying to be a token 'creator' over an incoming tokenstream graph is really hard to get right.

        So I would prefer if this filter either became a tokenizer or did not change offsets at all. Then we can probably get it committed without hassle.

        We cannot allow this exclusion list to grow. It's not an exclusion list that says 'it's OK to add more broken filters'. It's a list of filters that will get deleted from Lucene soon unless someone fixes them, because we have to stop IndexWriter from writing invalid data into the term vectors here.

        Also, the test should call checkRandomData().

        Clinton Gormley added a comment -

        Is it OK for a tokenizer to create multiple tokens in the same positions but with different offsets? E.g.:

        "foobar" -> ["foobar","foo","bar"] with positions [1,1,1] and startOffsets [0,0,3]?
        

        Having a look at the WordDelimiter token filter, with preserveOriginal set to true, it increments the position for each new offset, e.g.:

        "fooBar" -> ["fooBar","foo","Bar"] with positions [1,1,2] and startOffsets [0,0,3].
        

        I'm asking because I'm not sure exactly how positions and offsets get used elsewhere, and so what the correct behaviour is. From my naive understanding, the WordDelimiter filter can produce spurious results with phrase searches, e.g. matching "fooBar Bar".

        Robert Muir added a comment -

        The positions are used for searching, offsets for highlighting.

        So you can (unfortunately) set the offsets to whatever you want; it won't affect searches. Instead it will only cause problems for highlighting. An example of this is: https://issues.apache.org/jira/browse/SOLR-4137

        For a tokenfilter, it doesn't make sense to change offsets, because a tokenizer already broke the document into words and mapped them back to their original location in the document.

        If a tokenfilter REALLY needs to change offsets, then it's a sign it's subclassing the wrong analysis type and should be a tokenizer: because it's trying to break the document into words, not just alter an existing tokenization.

        Adrien Grand added a comment -

        Is it OK for a tokenizer to create multiple tokens in the same positions but with different offsets?

        Although it's not common, it is perfectly fine for a Tokenizer to generate multiple tokens in the same position.

        However, I think the correct way to tokenize your example would be:

        tokens: foo, foobar, bar
        positions: 1, 1, 2
        position lengths: 1, 2, 1
        start offsets: 0, 0, 3
        end offsets: 3, 6, 6
        

        I'm not sure WordDelimiterFilter is the best example to look at. I'm not familiar with it at all, but it's currently in the exclusion list for both positions and offsets (and is the culprit for SOLR-4137).
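
        As a purely illustrative sketch (not part of any patch on this issue), a toy TokenStream hard-coding these three tokens would set the standard Lucene attributes like this:

          import java.io.IOException;

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

          // Toy stream emitting foo / foobar / bar with the positions,
          // position lengths and offsets listed above.
          final class FooFoobarBarStream extends TokenStream {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
            private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
            private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

            private final String[] terms = {"foo", "foobar", "bar"};
            private final int[] posIncs  = {1, 0, 1}; // positions 1, 1, 2
            private final int[] posLens  = {1, 2, 1};
            private final int[] starts   = {0, 0, 3};
            private final int[] ends     = {3, 6, 6};
            private int i = 0;

            @Override
            public boolean incrementToken() throws IOException {
              if (i >= terms.length) {
                return false;
              }
              clearAttributes();
              termAtt.setEmpty().append(terms[i]);
              posIncAtt.setPositionIncrement(posIncs[i]);
              posLenAtt.setPositionLength(posLens[i]);
              offsetAtt.setOffset(starts[i], ends[i]);
              i++;
              return true;
            }
          }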

        Uwe Schindler added a comment - edited

        WDF is one of those "bad" examples. WDF should be included in a custom Tokenizer. WDF is always used together with WhitespaceTokenizer, so it should be included into some "WordDelimiterWhitespaceTokenizer".

        Clinton Gormley added a comment -

        OK, so I should redo this as a tokenizer, and set positionLengths correctly.

        One issue is that, because there are multiple patterns, the emitted tokens can overlap, e.g.:

           "foobarbaz" -> foo, foobar, oba, bar, baz
        

        in which case I think I would need to emit:

            positions:         1, 1, 2, 3, 5
            position lengths:  2, 4, 2, 2, 1
            start offsets:     0, 0, 0, 0, 0
            end offsets:       3, 6, 3, 3, 3
        

        Is this correct? It's starting to look quite complex...

        Simon Willnauer added a comment -

        I tend to disagree that this should be a tokenizer. IMO a tokenizer should only split tokens in a stream fashion and should not really emit tokens on the same position. This is what token filters should do. It also clashes with reusability, since you can't really reuse tokenizers; you have to decide which one you want. At some point you need to know what you are doing here, really. I don't have a definite answer, but there is currently no clean way to do what Clinton wants to do, IMO.

        Shawn Heisey added a comment -

        I use WDF with ICUTokenizer. ICUTokenizer is customized with an RBBI file for latin1 that only breaks on whitespace.

        Adrien Grand added a comment -

        I tend to disagree that this should be a tokenizer.

        Maybe another option is to change this filter so that it doesn't change offsets? Let's imagine this filter is used to break TokenFilter into Token, TokenFilter and Filter; I think it's acceptable to highlight TokenFilter as a whole when searching for "Token" or "Filter"?

        Clinton Gormley added a comment -

        New patch which preserves the offsets of the original token. Includes Simon's patch to create the filter factory.

        Robert Muir added a comment -

        The patch should add a test that uses checkRandomData.

        It would find bugs like this:

        charOffsetStart = offsetAttr.startOffset();
        charOffsetEnd = charOffsetStart + spare.length;
        

        charOffsetEnd should be offsetAttr.endOffset(). If there is a charfilter, the current calculation will be incorrect.
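
        A sketch of the corrected lines, keeping the variable names from the quoted snippet:

        charOffsetStart = offsetAttr.startOffset();
        // take the end offset from the attribute so CharFilter corrections are respected
        charOffsetEnd = offsetAttr.endOffset();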

        Simon Willnauer added a comment -

        Clinton, I think you can trash the offset attribute reference in there entirely; just don't mess with offsets at all. Can you also call BaseTokenStreamTest#assertAnalyzesToReuse and BTST#checkRandomData in your tests, please?

        Clinton Gormley added a comment -

        The charOffsetEnd now uses the correct offsetAttr.endOffset(), and I added a test using checkRandomData().

        Uwe Schindler added a comment -

        Clinton, I think you can trash the offset attribute reference in there entirely just don't mess with them at all.

        That's part of a bigger problem in the current code. The idea of this filter is to produce multiple output tokens from one input token. To make this work correctly, the new output tokens must be produced based on the original token (meaning the filter must reset each newly produced token to a clean state; otherwise unrelated and unknown attributes may stay alive with wrong values, especially if a later TokenFilter changes attributes, e.g. a SynonymFilter inserting more synonyms). The problem Clinton had was that he had to re-set the offset attribute (although he does not change it), but he missed possible other attributes on the stream that he does not know about.

        If you look at other filters doing similar things, like SynonymFilter or WDF, the way it has to work is like this:

        • The first token emitted is the "original" one, maybe modified.
        • All "inserted" tokens are cloned from the original (first) token; use captureState/restoreState to do that. This will initialize the attribute source to the exact same state as the original (unmodified) token. After you call restoreState, you can modify the attributes you care about (like the term text) and setPositionIncrement(0). You can then leave the offset (and other unknown attributes that may be on the token stream) unchanged; don't reference them at all.
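
        As a rough illustration of this capture/restore approach (a sketch only, not the committed PatternCaptureGroupTokenFilter; the pattern-matching helper is hypothetical):

          import java.io.IOException;
          import java.util.ArrayDeque;
          import java.util.Deque;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.util.AttributeSource;

          // Sketch: emit the original token first, then stack derived tokens
          // on the same position by cloning the original token's full state.
          public final class CaptureGroupLikeFilter extends TokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
            private final Deque<String> pending = new ArrayDeque<>();
            private AttributeSource.State savedState;

            public CaptureGroupLikeFilter(TokenStream input) {
              super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!pending.isEmpty()) {
                // Restore every attribute of the original token (offsets, type,
                // unknown attributes), then overwrite only the term text and
                // the position increment.
                restoreState(savedState);
                termAtt.setEmpty().append(pending.pop());
                posIncAtt.setPositionIncrement(0);
                return true;
              }
              if (!input.incrementToken()) {
                return false;
              }
              collectDerivedTokens(termAtt.toString());
              if (!pending.isEmpty()) {
                savedState = captureState(); // remember the original token's state
              }
              return true; // the original token is emitted first, unmodified
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              pending.clear();
              savedState = null;
            }

            // Hypothetical helper standing in for the regex capture-group
            // matching; it would fill 'pending' with the matched group texts.
            private void collectDerivedTokens(String original) {
              // e.g. run each configured Pattern over 'original' and add
              // every matched group to 'pending'.
            }
          }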
        Simon Willnauer added a comment -

        Here is a new patch, updated to trunk and moved over to use capture & restore state. I think this one is ready; I will commit it today if nobody objects.

        Simon Willnauer added a comment -

        argh... I missed svn add... here is the right patch

        Adrien Grand added a comment -

        +1

        Commit Tag Bot added a comment -

        [trunk commit] simonw
        http://svn.apache.org/viewvc?view=revision&revision=1471347

        LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to emit multiple tokens one for each capture group in one or more patterns

        Commit Tag Bot added a comment -

        [branch_4x commit] simonw
        http://svn.apache.org/viewvc?view=revision&revision=1471352

        LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to emit multiple tokens one for each capture group in one or more patterns

        Simon Willnauer added a comment -

        committed, thanks clinton

        Jack Krupansky added a comment -

        I just happened to notice that the underlying token filter accepts a list of patterns, but the factory only accepts a single pattern.

        Was this intentional or an oversight?

        In fact, the main example for the filter requires multiple patterns, which the factory does not support.

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Clinton Gormley
          • Votes:
            0
            Watchers:
            5
