Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1775

Change remaining contrib streams/filters to use new TokenStream API

Details

    • Task
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 2.9
    • modules/analysis
    • None
    • New

    Description

      All other contrib streams/filters have already been converted with LUCENE-1460.

      The two shingle filters are the last ones we need to convert.

      Attachments

        1. lucene-contrib-sinks.patch
          19 kB
          Michael Busch
        2. lucene-contrib-sinks.patch
          19 kB
          Robert Muir
        3. lucene-1775.patch
          16 kB
          Michael Busch
        4. lucene-1775.patch
          47 kB
          Michael Busch
        5. lucene-1775.patch
          54 kB
          Michael Busch
        6. lucene-1775.patch
          68 kB
          Michael Busch

        Issue Links

          Activity

            michaelbusch Michael Busch added a comment -

            First half is done:
            ShingleFilter and ShingleFilterTest are converted to the new API.

            ShingleFilter is much more efficient now, it clones much less often and computes the tokens mostly on the fly now.

            Would be good if someone could review my changes.

            We still need to convert ShingleMatrixFilter.

            michaelbusch Michael Busch added a comment - First half is done: ShingleFilter and ShingleFilterTest are converted to the new API. ShingleFilter is much more efficient now, it clones much less often and computes the tokens mostly on the fly now. Would be good if someone could review my changes. We still need to convert ShingleMatrixFilter.
            michaelbusch Michael Busch added a comment -

            Converted TestShingleMatrixFilter. Not the actual filter yet.

            michaelbusch Michael Busch added a comment - Converted TestShingleMatrixFilter. Not the actual filter yet.
            michaelbusch Michael Busch added a comment -

            ShingleMatrixFilter is a very complicated filter. It seems that it is implemented in a very inefficient way, it does lots of cloning. While I was able to fully convert ShingleFilter in a way, so that it is now much more efficient now, I'm not going to do that with the ShingleMatrixFilter.
            I don't know the code well enough to even try and with 1000 LOC it's very complex.

            The drawback of not fully converting it is that if someone uses custom Attributes, i. e. ones that are not in core Lucene, it is undefined what the filter will do with those Attributes. However, I don't even know what the behavior should be. If only core Attributes are used, everything is working fine, as the passing junits show.

            I added a corresponding comment to the javadocs of that class.

            michaelbusch Michael Busch added a comment - ShingleMatrixFilter is a very complicated filter. It seems that it is implemented in a very inefficient way, it does lots of cloning. While I was able to fully convert ShingleFilter in a way, so that it is now much more efficient now, I'm not going to do that with the ShingleMatrixFilter. I don't know the code well enough to even try and with 1000 LOC it's very complex. The drawback of not fully converting it is that if someone uses custom Attributes, i. e. ones that are not in core Lucene, it is undefined what the filter will do with those Attributes. However, I don't even know what the behavior should be. If only core Attributes are used, everything is working fine, as the passing junits show. I added a corresponding comment to the javadocs of that class.
            michaelbusch Michael Busch added a comment -

            I'll commit this in a couple of days if nobody objects.

            michaelbusch Michael Busch added a comment - I'll commit this in a couple of days if nobody objects.
            uschindler Uwe Schindler added a comment -

            This ShingleMatrixFilter is really a pain!

            I think the ShingeMatrix is very "special" and only produce tokens with few correlation to the original input stream, so it is not so bad, if the extra attributes get lost.

            You could use a simple AttributeSource instead of EmptyTokenStream and create it with the same AttributeFactory as the filter isself. Because of this, you could copyTo the extra Tokens (currently implemented by the Token instance). This reuseableToken could also be an AttributeSource? For me it is not really clear what all this copying between the attributes and the Token instance does, but it seems that it could be converted to Attributes, too. If you do it that way, would it be not work also with custom attributes? One possibility would be to copyTo the Tokens around (or use States) and then modify the shingle speicfic things.

            uschindler Uwe Schindler added a comment - This ShingleMatrixFilter is really a pain! I think the ShingeMatrix is very "special" and only produce tokens with few correlation to the original input stream, so it is not so bad, if the extra attributes get lost. You could use a simple AttributeSource instead of EmptyTokenStream and create it with the same AttributeFactory as the filter isself. Because of this, you could copyTo the extra Tokens (currently implemented by the Token instance). This reuseableToken could also be an AttributeSource? For me it is not really clear what all this copying between the attributes and the Token instance does, but it seems that it could be converted to Attributes, too. If you do it that way, would it be not work also with custom attributes? One possibility would be to copyTo the Tokens around (or use States) and then modify the shingle speicfic things.
            michaelbusch Michael Busch added a comment -

            I noticed that a shingle test uses PrefixAndSuffixAwareTokenFilter.

            So I undeprecated it and also PrefixAwareTokenFitler by using the same trick as in ShingleMatrixFilter.

            All tests pass.

            michaelbusch Michael Busch added a comment - I noticed that a shingle test uses PrefixAndSuffixAwareTokenFilter. So I undeprecated it and also PrefixAwareTokenFitler by using the same trick as in ShingleMatrixFilter. All tests pass.
            michaelbusch Michael Busch added a comment -

            Hi Uwe,

            I just noticed your comment after I attached the latest patch...

            There's probably a better way to do it. Feel free to update the patch
            My brain is tired and I really need to sleep now...

            I used the same trick in the Prefix/Suffix filters - one of the reasons I did it in those filters is it to not change the updateSuffixToken(Token, Token) method, which I assume is there so that users can overwrite it.

            michaelbusch Michael Busch added a comment - Hi Uwe, I just noticed your comment after I attached the latest patch... There's probably a better way to do it. Feel free to update the patch My brain is tired and I really need to sleep now... I used the same trick in the Prefix/Suffix filters - one of the reasons I did it in those filters is it to not change the updateSuffixToken(Token, Token) method, which I assume is there so that users can overwrite it.
            michaelbusch Michael Busch added a comment -

            Committed revision 800195.

            michaelbusch Michael Busch added a comment - Committed revision 800195.
            rcmuir Robert Muir added a comment -

            Michael, I looked at your patch, really glad you were successful here!

            One last question, we didnt do anything with analysis/sinks (DateRecognizerSinkTokenizer, etc).
            These are deprecated as they extend SinkTokenizer... will they be removed with the old api or do we need to implement alternatives?

            rcmuir Robert Muir added a comment - Michael, I looked at your patch, really glad you were successful here! One last question, we didnt do anything with analysis/sinks (DateRecognizerSinkTokenizer, etc). These are deprecated as they extend SinkTokenizer... will they be removed with the old api or do we need to implement alternatives?
            michaelbusch Michael Busch added a comment -

            One last question, we didnt do anything with analysis/sinks (DateRecognizerSinkTokenizer, etc).

            Yeah I guess we should. Here is the patch.

            michaelbusch Michael Busch added a comment - One last question, we didnt do anything with analysis/sinks (DateRecognizerSinkTokenizer, etc). Yeah I guess we should. Here is the patch.
            rcmuir Robert Muir added a comment -

            same patch as Michael's, just removed a stray import org.apache.tools.ant.filters.FixCrLfFilter.AddAsisRemove;

            rcmuir Robert Muir added a comment - same patch as Michael's, just removed a stray import org.apache.tools.ant.filters.FixCrLfFilter.AddAsisRemove;
            michaelbusch Michael Busch added a comment -

            Thanks, Robert. I'll commit this later today.

            michaelbusch Michael Busch added a comment - Thanks, Robert. I'll commit this later today.
            michaelbusch Michael Busch added a comment -

            Committed revision 800606.

            michaelbusch Michael Busch added a comment - Committed revision 800606.
            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #2849.

            tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #2849 .

            People

              michaelbusch Michael Busch
              michaelbusch Michael Busch
              Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: