Lucene - Core
  LUCENE-1460

Change all contrib TokenStreams/Filters to use the new TokenStream API

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that we have the new TokenStream API (LUCENE-1422) we should change all contrib modules to use it.

      1. LUCENE-1460_contrib_partial.txt
        118 kB
        Robert Muir
      2. LUCENE-1460_contrib_partial.txt
        72 kB
        Robert Muir
      3. LUCENE-1460_contrib_partial.txt
        47 kB
        Robert Muir
      4. LUCENE-1460_core.txt
        12 kB
        Robert Muir
      5. LUCENE-1460_partial.txt
        58 kB
        Robert Muir
      6. lucene-1460.patch
        268 kB
        Michael Busch
      7. lucene-1460.patch
        259 kB
        Michael Busch
      8. lucene-1460.patch
        244 kB
        Michael Busch
      9. lucene-1460.patch
        191 kB
        Michael Busch
      10. LUCENE-1460.patch
        267 kB
        Robert Muir

        Issue Links

          Activity

          Robert Muir added a comment -

          Is anyone working on this? I have some functionality that needs some of these to use the new API, so I have at least half of them done.

          Mark Miller added a comment -

          Wanna post what you have, Robert? Perhaps then someone will finish off the other half.

          Robert Muir added a comment -

          Only a partial solution...
          Some of the analyzers don't have any tests, so I think that's a bit more work!
          The AsciiFoldingFilter fix is in here too; I know it's not in contrib, but it doesn't support the new API either.

          Simon Willnauer added a comment -

          Robert, would you divide your patch into contrib/core parts? That would make it easier to help out and eventually commit it.
          Thanks

          Robert Muir added a comment -

          split patch: core/contrib

          Uwe Schindler added a comment -

          I added the ASCIIFoldingFilter patch into LUCENE-1693 (patch comes shortly). So this part is finished.

          Michael Busch added a comment -

          Now that LUCENE-1693 is almost committed, we can tackle this one.

          We need to keep these points in mind that Uwe posted on 1693:

          • rewrite and replace next(Token)/next() implementations by new API
          • if the class is final, no next(Token)/next() methods needed (must be removed!!!)
          • if the class is non-final add the following methods to the class:
                /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
                 * not be overridden. Delegates to the backwards compatibility layer. */
                public final Token next(final Token reusableToken) throws java.io.IOException {
                  return super.next(reusableToken);
                }
          
                /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
                 * not be overridden. Delegates to the backwards compatibility layer. */
                public final Token next() throws java.io.IOException {
                  return super.next();
                }
          

          Also the incrementToken() method must be final in this case.
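
          Applying those rules, the conversion of a final filter might look like the following minimal sketch. The filter class and its lower-casing behavior are made up for illustration; only TokenFilter, TermAttribute, and incrementToken() are the actual Lucene 2.9 names. Because the class is final, the deprecated next(Token)/next() methods are simply removed rather than delegated:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical contrib filter converted to the new attribute-based API.
public final class LowerCaseExampleFilter extends TokenFilter {

  private final TermAttribute termAtt;

  public LowerCaseExampleFilter(TokenStream input) {
    super(input);
    // Register/fetch the term attribute once in the constructor.
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  // New API: mutate the attributes in place instead of returning a Token.
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Work directly on the shared term buffer; no Token cloning needed.
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }
}
```

          Note that incrementToken() is final here as well, matching the rule above for the non-final case; for a final class that is automatic.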

          Robert Muir added a comment -

          Michael, I think we should try to coordinate timing on this with LUCENE-1728, where we are trying to reorganize contrib/analyzers...

          Michael Busch added a comment -

          Michael, I think we should try to coordinate timing on this with LUCENE-1728, where we are trying to reorganize contrib/analyzers...

          It seems like 1728 is ready to commit? Simon said on java-dev he will try to finish it by end of this week?

          In that case we should commit 1728 first. But we can finish the patch here I think... after 1728 is committed we just do a find&replace on the patch and replace "contrib/analyzers/" with "contrib/analyzers/common/" ?

          Simon Willnauer added a comment -

          It seems like 1728 is ready to commit? Simon said on java-dev he will try to finish it by end of this week?

          That is correct. I can commit it today I think. Will make this issue dependent on 1728 and finish it by the end of today.

          Simon

          Michael Busch added a comment -

          Cool! Thanks, Simon.

          Robert Muir added a comment -

          Michael, after 1728 I can take another look at this. The reason is that I added some tests to these analyzers and found a bug in the Thai offsets.

          When I submitted this, I only duplicated the existing behavior, but I don't want to reintroduce the bug into incrementToken().

          Simon Willnauer added a comment -

          LUCENE-1728 is committed. You can go ahead on this.

          Thanks

          Robert Muir added a comment -

          I'll work on updating the patch.

          Question: instead of converting the tests to the new API, would it be more beneficial to test both? testNewAPI() / testOldAPI()

          Uwe Schindler added a comment (edited)

          Question: instead of converting the tests to the new API, would it be more beneficial to test both? testNewAPI() / testOldAPI()

          No need to do this. The correct backwards-compatibility of the new API is checked separately by the extra test in LUCENE-1693.

          It is enough if you remove all next() methods, add incrementToken() and make the above mentioned changes to make the token streams final where possible (see the comment of Michael Busch above – which is the "howto" for the conversion).

          The tests should only test the new API, but until they are also converted, the old tests should also still work (because of the backwards-compatibility layer). Be sure to apply LUCENE-1693 (which does not change anything in contrib) before starting to create a patch, so you have the newest API. Hopefully Michael will have committed 1693 by tomorrow.

          LUCENE-1460_core.txt is already in LUCENE-1693 as it is part of core token streams.

          Robert Muir added a comment -

          Uwe, thanks!

          Yes, I will follow your guidelines that Michael re-posted here.

          Additionally, some TokenStreams in contrib are not final but have not been included in a release yet (example: Arabic; my fault).
          Any objection to making these final, so I can just remove the old API instead of declaring final, deprecated next() methods?

          Uwe Schindler added a comment -

          No, make them final. The same happened with ASCIIFoldingFilter: it was never released, so it can be changed.

          In general, all TokenStreams should be final or have final impls (see LUCENE-1753 for the explanation). The only real example of a non-final one is CharTokenizer, which is abstract, but its tokenization impls are final.

          Uwe Schindler added a comment -

          Making next() final is only a last resort to prevent users from overriding never-called next() methods, if the class was not final before. This keeps the backwards-compatibility break as small as possible (by only making these methods final). See the discussion in 1693.

          Robert Muir added a comment -

          I updated the previous patch to the current reality; it's still incomplete.

          Robert Muir added a comment -

          Some more done (the rest of the various language analyzers, at least).

          Robert Muir added a comment -

          I have visitors coming into town, so if you feel like punishing yourself, just let me know so we don't duplicate efforts.
          I promise I left all the fun ones still to be converted.

          Michael Busch added a comment -

          I have visitors coming into town, so if you feel like punishing yourself, just let me know so we don't duplicate efforts.
          I promise I left all the fun ones still to be converted.

          Yay I'm excited!

          I think when these patches are committed I don't want to hear the word "TokenStream" for a while...

          I can work on this Friday and during the weekend. Let me know which ones you're gonna work on and which ones I should take.

          Btw: Thanks for all your work here, Robert!!

          Robert Muir added a comment -

          Michael, I'm not gonna be able to work on this at all until Monday at least; that's what I am saying.
          So I will check with you Monday and see how things are.

          Robert Muir added a comment -

          Michael, oh also: as Uwe mentioned, I applied LUCENE-1693 before creating this patch.

          If you do this, all the tests will pass; otherwise I am not so sure!

          Robert Muir added a comment -

          Michael/anyone else: if you apply the patch (it should go cleanly now that 1693 is committed), then the easiest way to see what remains to be done is to declare TokenStream.incrementToken abstract.

          This still causes my Eclipse to light up like a Christmas tree, so there is a lot to do.

          Michael Busch added a comment -

          Thanks, Robert!

          I'm currently working on LUCENE-1448, will finish that tonight.
          Tomorrow and/or Sunday I'll work on this one.

          Michael Busch added a comment -

          I converted some more contribs...

          Michael Busch added a comment -

          More progress...
          ngram was a bit tricky, but I think it is implemented much more efficiently now. It used to clone every token it returned; now it only clones the term that it receives from the input stream.

          Would be good if someone could take a look at the ngram changes... well, the test cases pass.

          Michael Busch added a comment -

          Btw: I'm taking a tokenstream break today... so if anyone feels the sudden urge to convert some of the remaining streams: don't hesitate - it won't conflict with my work, the patch I posted late last night is still current.

          I'll try to continue tomorrow.

          Robert Muir added a comment -

          Michael, looks like you got a ton done. I'll take a look late Sunday/Monday at what you did with ngram, out of curiosity at least.

          If you get a moment, maybe someone could review what I did with Thai; I didn't look too hard to see whether save/restore state was worse than the previous cloning...

          Thanks for tackling the tougher ones here.

          Michael Busch added a comment -

          Some more progress - mostly in contrib/memory.

          Robert Muir added a comment -

          Michael, I looked at your patch.

          What do you think about the remaining ones? Should they be left as is for now?
          Or do you think some of these should still expose Token (i.e. in their public/protected methods), but just for back compat/convenience, and work with the new API behind the scenes?

          Robert Muir added a comment -

          with analyzers/compound

          Michael Busch added a comment -

          with analyzers/compound

          Cool. Thanks for all your work here!

          if you get a moment maybe someone could review what I did with Thai

          I'll review it before we commit this.

          Robert Muir added a comment -

          Michael, sorry to leave it incomplete; I think I am not the best person for the remaining ones.

          For example, I am a little intimidated by things such as this note in ShingleMatrix:

            * This method exists in order to avoid reursive calls to the method
            * as the complexity of a fairlt small matrix then easily would require
            * a gigabyte sized stack per thread.
          
          Michael Busch added a comment -

          I wonder if we should just deprecate PrefixAwareTokenFilter and PrefixAndSuffixAwareTokenFilter. I don't really understand what they're supposed to be used for, and I find the implementation a bit strange.

          I don't think it's possible to convert them in a backwards-compatible way anyway, and writing a replacement doesn't seem really worthwhile - or does someone use these?

          Michael Busch added a comment -

          For example I am a little intimidated by things such as this note in ShingleMatrix:

          Same here. The shingle ones are the only remaining ones... I'll try, but I don't know the code at all. They do a lot of caching and stuff - so probably the hardest to convert.

          Robert Muir added a comment -

          so probably the hardest to convert.

          I was thinking, there are varying degrees of conversion.
          I think with the new API you could reduce cloning and really optimize these; or, at the other extreme, you are almost merely pushing the back-compat logic up to that level... kinda like my compound hack.
          (I tried the latter and was unsuccessful even doing that for the shingles.)

          Michael Busch added a comment -

          I'll commit this current patch soon and then make a separate patch for the shingle filters.

          Michael Busch added a comment -

          Latest patch that converts all streams, except the shingle ones.

          I'll commit this soon.

          Michael Busch added a comment -

          Committed.

          Michael Busch added a comment -

          Robert, thank you for all your help here!!!

          Uwe Schindler added a comment -

          You two guys are the best - great work! I had no time to completely review it, but I am happy that the new TokenStream API seems so successful.


            People

            • Assignee:
              Michael Busch
              Reporter:
              Michael Busch
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:
