Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The addition of reusableTokenStream to the core analyzers unfortunately broke back compat of external subclasses:

      http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html

      On upgrading, such subclasses would silently not be used anymore, since Lucene's indexing invokes reusableTokenStream.

      I think we should should at least deprecate Analyzer.tokenStream, today, so that users see deprecation warnings if their classes override this method. But going forward when we want to change the API of core classes that are extended, I think we have to introduce entirely new classes, to keep back compatibility.

      1. LUCENE-1678.patch
        8 kB
        Michael McCandless

        Activity

        Hide
        Grant Ingersoll added a comment -

        I frankly don't like renaming something like this. This is, once again, a case of back compatibility biting us. If instead of working around back compat. we had just made Analyzer.tokenStream be reusable, we wouldn't have to do this. Now, instead, we are going to have a convoluted name for something (reusableTS).

        In my mind, better to just make .tokenStream do the right thing and get rid of reusableTokenStream.

        Show
        Grant Ingersoll added a comment - I frankly don't like renaming something like this. This is, once again, a case of back compatibility biting us. If instead of working around back compat. we had just made Analyzer.tokenStream be reusable, we wouldn't have to do this. Now, instead, we are going to have a convoluted name for something (reusableTS). In my mind, better to just make .tokenStream do the right thing and get rid of reusableTokenStream.
        Hide
        Earwin Burrfoot added a comment -

        Second this. Though I lost any hope for sane Lucene release/compat rules.

        Show
        Earwin Burrfoot added a comment - Second this. Though I lost any hope for sane Lucene release/compat rules.
        Hide
        Mark Miller added a comment -

        >>Second this. Though I lost any hope for sane Lucene release/compat rules.

        Why? Have you seen anyone arguing for anything else?

        If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object.

        Its a complicated topic that has come up for discussion many times, but I don't think the current policy is insane. And I have seen most people supporting whatever is best for Lucene. But - see all of the posts on the topic. Its complicated. Nobody even really torpedoed anything, its more that enough issues were raised and no one with a proper amount of authority felt comfortable stepping up to the plate. Mike was gung ho for it for a while, and even he backed off. Thats a great indication to me that the issue is not simple. Back compat currently is not insane, but I think we all agree it should be loosened somehow in the future.

        The way Lucene stuff generally goes, if someone like Grant or Mike really wanted to push changes, the changes would happen. I think they both see that the effort involved in such a change is not small though. Back compat is like our constitution. Its a pain in the butt to change in a way that everyone could get on board with. Even still, if someone really wanted to, they could probably push through that. It seems we havn't gotten to such a point with anyone yet though.

        Giving up is really not the answer though - thats why the discussion has come and gone in the past. The effort to get anything done grew (in terms of ideas as much as any implementation), and one by one, the participants dropped out.

        Show
        Mark Miller added a comment - >>Second this. Though I lost any hope for sane Lucene release/compat rules. Why? Have you seen anyone arguing for anything else? If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object. Its a complicated topic that has come up for discussion many times, but I don't think the current policy is insane. And I have seen most people supporting whatever is best for Lucene. But - see all of the posts on the topic. Its complicated. Nobody even really torpedoed anything, its more that enough issues were raised and no one with a proper amount of authority felt comfortable stepping up to the plate. Mike was gung ho for it for a while, and even he backed off. Thats a great indication to me that the issue is not simple. Back compat currently is not insane, but I think we all agree it should be loosened somehow in the future. The way Lucene stuff generally goes, if someone like Grant or Mike really wanted to push changes, the changes would happen. I think they both see that the effort involved in such a change is not small though. Back compat is like our constitution. Its a pain in the butt to change in a way that everyone could get on board with. Even still, if someone really wanted to, they could probably push through that. It seems we havn't gotten to such a point with anyone yet though. Giving up is really not the answer though - thats why the discussion has come and gone in the past. The effort to get anything done grew (in terms of ideas as much as any implementation), and one by one, the participants dropped out.
        Hide
        Earwin Burrfoot added a comment -

        If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object.

        It's not a matter of finding a smart way. It is a matter of sacrifice that has to be made and readiness to take the blame for decision that can be unpopular with someone.
        You go zealously for back-compat - you sacrifice readability/maintainability of your code but free users from any troubles when they want to 'simply upgrade'. You adopt more relaxed policy - you sacrifice users' time, but in return you gain cleaner codebase and new stuff can be written and used faster.
        There's no way to ride two horses at once.

        Some people are comfortable with current policies. Few cringe when they hear things like above. Most theoretically want to relax the rules. Nobody's ready to give up something for it.

        Okay, there's an escape hatch I (and someone else) mentioned on the list before. Adopting a fixed release cycle with small intervals between releases (compared to what we have now). Fixed - as in, releases are made each N months instead of when everyone feels they finished and polished up all their pet projects and there's nothing else exciting to do. That way we can keep the current policy, but deletion-through-deprecation approach will work, at last!
        This solution is halfassed, I can already see discussions like "That was a big change, let's keep the deprecates around longer, say - for a couple of releases.", it doesn't solve good-name-thrashing problem, as you have to go through two rounds of deprecation to change semantics on something, but keep the name.
        But this is something better than what we have now, a-a-and this is something that needs commiter backing.

        Thats a great indication to me that the issue is not simple.

        The issue is simple, the choice is not. And maintaining status quo is free.

        Giving up is really not the answer though

        It is the answer. I have no moral right to hammer my ideals into heads that did tremendously more for the project, than I did. And maintaining a patch queue over Lucene trunk is not 'that' hard.

        Show
        Earwin Burrfoot added a comment - If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object. It's not a matter of finding a smart way. It is a matter of sacrifice that has to be made and readiness to take the blame for decision that can be unpopular with someone. You go zealously for back-compat - you sacrifice readability/maintainability of your code but free users from any troubles when they want to 'simply upgrade'. You adopt more relaxed policy - you sacrifice users' time, but in return you gain cleaner codebase and new stuff can be written and used faster. There's no way to ride two horses at once. Some people are comfortable with current policies. Few cringe when they hear things like above. Most theoretically want to relax the rules. Nobody's ready to give up something for it. Okay, there's an escape hatch I (and someone else) mentioned on the list before. Adopting a fixed release cycle with small intervals between releases (compared to what we have now). Fixed - as in, releases are made each N months instead of when everyone feels they finished and polished up all their pet projects and there's nothing else exciting to do. That way we can keep the current policy, but deletion-through-deprecation approach will work, at last! This solution is halfassed, I can already see discussions like "That was a big change, let's keep the deprecates around longer, say - for a couple of releases.", it doesn't solve good-name-thrashing problem, as you have to go through two rounds of deprecation to change semantics on something, but keep the name. But this is something better than what we have now, a-a-and this is something that needs commiter backing. Thats a great indication to me that the issue is not simple. The issue is simple, the choice is not. And maintaining status quo is free. Giving up is really not the answer though It is the answer. I have no moral right to hammer my ideals into heads that did tremendously more for the project, than I did. And maintaining a patch queue over Lucene trunk is not 'that' hard.
        Hide
        Grant Ingersoll added a comment -

        If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object.

        The sane/smart way is to do it on a case by case basis. Here is a specific case. Generalizing it a bit, the place where it should be more easily relaxable are the cases where we know very few people make customizations, as in implementing Fieldable or FieldCache.

        As for this specific case, the original change was the thing that broke back compat. So, given it is already broken, why not fix it the right way?

        Show
        Grant Ingersoll added a comment - If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object. The sane/smart way is to do it on a case by case basis. Here is a specific case. Generalizing it a bit, the place where it should be more easily relaxable are the cases where we know very few people make customizations, as in implementing Fieldable or FieldCache. As for this specific case, the original change was the thing that broke back compat. So, given it is already broken, why not fix it the right way?
        Hide
        Michael McCandless added a comment -

        So, given it is already broken, why not fix it the right way?

        Because two wrongs don't make a right?

        (I assume you're suggesting changing tokenStream to match reusableTokenStream, ie allowing it to return a reused TokenStream between calls, and then deprecating reusableTokenStream).

        Apps that get multiple TokenStreams from a single Analyzer and then iterate through them, would silently break, if we up and made this 2nd non-back-compatible change.

        Show
        Michael McCandless added a comment - So, given it is already broken, why not fix it the right way? Because two wrongs don't make a right? (I assume you're suggesting changing tokenStream to match reusableTokenStream, ie allowing it to return a reused TokenStream between calls, and then deprecating reusableTokenStream). Apps that get multiple TokenStreams from a single Analyzer and then iterate through them, would silently break, if we up and made this 2nd non-back-compatible change.
        Hide
        Michael McCandless added a comment -

        The sane/smart way is to do it on a case by case basis.

        Right, and the huge periodic discussions on back-compat do soften
        "our" stance on these. For example LUCENE-1542 was just such a case,
        where we chose to simply fix the [rather nasty] bug at the expense of
        possible apps relying on the broken behavior.

        LUCENE-1679 is another (rather trivial) example, where we plan to
        change certain fields in WildcardTermEnum to be final.

        Show
        Michael McCandless added a comment - The sane/smart way is to do it on a case by case basis. Right, and the huge periodic discussions on back-compat do soften "our" stance on these. For example LUCENE-1542 was just such a case, where we chose to simply fix the [rather nasty] bug at the expense of possible apps relying on the broken behavior. LUCENE-1679 is another (rather trivial) example, where we plan to change certain fields in WildcardTermEnum to be final.
        Hide
        Michael McCandless added a comment -

        Mike was gung ho for it for a while, and even he backed off.

        Well... my particular itch (most recently!) was an addition to Lucene
        that'd let us conditionalize the default settings so that new users
        get the latest & greatest, but back-compat users can easily preserve
        old behavior.

        Ie, it was a software change, not a policy change; I tried hard to
        steer clear of any proposed changes to back-compat policy.

        But, for better or worse, back-compat policy is one of those
        "magnetic" topics: whenever you get too close to it, it suddenly
        sticks to you and takes over your thread.

        And in the end we arrived at a workable solution to my particular
        itch, which is to make such settings explicit or switch to new APIs
        that change the defaults (eg the new FSDir.open).

        That said, improving our back compat policy is an important and
        amazingly complex topic.

        Show
        Michael McCandless added a comment - Mike was gung ho for it for a while, and even he backed off. Well... my particular itch (most recently!) was an addition to Lucene that'd let us conditionalize the default settings so that new users get the latest & greatest, but back-compat users can easily preserve old behavior. Ie, it was a software change, not a policy change; I tried hard to steer clear of any proposed changes to back-compat policy. But, for better or worse, back-compat policy is one of those "magnetic" topics: whenever you get too close to it, it suddenly sticks to you and takes over your thread. And in the end we arrived at a workable solution to my particular itch, which is to make such settings explicit or switch to new APIs that change the defaults (eg the new FSDir.open). That said, improving our back compat policy is an important and amazingly complex topic.
        Hide
        Michael McCandless added a comment -

        The way Lucene stuff generally goes, if someone like Grant or Mike really wanted to push changes, the changes would happen.

        Well, it's consensus that we all need to reach (at least enough
        consensus to vote on it), and on complex topics it's not easy to get
        to consensus.

        Giving up is really not the answer though - thats why the discussion has come and gone in the past.

        I don't think anyone has given up. The issue still smoulders and
        flares up here and there (like, this issue). Eventually we'll get
        enough consensus for something concrete to change.

        I have no moral right to hammer my ideals into heads that did tremendously more for the project, than I did.

        In fact you do & should. This is exactly how change happens. Here's
        a great (though sexist) quote:

        "The reasonable man adapts himself to the world; the unreasonable one persists to adapt the world to himself. Therefore all progress depends on the unreasonable man." - George Bernard Shaw

        Show
        Michael McCandless added a comment - The way Lucene stuff generally goes, if someone like Grant or Mike really wanted to push changes, the changes would happen. Well, it's consensus that we all need to reach (at least enough consensus to vote on it), and on complex topics it's not easy to get to consensus. Giving up is really not the answer though - thats why the discussion has come and gone in the past. I don't think anyone has given up. The issue still smoulders and flares up here and there (like, this issue). Eventually we'll get enough consensus for something concrete to change. I have no moral right to hammer my ideals into heads that did tremendously more for the project, than I did. In fact you do & should. This is exactly how change happens. Here's a great (though sexist) quote: "The reasonable man adapts himself to the world; the unreasonable one persists to adapt the world to himself. Therefore all progress depends on the unreasonable man." - George Bernard Shaw
        Hide
        Michael McCandless added a comment -

        Adopting a fixed release cycle with small intervals between releases (compared to what we have now).

        I think this is almost a good solution, though instead of "fixed" it
        could be that we try [harder] to do major releases more frequently.
        Let's face it: Lucene is changing quite quickly now, so it seems
        reasonable that the major releases also come quickly.

        I say "almost" because.... alot of the pain in implementing our
        current policy is the need to have a "stepping stone" between old and
        new. Ie, we now must always do a release that deprecates old APIs and
        introduces new ones so that you can upgrade to that, fix deprecations,
        and you know you're set for the next major release. So eg changes to
        interfaces is a big problem. If we were free to suddenly make a new
        major releases, with instructions on how to migrate old -> new, that'd
        be very liberating.

        I think nearly everyone agrees our back-compat policy is exceptionally
        costly. On a given interesting change to Lucene, a very large part of
        the effort is spent on preserving back-compat. It causes all kinds of
        spooky code, pollutes the APIs, causes us to go forward with sub-par
        names, etc. The freedom Marvin has to make changes to Lucy is
        fabulous, though in exchange, it's not yet released...

        I think most would also agree that it's far from easy even carrying
        out the policy we have without making mistakes: this change (addition
        of reusableTokenStream) violated our policy (I did it by accident and
        nobody noticed until now). I actually believe programming languages /
        runtime envs need to provide more support for developers; we have
        inadequate tools now. But we can't wait for that...

        Show
        Michael McCandless added a comment - Adopting a fixed release cycle with small intervals between releases (compared to what we have now). I think this is almost a good solution, though instead of "fixed" it could be that we try [harder] to do major releases more frequently. Let's face it: Lucene is changing quite quickly now, so it seems reasonable that the major releases also come quickly. I say "almost" because.... alot of the pain in implementing our current policy is the need to have a "stepping stone" between old and new. Ie, we now must always do a release that deprecates old APIs and introduces new ones so that you can upgrade to that, fix deprecations, and you know you're set for the next major release. So eg changes to interfaces is a big problem. If we were free to suddenly make a new major releases, with instructions on how to migrate old -> new, that'd be very liberating. I think nearly everyone agrees our back-compat policy is exceptionally costly. On a given interesting change to Lucene, a very large part of the effort is spent on preserving back-compat. It causes all kinds of spooky code, pollutes the APIs, causes us to go forward with sub-par names, etc. The freedom Marvin has to make changes to Lucy is fabulous, though in exchange, it's not yet released... I think most would also agree that it's far from easy even carrying out the policy we have without making mistakes: this change (addition of reusableTokenStream) violated our policy (I did it by accident and nobody noticed until now). I actually believe programming languages / runtime envs need to provide more support for developers; we have inadequate tools now. But we can't wait for that...
        Hide
        Shai Erera added a comment -

        We've had this thread http://www.nabble.com/Lucene%27s-default-settings---back-compatibility-td23605466.html, and in the latest post (http://www.nabble.com/Re%3A-Lucene%27s-default-settings---back-compatibility-p23792927.html) I tried to put together some wording for a revised (and relaxed) back-compat policy. I believe it was Grant who asked for some writeup to get to the users' list, and I read also that we may want to discuss each item separately, to get to a consensus.

        Perhaps we can continue the discussion on that thread, and try to get to a consensus on any of the items? We don't necessarily need to change all of it in one day, but getting some feedback from you on any of the items can help bring that discussion back to life, and hopefully reach a consensus.

        As was said on this thread, persistence will eventually drive us to reach a consensus, so I'm being persistent .

        Show
        Shai Erera added a comment - We've had this thread http://www.nabble.com/Lucene%27s-default-settings---back-compatibility-td23605466.html , and in the latest post ( http://www.nabble.com/Re%3A-Lucene%27s-default-settings---back-compatibility-p23792927.html ) I tried to put together some wording for a revised (and relaxed) back-compat policy. I believe it was Grant who asked for some writeup to get to the users' list, and I read also that we may want to discuss each item separately, to get to a consensus. Perhaps we can continue the discussion on that thread, and try to get to a consensus on any of the items? We don't necessarily need to change all of it in one day, but getting some feedback from you on any of the items can help bring that discussion back to life, and hopefully reach a consensus. As was said on this thread, persistence will eventually drive us to reach a consensus, so I'm being persistent .
        Hide
        Michael McCandless added a comment -

        OK, inspired by Uwe's persistence on LUCENE-1693, I realized one clean
        way to fix the back-compat break here is by using reflection when
        creating the Analyzer as to whether the class overrides the
        tokenStream method. Then, in reusableTokenStream we forcefully
        fallback to tokenStream, if it does.

        Attached patch, with a test case showing the issue, implements this
        approach, and it works well. With this approach there's no reason to
        deprecate tokenStream.

        Show
        Michael McCandless added a comment - OK, inspired by Uwe's persistence on LUCENE-1693 , I realized one clean way to fix the back-compat break here is by using reflection when creating the Analyzer as to whether the class overrides the tokenStream method. Then, in reusableTokenStream we forcefully fallback to tokenStream, if it does. Attached patch, with a test case showing the issue, implements this approach, and it works well. With this approach there's no reason to deprecate tokenStream.
        Hide
        Uwe Schindler added a comment -

        Your solution is also cool, to fix the last problems with the core token streams in LUCENE-1693: If somebody overrides a deprecated method in one of the core tokenstreams (that are not final), the method is never called, because the indexer uses incrementToken per default. The same can be used to fix this problem in TokenStream, too.

        I will prepare a patch for this (I am currently preparing a new patch with some tests and the solution for the problems with number of attribute instances may not be equals number of attributes).

        Show
        Uwe Schindler added a comment - Your solution is also cool, to fix the last problems with the core token streams in LUCENE-1693 : If somebody overrides a deprecated method in one of the core tokenstreams (that are not final), the method is never called, because the indexer uses incrementToken per default. The same can be used to fix this problem in TokenStream, too. I will prepare a patch for this (I am currently preparing a new patch with some tests and the solution for the problems with number of attribute instances may not be equals number of attributes).

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development