Details
-
Task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
New, Patch Available
Description
to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, etc) we should simply merge snowball and analyzers, and replace the buggy impls with the proper snowball stemfilters.
Attachments
Attachments
- LUCENE-2226.patch
- 8 kB
- Robert Muir
- LUCENE-2226-javadocs.patch
- 2 kB
- Uwe Schindler
Activity
hello, i know this is an extremely simple change technically, but i would like to hear if anyone is against this move.
if no one objects i'd like to commit in a few days, to make progress on fixing some of these problems.
+1
However this is a very minor break in bw compat: drop in jar replacement. The user will have to delete the snowball jar and use the contrib/analyzer one, if not already done.
However this is a very minor break in bw compat: drop in jar replacement. The user will have to delete the snowball jar and use the contrib/analyzer one, if not already done.
what else can we do though other than move it? the packages are the same
Half of me agrees with your comment on LUCENE-2055 that we should have solid back compat for contrib analyzers.
The other half says that this compat is already acheived via the build mechanism: simply use the old jar files, and that any work we do for compat here is wasted.
what i am certain about is that this stuff isn't stable like core lucene and to clean it up, its going to take a long time if we restrict ourselves with the same policy that a stable library (core lucene) has.
Contribs back compat policy is that there is no back compat policy unless that contrib specifically states one.
Robert, I'm suggesting that you move it. But that in CHANGES.txt that you make it clear that part of the user's responsibility in upgrading is to delete the snowball jar. I've been bit too many times by having both the old jar and a new jar in the classpath. I know better but ....
I'd more or less agree with you that one could stay with old jars if it weren't for bug fixes. Some bug fixes, such as in LUCENE-2055, will partially invalidate an index, requiring it to be rebuilt to work as expected. I think these need to be done and should not require bw compat maintenance of the bug.
Mark, that is my understanding too. I wasn't commenting on the policy but on the fact of the possible breakage. I think it is a courtesy to notify users of a change to which they might need to pay attention. I don't know that's spelled out in the policy, but I think it should be. Not that a lack of notice is a guarantee of no breakage but that a notice is a guarantee of breakage (at least under some circumstances).
Is there any contrib that specifically states one? I couldn't find it. That said, the last release went to great lengths to preserve bw compat in some contribs, especially contrib/analyzers/common. This is attested to by the Version attribute. Is that the implicit statement of the policy? It is clear that Snowball has broken backward compatibility in the past. I presume it will in the future. The smartcn package is marked experimental. The analysis/common is not clear as it has the Version stuff.
Part of what Robert is saying (or at least what I am hearing): Snowball should be trusted, but contrib/common stemmers should not. If the latter had an implicit bw compat policy (in view of the Version usage), does that indicate that Snowball should inherit the bw compat? (I'm being a bit argumentative, I know. ATM, I don't think it is that important. But after all the dust settles and this i18n stuff is solid, I think it might be reasonable to make a stronger bw compat statement.)
Regarding the policy, I think it is good and give users confidence in the stability of Lucene and the longevity of their indexes. As I see it, the bw compat policy covers English well, western European ok, but other languages poorly. Other than English it is likely that one is using a contrib to do analysis. I think that ultimately it will be important for bw compat to cover non-English well. I might be wrong here, any large project is likely to index non-English texts and will need index level compatibility. Right now the only option is to use contrib as a copy source.
As I see it the bw compat policy covers several things:
1) index compatibilty across 2 major releases.
2) api compatibility within 1 major release.
3) drop in jar compatibility from one release to the next. Going to a x.0 release requires deprecations to be addressed.
I think these are in the order that I care about them. WRT 2), I really only care about semantic changes. As long as there is a good mechanism to identify what I need to do, I really don't care that the call interface changes when there is semantic equivalence.
WRT 3) I don't care at all.
And regarding 1), I'd rather have correctness and have bugs fixed, causing a bw break than preserve a "bad" index.
But that's just me. Just one.....
Mark, that is my understanding too. I wasn't commenting on the policy but on the fact of the possible breakage. I think it is a courtesy to notify users of a change to which they might need to pay attention. I don't know that's spelled out in the policy, but I think it should be. Not that a lack of notice is a guarantee of no breakage but that a notice is a guarantee of breakage (at least under some circumstances).
Right - I was just pointing out that jar drop in is far from a requirement in contrib. We do always try and play nice anyway.
Is there any contrib that specifically states one? I couldn't find it.
Don't think so - meaning there is no back compat policy in contrib - I think as a contrib matures, its up to those working on it to decide that its reached a state that deserves a policy of some kind. The Highlighter could probably use one at this point, but at the same time, nothing has created too much of an outcry at this point.
The analysis/common is not clear as it has the Version stuff.
Right - just because there is no policy doesn't mean we shouldn't make any attempts at back compat - but the issue you brought up is not something easily addressed, nor I think, large enough to worry about with the proper warning in Changes. Users should be wary of contrib on upgrading - unless it presents a strong back compat policy.
But after all the dust settles and this i18n stuff is solid, I think it might be reasonable to make a stronger bw compat statement.
I agree - now that contrib has been getting some much needed love recently, I think it should start heading towards some back compat promises - especially concerning analyzers. We already do tend to bend over backwards when we can anyway.
I think we are on the same page - I'm just not very worried about the break you mention - I think its a perfectly acceptable growing pain. And I think our back compat has been so week because contrib has been a bit of a wasteland in the past - no one was willing to take ownership of a lot of this stuff - especially the language analyzers. That has change recently. As the devs clean up and consolidate this stuff properly, I think we can work towards stronger promises in the future.
Hi DM, I wanted to give my opinion on some of your comments:
Robert, I'm suggesting that you move it. But that in CHANGES.txt that you make it clear that part of the user's responsibility in upgrading is to delete the snowball jar. I've been bit too many times by having both the old jar and a new jar in the classpath. I know better but ...
ok, I'll improve the wording in CHANGES.
I'd more or less agree with you that one could stay with old jars if it weren't for bug fixes. Some bug fixes, such as in
LUCENE-2055, will partially invalidate an index, requiring it to be rebuilt to work as expected. I think these need to be done and should not require bw compat maintenance of the bug.
ok, we can discuss this on that issue when the time comes. personally I am for "fixing the bug" which means for this case using snowball instead. i'd like to remove the old cruft, maybe we decide we should keep it around with Version for a while, but not too long, i think 2 major releases is too long in this case, since lucene major releases don't seem to happen that often.
Part of what Robert is saying (or at least what I am hearing): Snowball should be trusted, but contrib/common stemmers should not.
no, I am not trying to imply this really. Snowball has its own problems (Karl Wettin and I both reported problems to their mailing list recently), but for these languages, its better to simply use snowball itself rather than some code that tries to implement one of its algorithms, but doesn't quite do it right.
I think that ultimately it will be important for bw compat to cover non-English well.
I agree, and we can continue to do this. but ultimately is a long time away, contrib/analyzers still needs some work, and even then, for non-english some stuff is simply outside of our control, i.e. Myanmar unicode model changing between 5.1 and 5.2, things like that.
but for this change, we simply need to move things from one contrib to another, the packages are left unchanged.
If i knew of an easy way to 'move things from one jar to another and preserve drop-in jar back compat', i would do it. But i think this concept doesn't even make sense for what we are doing here?
But i think this concept doesn't even make sense for what we are doing here?
I agree. A comment in CHANGES should be sufficient
Here the fix for the javadocs-all target after hudson build failure.
Committed revision: 901576
this patch simply moves snowball functionality into contrib/analyzers. it doesn't try to fix bugs or remove duplication yet.
no packages are changed so it doesnt cause any breaks.
before applying patch, run the following: