This is the result of re-compiling the JFlex fixed file. Not sure how useful this patch is, but I'm attaching it anyway.
I've found a way to do it (I think):
I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs, and fixed the current ACRONYM to identify proper ones. I also marked ACRONYM_DEP as deprecated. I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM, in case there are applications that count on the Token type.

I wrote these 4 lines of code to verify it works:

public static void parse(String text) throws Exception {

The previous patch I posted was incorrect, since it would still break existing applications. The current patch does the following:
1. Introduces a new type, ACRONYM_DEP, which is deprecated and recognizes the old ACRONYM format.
2. Fixes ACRONYM to recognize LETTER + "." (LETTER + ".")+.
3. Adds a public member, replaceDepAcronym, to StandardTokenizer and StandardAnalyzer, which can be set if the application would like the deprecated acronym format to be treated as ACRONYM or HOST. The default behavior, if not set, is to recognize the old ACRONYM as HOST.

This is how it should be used:

public static void parse(String text, boolean replaceDepAcronym) throws Exception {

The member is marked deprecated so we can remove it in the next release. Applications that would like the new behavior need to do nothing, and therefore will not be impacted once we remove that member. Applications that want the old behavior need to explicitly set it and, in the next major release, remove it. I think that solves it. How should I proceed?

Hi Shai,
Thanks for the patch. Can you please add unit tests in TestStandardAnalyzer? Also, if you run svn diff in the Lucene directory, it will generate a patch that doesn't need to be modified (your patch has references to D:/ etc.).

Hi Grant,
I used Eclipse to generate the patch (right-click on
– Shai Erera

Hmmm, maybe there is a way in Eclipse to make the path relative to the working directory? Otherwise, from the command line in the Lucene directory:

svn diff > StandardTokenizer-4.patch

-Grant

Code fixes under java and test packages. This should be applied under "src"
Doesn't this mean it is an API change if we make the new behavior the default? Apps that upgrade will see the new behavior unless they set replaceDepAcronym. To be fully backwards compatible, I think this patch should use the old behavior as the default. Then in 3.0 we can make the new behavior the default.

Changed the default behavior to match the current behavior. Applications that want to use the new definitions of HOST and ACRONYM should set StandardAnalyzer.replaceDepAcronym = true.
Applied patch. Updated some documentation. Changed it to use a private boolean along with getters and setters, plus added some new constructors. All of these should be deprecated and marked as being removed in 3.x.
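The default discussed above (old behavior unless the application opts in) can be sketched as follows. This is only an illustration of the remapping logic, not the committed Lucene code: the class name, the enum, and the getter/setter names are hypothetical, and the real tokenizer derives the raw type from the JFlex grammar rather than taking it as an argument.

```java
// Minimal sketch of the type remapping, assuming a private flag with
// accessors as described in the thread. All names here are illustrative.
final class DepAcronymSketch {
    enum TokenType { ACRONYM, ACRONYM_DEP, HOST, NUM, ALPHANUM }

    // false keeps the pre-patch behavior: old-style acronyms stay ACRONYM.
    private boolean replaceDepAcronym = false;

    boolean isReplaceDepAcronym() {
        return replaceDepAcronym;
    }

    void setReplaceDepAcronym(boolean replaceDepAcronym) {
        this.replaceDepAcronym = replaceDepAcronym;
    }

    // Maps the type produced by the grammar to the type reported to callers.
    TokenType reportedType(TokenType raw) {
        if (raw != TokenType.ACRONYM_DEP) {
            return raw; // only the deprecated type is ever rewritten
        }
        return replaceDepAcronym ? TokenType.HOST : TokenType.ACRONYM;
    }

    public static void main(String[] args) {
        DepAcronymSketch sketch = new DepAcronymSketch();
        System.out.println(sketch.reportedType(TokenType.ACRONYM_DEP));
        sketch.setReplaceDepAcronym(true);
        System.out.println(sketch.reportedType(TokenType.ACRONYM_DEP));
    }
}
```

With the flag left at its default, an application that relies on the old token type sees no change after upgrading; opting in switches old-style matches to HOST.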
I will apply patch tomorrow or Friday unless I hear objections.

StandardTokenizer also incorrectly marks numbers as HOST.
For example, on line 108 of TestStandardAnalyzer, the type of 21.35 is HOST when I think it should be NUM.

Even if you run testNumeric() on the trunk version, it recognizes "21.35" as HOST and not NUM ... The problem is that HOST is configured to recognize letters or digits. I'll check if there's a way to define precedence in JFlex, i.e., first detect NUM, then HOST (as every floating-point number also matches HOST).
Another option would be to set HOST to detect series of xxx.yyy.(zzz .)+, meaning aaa.bbb won't be a HOST, but aaa.bbb.ccc will be. Do you see any problem with that? Are you aware of hosts that are of the form aa.bb? Maybe this is a separate issue?
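To make the overlap concrete, here is a rough sketch using java.util.regex rather than JFlex. The two patterns are assumptions: HOST_CURRENT approximates "two or more dot-separated segments of ASCII letters or digits", and HOST_PROPOSED reads the suggestion above as "at least three segments". The real grammar's character classes are broader, so treat this only as a way to see which inputs change.

```java
import java.util.regex.Pattern;

// Illustrative approximation of the HOST rule; not the JFlex definitions.
final class HostVersusNum {
    // Assumed current shape: seg(.seg)+ with alphanumeric segments.
    static final Pattern HOST_CURRENT =
        Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    // Assumed proposed shape: at least three segments.
    static final Pattern HOST_PROPOSED =
        Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+){2,}");

    static boolean isHostCurrent(String s) {
        return HOST_CURRENT.matcher(s).matches();
    }

    static boolean isHostProposed(String s) {
        return HOST_PROPOSED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"21.35", "aaa.bbb", "aaa.bbb.ccc", "192.168.0.1"};
        for (String s : samples) {
            System.out.println(s + " current=" + isHostCurrent(s)
                + " proposed=" + isHostProposed(s));
        }
    }
}
```

Under these assumptions the three-segment rule stops 21.35 from matching HOST, but it also drops two-segment names such as aaa.bbb, and a dotted-quad IP address still matches HOST.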
Notice that IP addresses are also recognized as HOST; however, the StandardTokenizerImpl.jflex documentation specifies they should be recognized as NUM:

    // floating point, serial, model numbers, ip addresses, etc.
    // every other segment must have at least one digit
    NUM = ({ALPHANUM} {P} {HAS_DIGIT} | {HAS_DIGIT} {P} {ALPHANUM}
Let's commit this patch, and move the floating point issue to later.
ACRONYM = {ALPHA} "." ({ALPHA} ".")+
with
ACRONYM = {LETTER} "." ({LETTER} ".")+
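
The difference between the two ACRONYM definitions above can be checked with a small regex sketch. It assumes that {LETTER} is a single letter and {ALPHA} is a run of one or more letters, and it uses \p{L} as a stand-in for the grammar's letter class; the class name and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

// Regex approximations of the old and new ACRONYM rules (assumptions noted above).
final class AcronymPatterns {
    // Old rule: {ALPHA} "." ({ALPHA} ".")+  with ALPHA taken as one or more letters.
    static final Pattern OLD_ACRONYM =
        Pattern.compile("\\p{L}+\\.(?:\\p{L}+\\.)+");

    // New rule: {LETTER} "." ({LETTER} ".")+  with LETTER taken as exactly one letter.
    static final Pattern NEW_ACRONYM =
        Pattern.compile("\\p{L}\\.(?:\\p{L}\\.)+");

    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a real acronym and a host name with a trailing dot.
        for (String s : new String[] {"I.B.M.", "lucene.apache.org."}) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions, both rules accept I.B.M., while only the old rule accepts a multi-letter, dot-terminated name such as lucene.apache.org., which is the kind of input ACRONYM_DEP is meant to keep catching.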