Issue Details (XML | Word | Printable)

Key: LUCENE-1151
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

Created: 25/Jan/08 11:02 AM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: Analysis
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1151.patch 2008-01-29 01:29 PM Michael McCandless 6 kB
Issue Links:
Reference
 

Lucene Fields: New
Resolution Date: 03/Feb/08 03:33 PM


 Description  « Hide
Coming out of the discussion around back compatibility, it seems best to default StandardAnalyzer to properly fix LUCENE-1068, while preserving the ability to get the back-compatible behavior in the rare event that it's desired.

This just means changing the replaceInvalidAcronym = false with = true, and, adding a clear entry to CHANGES.txt that this very slight non back compatible change took place.

Spinoff from here:

http://www.gossamer-threads.com/lists/lucene/java-dev/57517#57517

I'll commit that change in a day or two.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Michael McCandless added a comment - 29/Jan/08 01:29 PM
Attached patch that fixes the original bug (LUCENE-1068) by default, but offers system property & static method to keep backwards compatible yet buggy behavior.

I'll commit in a day or two.


Grant Ingersoll added a comment - 29/Jan/08 03:20 PM
Not necessarily related, but can you think of a way that we can keep WikipediaTokenizer and StandardTokenizer in sync for these kind of things. I guess I need to go look in JFlex to see if there is a way to do inheritance. Essentially, I want the WikiTokenizer to be StandardTokenizer plus handle the Wiki syntax appropriately.

Michael McCandless added a comment - 29/Jan/08 05:10 PM
Very good question ... I don't know. It would be awesome (and, amazing) if JFlex enabled some kind of inheritance.

Since WikipediaTokenizer doesn't have the backwards compat requirement of StandardTokenizer, you can presumably just fix ACRONYM in WikipediaTokenizer to not match host names (ie hardwire to "true")?


Grant Ingersoll added a comment - 29/Jan/08 07:19 PM

Jörg Prante added a comment - 30/Jan/08 03:24 PM
Hi Grant,

have you looked at JFlex %implements and %extends directives?

I have used %implements successfully in building my parsers for inheritance, where the Tokens are all constants in an interface generated not by JFlex but by a parser generator.

For example

%%
%class ECQLLexer
%implements ECQLTokens
%unicode
%integer
%eofval{ return 0; %eofval}
%line
%column

I am quite sure %extends could also be used to build a tokenizer family.

See http://jflex.de/manual.html


Mark Lassau added a comment - 05/Sep/08 06:36 AM
Michael,
Great work. I am glad we are moving to have the bug fixed by default, rather than the other way around.

Please indulge me a couple of small nitpicks before I get to my main point in another comment

  • Your comment above the static initializer is not correct:
      // Default to false (fixed the bug), unless the system prop is set
    

    should read:

      // Default to true (fixed the bug), unless the system prop is set
    
  • The re-use of the variable a in TestStandardAnalyzer.testDomainNames() does not really guarantee that you are testing the default behaviour of StandardAnalyzer.
    I would recommend resetting a in setUp(), or explicitly constructing it in test method.

Given that the code is "temporary" until v3.0, feel free to ignore me


Mark Lassau added a comment - 05/Sep/08 06:47 AM
I love the solution you have come up with, but would suggest that it is moved to StandardTokenizer instead of StandardAnalyzer.
StandardTokenizer is the class with the actual problem. Fixing it there would mean that everyone that uses StandardTokenizer gets a default fix, not just StandardAnalyzer.

For instance, see LUCENE-1373, where most of the contrib Analyzers still suffer the buggy behavior with no workaround available.
I think that moving your "defaulting logic" to the tokenizer would fix all these Analyzers in one fell swoop.

I would provide suggested patches, but I am just about to go on holidays for 3 weeks. Is there a planned release date for v2.3.3 or v2.4?


Mark Lassau added a comment - 05/Sep/08 08:30 AM
Added a patch to LUCENE-1373 which moves the logic introduced here from StandardAnalyzer to StandardTokenizer.

Michael McCandless added a comment - 06/Sep/08 10:57 AM

Please indulge me a couple of small nitpicks before I get to my main point in another comment

Thanks for catching these Mark – I'll commit a fix shortly.