Lucene - Core
  1. Lucene - Core
  2. LUCENE-1151

Fix StandardAnalyzer to not mis-identify HOST as ACRONYM by default

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Coming out of the discussion around back compatibility, it seems best to default StandardAnalyzer to properly fix LUCENE-1068, while preserving the ability to get the back-compatible behavior in the rare event that it's desired.

      This just means changing the replaceInvalidAcronym = false with = true, and, adding a clear entry to CHANGES.txt that this very slight non back compatible change took place.

      Spinoff from here:

      http://www.gossamer-threads.com/lists/lucene/java-dev/57517#57517

      I'll commit that change in a day or two.

      1. LUCENE-1151.patch
        6 kB
        Michael McCandless

        Issue Links

          Activity

          Michael McCandless created issue -
          Hide
          Michael McCandless added a comment -

          Attached patch that fixes the original bug (LUCENE-1068) by default, but offers system property & static method to keep backwards compatible yet buggy behavior.

          I'll commit in a day or two.

          Show
          Michael McCandless added a comment - Attached patch that fixes the original bug ( LUCENE-1068 ) by default, but offers system property & static method to keep backwards compatible yet buggy behavior. I'll commit in a day or two.
          Michael McCandless made changes -
          Field Original Value New Value
          Attachment LUCENE-1151.patch [ 12374259 ]
          Hide
          Grant Ingersoll added a comment -

          Not necessarily related, but can you think of a way that we can keep WikipediaTokenizer and StandardTokenizer in sync for these kind of things. I guess I need to go look in JFlex to see if there is a way to do inheritance. Essentially, I want the WikiTokenizer to be StandardTokenizer plus handle the Wiki syntax appropriately.

          Show
          Grant Ingersoll added a comment - Not necessarily related, but can you think of a way that we can keep WikipediaTokenizer and StandardTokenizer in sync for these kind of things. I guess I need to go look in JFlex to see if there is a way to do inheritance. Essentially, I want the WikiTokenizer to be StandardTokenizer plus handle the Wiki syntax appropriately.
          Hide
          Michael McCandless added a comment -

          Very good question ... I don't know. It would be awesome (and, amazing) if JFlex enabled some kind of inheritance.

          Since WikipediaTokenizer doesn't have the backwards compat requirement of StandardTokenizer, you can presumably just fix ACRONYM in WikipediaTokenizer to not match host names (ie hardwire to "true")?

          Show
          Michael McCandless added a comment - Very good question ... I don't know. It would be awesome (and, amazing) if JFlex enabled some kind of inheritance. Since WikipediaTokenizer doesn't have the backwards compat requirement of StandardTokenizer, you can presumably just fix ACRONYM in WikipediaTokenizer to not match host names (ie hardwire to "true")?
          Hide
          Grant Ingersoll added a comment -
          Show
          Grant Ingersoll added a comment - Here's the thread on JFlex for completeness, not that it it effects this patch: http://sourceforge.net/mailarchive/forum.php?thread_name=272037D7-6EA1-4D19-902F-B425A5309C2A%40apache.org&forum_name=jflex-users
          Hide
          Jörg Prante added a comment -

          Hi Grant,

          have you looked at JFlex %implements and %extends directives?

          I have used %implements successfully in building my parsers for inheritance, where the Tokens are all constants in an interface generated not by JFlex but by a parser generator.

          For example

          %%
          %class ECQLLexer
          %implements ECQLTokens
          %unicode
          %integer
          %eofval

          { return 0; %eofval}

          %line
          %column

          I am quite sure %extends could also be used to build a tokenizer family.

          See http://jflex.de/manual.html

          Show
          Jörg Prante added a comment - Hi Grant, have you looked at JFlex %implements and %extends directives? I have used %implements successfully in building my parsers for inheritance, where the Tokens are all constants in an interface generated not by JFlex but by a parser generator. For example %% %class ECQLLexer %implements ECQLTokens %unicode %integer %eofval { return 0; %eofval} %line %column I am quite sure %extends could also be used to build a tokenizer family. See http://jflex.de/manual.html
          Michael McCandless committed 618001 (3 files)
          Reviews: none

          LUCENE-1151: don't mis-identify HOST as ACRONYM, but, provide static method/property to revert to backwards-compatible but buggy behavior

          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Mark Lassau made changes -
          Link This issue is related to LUCENE-1068 [ LUCENE-1068 ]
          Mark Lassau made changes -
          Link This issue is related to LUCENE-1373 [ LUCENE-1373 ]
          Hide
          Mark Lassau added a comment -

          Michael,
          Great work. I am glad we are moving to have the bug fixed by default, rather than the other way around.

          Please indulge me a couple of small nitpicks before I get to my main point in another comment

          • Your comment above the static initializer is not correct:
              // Default to false (fixed the bug), unless the system prop is set
            

            should read:

              // Default to true (fixed the bug), unless the system prop is set
            
          • The re-use of the variable a in TestStandardAnalyzer.testDomainNames() does not really guarantee that you are testing the default behaviour of StandardAnalyzer.
            I would recommend resetting a in setUp(), or explicitly constructing it in test method.

          Given that the code is "temporary" until v3.0, feel free to ignore me

          Show
          Mark Lassau added a comment - Michael, Great work. I am glad we are moving to have the bug fixed by default, rather than the other way around. Please indulge me a couple of small nitpicks before I get to my main point in another comment Your comment above the static initializer is not correct: // Default to false (fixed the bug), unless the system prop is set should read: // Default to true (fixed the bug), unless the system prop is set The re-use of the variable a in TestStandardAnalyzer.testDomainNames() does not really guarantee that you are testing the default behaviour of StandardAnalyzer. I would recommend resetting a in setUp(), or explicitly constructing it in test method. Given that the code is "temporary" until v3.0, feel free to ignore me
          Hide
          Mark Lassau added a comment -

          I love the solution you have come up with, but would suggest that it is moved to StandardTokenizer instead of StandardAnalyzer.
          StandardTokenizer is the class with the actual problem. Fixing it there would mean that everyone that uses StandardTokenizer gets a default fix, not just StandardAnalyzer.

          For instance, see LUCENE-1373, where most of the contrib Analyzers still suffer the buggy behavior with no workaround available.
          I think that moving your "defaulting logic" to the tokenizer would fix all these Analyzers in one fell swoop.

          I would provide suggested patches, but I am just about to go on holidays for 3 weeks. Is there a planned release date for v2.3.3 or v2.4?

          Show
          Mark Lassau added a comment - I love the solution you have come up with, but would suggest that it is moved to StandardTokenizer instead of StandardAnalyzer. StandardTokenizer is the class with the actual problem. Fixing it there would mean that everyone that uses StandardTokenizer gets a default fix, not just StandardAnalyzer. For instance, see LUCENE-1373 , where most of the contrib Analyzers still suffer the buggy behavior with no workaround available. I think that moving your "defaulting logic" to the tokenizer would fix all these Analyzers in one fell swoop. I would provide suggested patches, but I am just about to go on holidays for 3 weeks. Is there a planned release date for v2.3.3 or v2.4?
          Hide
          Mark Lassau added a comment -

          Added a patch to LUCENE-1373 which moves the logic introduced here from StandardAnalyzer to StandardTokenizer.

          Show
          Mark Lassau added a comment - Added a patch to LUCENE-1373 which moves the logic introduced here from StandardAnalyzer to StandardTokenizer.
          Hide
          Michael McCandless added a comment -

          Please indulge me a couple of small nitpicks before I get to my main point in another comment

          Thanks for catching these Mark – I'll commit a fix shortly.

          Show
          Michael McCandless added a comment - Please indulge me a couple of small nitpicks before I get to my main point in another comment Thanks for catching these Mark – I'll commit a fix shortly.
          Michael McCandless made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12422028 ] Default workflow, editable Closed status [ 12563029 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563029 ] jira [ 12583960 ]

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development