Lucene - Core: LUCENE-3765

trappy ignoreCase behavior with StopFilter/ignoreCase

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Spinoff from LUCENE-3751:

      * If <code>stopWords</code> is an instance of {@link CharArraySet} (true if
      * <code>makeStopSet()</code> was used to construct the set) it will be
      * directly used and <code>ignoreCase</code> will be ignored since
      * <code>CharArraySet</code> directly controls case sensitivity.
      

      This is really confusing and trappy... we need to change something here.
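
The trap in the quoted Javadoc can be reproduced with a small self-contained sketch. This is not Lucene code: `CaseAwareSet` and `StopFilterSketch` are hypothetical stand-ins for `CharArraySet` and the old `StopFilter` constructor, written only to mirror the behavior the Javadoc describes (a pre-built case-aware set is used directly and the `ignoreCase` argument is dropped).

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import java.util.Set;

final class StopSetTrapDemo {

    /** Hypothetical stand-in for CharArraySet: the set itself fixes case sensitivity. */
    static final class CaseAwareSet extends AbstractSet<String> {
        private final Set<String> words = new HashSet<>();
        private final boolean ignoreCase;

        CaseAwareSet(Collection<?> source, boolean ignoreCase) {
            this.ignoreCase = ignoreCase;
            for (Object w : source) {
                words.add(normalize(w.toString()));
            }
        }

        @Override public boolean contains(Object o) {
            return words.contains(normalize(o.toString()));
        }

        @Override public Iterator<String> iterator() { return words.iterator(); }

        @Override public int size() { return words.size(); }

        private String normalize(String s) {
            return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
        }
    }

    /** Hypothetical stand-in for the old StopFilter(Set<?>, boolean) constructor logic. */
    static final class StopFilterSketch {
        private final CaseAwareSet stopWords;

        StopFilterSketch(Set<?> stopWords, boolean ignoreCase) {
            // A pre-built case-aware set is used as-is, so ignoreCase is silently dropped.
            this.stopWords = (stopWords instanceof CaseAwareSet)
                ? (CaseAwareSet) stopWords
                : new CaseAwareSet(stopWords, ignoreCase);
        }

        boolean isStopWord(String term) { return stopWords.contains(term); }
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("The", "And"));
        CaseAwareSet caseSensitive = new CaseAwareSet(plain, false);

        // Same ignoreCase=true argument, different results depending on the set's runtime type.
        System.out.println(new StopFilterSketch(plain, true).isStopWord("the"));         // true
        System.out.println(new StopFilterSketch(caseSensitive, true).isStopWord("the")); // false
    }
}
```

Both calls pass `ignoreCase = true`, yet only the plain `HashSet` honors it, which is the surprise this issue is about.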

      1. LUCENE-3765.patch
        156 kB
        Robert Muir
      2. LUCENE-3765.patch
        145 kB
        Robert Muir
      3. LUCENE-3765.patch
        9 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        updated patch.

        Robert Muir added a comment -

        I found a couple more Set<?> usages remaining (ElisionFilter, Solr factories). I'll update the patch to fix those too.

        Steve Rowe added a comment -

        "Standard/ClassicAnalyzer had a ctor that takes File; I think we should deprecate this one in favor of the one that takes Reader."

        +1

        Robert Muir added a comment -

        Updated patch for trunk.

        I found two traps/bugs and fixed them here as well (these will go in the backport too along with the StopFilter deprecations):

        • DutchAnalyzer confusingly only used its default 'stem dictionary' (e.g. kind/kinder, fiets) for the no-arg ctor; for the other ctors, it would remain empty. This means stemming would be different if you passed an empty stopset.
        • Standard/ClassicAnalyzer had a ctor that takes File; I think we should deprecate this one in favor of the one that takes Reader.
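
The DutchAnalyzer trap described in the first bullet is a general constructor pitfall, and a minimal sketch may make it concrete. Everything here is hypothetical: `TrappyAnalyzer` and `SaferAnalyzer` are not Lucene classes, the dictionary and stop word entries are illustrative only, and `stem` applies dictionary overrides without doing any real stemming. The constructor chaining in `SaferAnalyzer` is one plausible way to avoid the problem, not necessarily what the patch does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class StemDictTrapDemo {

    // Illustrative defaults only; not the analyzer's real data.
    static final Set<String> DEFAULT_STOP_WORDS =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("de", "het")));
    static final Map<String, String> DEFAULT_STEM_DICT = defaultStemDict();

    static Map<String, String> defaultStemDict() {
        Map<String, String> m = new HashMap<>();
        m.put("kind", "kinder");
        m.put("fiets", "fiets");
        return Collections.unmodifiableMap(m);
    }

    /** Reproduces the trap: only the no-arg ctor fills the stem dictionary. */
    static final class TrappyAnalyzer {
        private final Set<String> stopWords;
        private final Map<String, String> stemDict = new HashMap<>();

        TrappyAnalyzer() {
            this.stopWords = DEFAULT_STOP_WORDS;
            this.stemDict.putAll(DEFAULT_STEM_DICT);
        }

        TrappyAnalyzer(Set<String> stopWords) {
            this.stopWords = stopWords; // stemDict stays empty here
        }

        boolean isStopWord(String term) { return stopWords.contains(term); }

        String stem(String term) { return stemDict.getOrDefault(term, term); }
    }

    /** Every ctor funnels into one that receives the dictionary explicitly. */
    static final class SaferAnalyzer {
        private final Set<String> stopWords;
        private final Map<String, String> stemDict;

        SaferAnalyzer() { this(DEFAULT_STOP_WORDS); }

        SaferAnalyzer(Set<String> stopWords) { this(stopWords, DEFAULT_STEM_DICT); }

        SaferAnalyzer(Set<String> stopWords, Map<String, String> stemDict) {
            this.stopWords = stopWords;
            this.stemDict = new HashMap<>(stemDict);
        }

        boolean isStopWord(String term) { return stopWords.contains(term); }

        String stem(String term) { return stemDict.getOrDefault(term, term); }
    }

    public static void main(String[] args) {
        Set<String> empty = Collections.emptySet();
        System.out.println(new TrappyAnalyzer().stem("kind"));      // kinder
        System.out.println(new TrappyAnalyzer(empty).stem("kind")); // kind
        System.out.println(new SaferAnalyzer(empty).stem("kind"));  // kinder
    }
}
```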
        Uwe Schindler added a comment -

        +1 to remove the Set<?> and hardcode method signatures to CAS (CharArraySet).

        Changes on CAS should be separate (e.g. make it an interface, so we could have FSTCharArraySet and HashCharArraySet).

        Steve Rowe added a comment -

        +1 to removing Set<?>/Set<Object> in favor of CharArraySet, too.

        Steve Rowe added a comment -

        +1

        Robert Muir added a comment -

        Also, for 4.0 I think we should go a step further and remove all this Set<?>/Set<Object> crap/instanceof/copying.

        Instead, StopFilter etc. should just take CharArraySet, and this is what makeStopSet should return.

        I'll update the patch. For 3.x we can just deprecate the two confusing ctors that the first patch nuked,
        so we can still make some improvement there.
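
The shape of this proposal can be sketched as follows. This is an assumption-laden illustration, not the patch itself: `CaseAwareSet` stands in for `CharArraySet`, and the `makeStopSet` and `StopFilterSketch` signatures here are hypothetical and do not match Lucene's real ones. The point is only that once the filter accepts the concrete set type, there is no `instanceof` check, no copying, and no second `ignoreCase` flag that could disagree with the set.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TypedStopSetDemo {

    /** Hypothetical stand-in for CharArraySet. */
    static final class CaseAwareSet {
        private final Set<String> words = new HashSet<>();
        private final boolean ignoreCase;

        CaseAwareSet(boolean ignoreCase) { this.ignoreCase = ignoreCase; }

        void add(String word) { words.add(normalize(word)); }

        boolean contains(String term) { return words.contains(normalize(term)); }

        private String normalize(String s) {
            return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
        }
    }

    /** Hypothetical makeStopSet: case sensitivity is chosen once, and the concrete type is returned. */
    static CaseAwareSet makeStopSet(boolean ignoreCase, String... words) {
        CaseAwareSet set = new CaseAwareSet(ignoreCase);
        for (String w : words) {
            set.add(w);
        }
        return set;
    }

    /** Hypothetical filter that only accepts the concrete set type and has no flag of its own. */
    static final class StopFilterSketch {
        private final CaseAwareSet stopWords;

        StopFilterSketch(CaseAwareSet stopWords) { this.stopWords = stopWords; }

        boolean isStopWord(String term) { return stopWords.contains(term); }
    }

    public static void main(String[] args) {
        StopFilterSketch insensitive = new StopFilterSketch(makeStopSet(true, "The", "And"));
        StopFilterSketch sensitive = new StopFilterSketch(makeStopSet(false, "The", "And"));

        System.out.println(insensitive.isStopWord("the")); // true
        System.out.println(sensitive.isStopWord("the"));   // false
    }
}
```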

        Robert Muir added a comment -

        Proposed patch: I think it's the simplest solution: nuke the confusing ctors.


          People

          • Assignee: Unassigned
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 2
