Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Due to the use of CharArraySet by StopFilter, one can no longer efficiently pre-create a Set for use by future StopFilter instances.

      1. CharArraySet.patch
        14 kB
        Yonik Seeley
      2. CharArraySet.take2.patch
        16 kB
        Michael McCandless

        Activity

        Hide
        Yonik Seeley added a comment -

        Attaching patch. Here's the approach I took:

        • change makeStopSet return a CharArraySet
        • make CharArraySet a real Set (plus many other small changes)
        • the Iterator returns Strings for backward compatibility with people that
          expect a Set<String>
        • StopFilter constructor to directly use Set if instanceof CharArraySet

        All tests pass.

        Show
        Yonik Seeley added a comment - Attaching patch. Here's the approach I took: change makeStopSet return a CharArraySet make CharArraySet a real Set (plus many other small changes) the Iterator returns Strings for backward compatibility with people that expect a Set<String> StopFilter constructor to directly use Set if instanceof CharArraySet All tests pass.
        Hide
        Hoss Man added a comment -

        If the StopFilter constructor that takes in a Set no longer needs the "boolean ignoreCase" we should probably: deprecate that constructor, document that ignoreCase is ignoredand the Set must only contain lowercase items, add a new Set based constructor without that param.

        Show
        Hoss Man added a comment - If the StopFilter constructor that takes in a Set no longer needs the "boolean ignoreCase" we should probably: deprecate that constructor, document that ignoreCase is ignoredand the Set must only contain lowercase items, add a new Set based constructor without that param.
        Hide
        Michael McCandless added a comment -

        Woops, you're right, sorry about that, and thanks for fixing it Yonik!

        I made a few small changes to your patch (attached
        CharArraySet.take2.patch): there were some problems with rehashing, so
        I also added a new unit test (don't forget to "svn add" before
        committing!). All tests pass.

        I noticed that in your patch you now assume that String.hashCode() is
        always as described in Sun's javadocs, but that makes me a little
        nervous, eg:

        http://forum.java.sun.com/thread.jspa?threadID=762602&messageID=4351604

        Maybe for the String case we should continue computing hashCode
        ourselves?

        Show
        Michael McCandless added a comment - Woops, you're right, sorry about that, and thanks for fixing it Yonik! I made a few small changes to your patch (attached CharArraySet.take2.patch): there were some problems with rehashing, so I also added a new unit test (don't forget to "svn add" before committing!). All tests pass. I noticed that in your patch you now assume that String.hashCode() is always as described in Sun's javadocs, but that makes me a little nervous, eg: http://forum.java.sun.com/thread.jspa?threadID=762602&messageID=4351604 Maybe for the String case we should continue computing hashCode ourselves?
        Hide
        Yonik Seeley added a comment -

        > I noticed that in your patch you now assume that String.hashCode() is always as described in Sun's javadocs

        The fact that it's in their javadocs seems to make it part of the interface though.
        Using String.hash (for cached strings - hashCode already calculated) resulted in a 15% lookup improvement.

        Side note: looking up char[] was faster in any event... probably because the extra bounds checking code for every String.charAt()

        On the other hand... not being tied to String.hashCode() means we could select a faster method if we chose to do so.

        Show
        Yonik Seeley added a comment - > I noticed that in your patch you now assume that String.hashCode() is always as described in Sun's javadocs The fact that it's in their javadocs seems to make it part of the interface though. Using String.hash (for cached strings - hashCode already calculated) resulted in a 15% lookup improvement. Side note: looking up char[] was faster in any event... probably because the extra bounds checking code for every String.charAt() On the other hand... not being tied to String.hashCode() means we could select a faster method if we chose to do so.
        Hide
        Yonik Seeley added a comment -

        > If the StopFilter constructor that takes in a Set no longer needs the "boolean ignoreCase"

        But it does if the Set is not a CharArraySet.
        Docs should be clearer though that ignoreCase is ignored if passing a CharArraySet (or is there a more sane way?)

        Show
        Yonik Seeley added a comment - > If the StopFilter constructor that takes in a Set no longer needs the "boolean ignoreCase" But it does if the Set is not a CharArraySet. Docs should be clearer though that ignoreCase is ignored if passing a CharArraySet (or is there a more sane way?)
        Hide
        Hoss Man added a comment -

        > But it does if the Set is not a CharArraySet.
        > Docs should be clearer though that ignoreCase is ignored if passing a CharArraySet (or is there a more sane way?)

        Hmmm... actually .. does making the CharArraySet capable of doing case insensitive comparisons over complicate things? perhaps the CharArraySet should just be a set of strings that supports fast lookup lookup of strings, and StopFilter.next should be responsible for lowercasing the terms before doing the set lookup (and makeStopSet should be responsible for lowercasing hte tersm before putting them in the set). it would probably be good to change the sigs to something like StopFilter(TokenStream s, Set stopWords, boolean lowerCaseBeforeLookup) and makeStopSet(String[] stopWords, boolean lowerCaseBeforeAdding) so it was clear what expectations StopFilter has on the set if people make one from scratch.

        this is kind of an orthogonal API discussion to the CharArraySet issue though ... the same arguments could be made about the 2.2 instance of StopFilter ... i just bring it up as a potential point of confusion since people could expect these two work the same way, and they wont (unless i'm missing something)...

        Set a = ...; // something with lots of mixed case words
        Set b = new CharArraySet(a, false);
        StopFilter aaa = new StopFilter(stream, a, true)
        StopFilter bbb = new StopFilter(stream, b, true)

        Show
        Hoss Man added a comment - > But it does if the Set is not a CharArraySet. > Docs should be clearer though that ignoreCase is ignored if passing a CharArraySet (or is there a more sane way?) Hmmm... actually .. does making the CharArraySet capable of doing case insensitive comparisons over complicate things? perhaps the CharArraySet should just be a set of strings that supports fast lookup lookup of strings, and StopFilter.next should be responsible for lowercasing the terms before doing the set lookup (and makeStopSet should be responsible for lowercasing hte tersm before putting them in the set). it would probably be good to change the sigs to something like StopFilter(TokenStream s, Set stopWords, boolean lowerCaseBeforeLookup) and makeStopSet(String[] stopWords, boolean lowerCaseBeforeAdding) so it was clear what expectations StopFilter has on the set if people make one from scratch. this is kind of an orthogonal API discussion to the CharArraySet issue though ... the same arguments could be made about the 2.2 instance of StopFilter ... i just bring it up as a potential point of confusion since people could expect these two work the same way, and they wont (unless i'm missing something)... Set a = ...; // something with lots of mixed case words Set b = new CharArraySet(a, false); StopFilter aaa = new StopFilter(stream, a, true) StopFilter bbb = new StopFilter(stream, b, true)
        Hide
        Yonik Seeley added a comment -

        > and StopFilter.next should be responsible for lowercasing the terms before doing the set lookup

        That would force the user (StopFilter) to make a copy since it wouldn't want to touch the original.

        Show
        Yonik Seeley added a comment - > and StopFilter.next should be responsible for lowercasing the terms before doing the set lookup That would force the user (StopFilter) to make a copy since it wouldn't want to touch the original.
        Hide
        Hoss Man added a comment -

        Hmmm... i was going to say "would one String.toLowerCase() in StopFilter.next really be any worse then the Character.toLowerCase() that are currently done in CharArraySet?" but i suppose it might be in the typical stopword cases where most words on stop words (so the character based approach can short circuit early (not that i've actually benchmarked it).

        i suppose it's best to leave the API as it is ... but document the hell out of when/how ignoreCase is significant .. make it clear that the value passed to the StopFilter constructor must be consistent with the value passed to makeStopSet if both are used (actually: we could assert that in the StopFilter constructor if we add a CharArraySet.getIgnoreCase() method)

        Show
        Hoss Man added a comment - Hmmm... i was going to say "would one String.toLowerCase() in StopFilter.next really be any worse then the Character.toLowerCase() that are currently done in CharArraySet?" but i suppose it might be in the typical stopword cases where most words on stop words (so the character based approach can short circuit early (not that i've actually benchmarked it). i suppose it's best to leave the API as it is ... but document the hell out of when/how ignoreCase is significant .. make it clear that the value passed to the StopFilter constructor must be consistent with the value passed to makeStopSet if both are used (actually: we could assert that in the StopFilter constructor if we add a CharArraySet.getIgnoreCase() method)
        Hide
        Michael McCandless added a comment -

        Yonik, I think you missed my proposed update to your original patch, here?

        https://issues.apache.org/jira/browse/LUCENE-1040#action_12539319

        EG, there are some problems with the changes to rehash (and I added a unit-test to expose them).

        Show
        Michael McCandless added a comment - Yonik, I think you missed my proposed update to your original patch, here? https://issues.apache.org/jira/browse/LUCENE-1040#action_12539319 EG, there are some problems with the changes to rehash (and I added a unit-test to expose them).
        Hide
        Yonik Seeley added a comment -

        Indeed... thanks for catching that!

        Show
        Yonik Seeley added a comment - Indeed... thanks for catching that!
        Hide
        Grant Ingersoll added a comment -

        Appears to have been committed.

        Show
        Grant Ingersoll added a comment - Appears to have been committed.

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development