  1. Lucene - Core
  2. LUCENE-1688

Deprecating StopAnalyzer ENGLISH_STOP_WORDS - General replacement with an immutable Set

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      StopAnalyzer and StandardAnalyzer use the static final array ENGLISH_STOP_WORDS by default in various places. Internally this array is converted into a mutable set, which looks kind of weird to me.
      I think the way to go is to deprecate all uses of the static final array and replace it with an immutable implementation of CharArraySet. Inside an analyzer it does not make sense to have a mutable set anyway, and we could avoid creating a new set each time an analyzer is created. With an immutable set we won't have multithreading issues either.
      In essence, we get rid of a fair bit of "converting string array to set" code, no longer have a public static reference to an array (whose elements are mutable), and reduce the overhead of analyzer creation.
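
      To make the proposal concrete, here is a minimal Java sketch of the difference. It does not use Lucene itself: Collections.unmodifiableSet stands in for an immutable CharArraySet, the names StopWordsSketch, ENGLISH_STOP_WORDS_SET and perInstanceSet are made up for illustration, and the word list is shortened.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class StopWordsSketch {
    // Old style: a static final array. The reference is final,
    // but any caller can still overwrite individual elements.
    static final String[] ENGLISH_STOP_WORDS = {"a", "an", "and", "the"};

    // Proposed style: build the set once and share a read-only view.
    // Assumption: Collections.unmodifiableSet stands in for an
    // immutable CharArraySet here.
    static final Set<String> ENGLISH_STOP_WORDS_SET =
        Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("a", "an", "and", "the")));

    // Old style: each analyzer instance converts the array
    // into its own fresh, mutable set.
    static Set<String> perInstanceSet() {
        return new HashSet<>(Arrays.asList(ENGLISH_STOP_WORDS));
    }

    public static void main(String[] args) {
        // The array contents are unprotected: this compiles and runs.
        ENGLISH_STOP_WORDS[0] = "oops";

        // The shared set rejects modification.
        try {
            ENGLISH_STOP_WORDS_SET.add("oops");
        } catch (UnsupportedOperationException e) {
            System.out.println("shared set is read-only");
        }
    }
}
```

      Because the shared set never changes after class initialization, every analyzer instance can reference the same object instead of building its own copy.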

      Let me know what you think and I'll create a patch for it.

      simon

      1. LUCENE-1688.patch
        35 kB
        Mark Miller
      2. LUCENE-1688.patch
        19 kB
        Mark Miller
      3. StopWords.patch
        19 kB
        Simon Willnauer

        Activity

        markrmiller@gmail.com Mark Miller added a comment -

        Thanks Simon!

        markrmiller@gmail.com Mark Miller added a comment -

        all tests pass

        markrmiller@gmail.com Mark Miller added a comment -

        To trunk. Still needs a bit of a look over.

        markrmiller@gmail.com Mark Miller added a comment -

        If no one else claims this for 2.9, I guess I'll do it.

        simonw Simon Willnauer added a comment -

        Attached a patch that marks the ENGLISH_STOP_WORDS as deprecated.
        I cleaned up StopAnalyzer (final anyway) a little bit.
        Added an UnmodifiableCharArraySet impl as a private inner class, plus a test case.
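
        For readers without the patch at hand, the following is a rough sketch of what a read-only set hidden behind a private nested class can look like. It is not the code from the patch: it wraps a plain java.util.Set rather than extending CharArraySet, and the names WordSets, unmodifiable and UnmodifiableWordSet are hypothetical.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

final class WordSets {
    private WordSets() {}

    // Returns a read-only set backed by a private copy of the input,
    // so later changes to the caller's set do not leak in.
    static Set<String> unmodifiable(Set<String> words) {
        return new UnmodifiableWordSet(new HashSet<>(words));
    }

    // Hidden implementation: callers only ever see the Set interface.
    private static final class UnmodifiableWordSet extends AbstractSet<String> {
        private final Set<String> delegate;

        UnmodifiableWordSet(Set<String> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean contains(Object o) {
            return delegate.contains(o);
        }

        @Override
        public int size() {
            return delegate.size();
        }

        @Override
        public Iterator<String> iterator() {
            final Iterator<String> it = delegate.iterator();
            return new Iterator<String>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public String next() { return it.next(); }
                @Override public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
        // add() is inherited from AbstractCollection, which throws
        // UnsupportedOperationException unless overridden.
    }
}
```

        Keeping the implementation class private means the read-only behavior can only be obtained through the factory method, which leaves room to change the implementation later.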

        mikemccand Michael McCandless added a comment -

        This sounds great Simon!


          People

          • Assignee:
            markrmiller@gmail.com Mark Miller
            Reporter:
            simonw Simon Willnauer