Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: PC

    • Bugzilla Id:
      28960

      Description

      In org.apache.lucene.analysis.StopAnalyzer, the ENGLISH_STOP_WORDS array
      contains "a" but not "an". So searching for "a fund" will get the same hits as
      "fund", but searching for "an investment" will get many more hits than "investment".

      This is true in the latest revision of the file, but appears to have always been
      the case. I'm amazed nobody's pointed it out before now; our users had only
      been testing for a few hours before they complained about it.
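
      The asymmetry described above can be illustrated with a minimal,
      self-contained sketch. It does not use the Lucene API: the class name,
      the analyze() helper, and the abbreviated stop list are hypothetical
      stand-ins, chosen only to show that "a" is dropped while "an" survives
      as a query term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // Hypothetical, abbreviated stop list for illustration only.
    // Like the list in the report, it contains "a" but not "an".
    static final Set<String> STOP_WORDS_WITHOUT_AN =
            new HashSet<>(Arrays.asList("a", "and", "of", "the"));

    // Lower-cases the text, splits on whitespace, and drops stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // "a" is removed, so only "fund" is left to search for.
        System.out.println(analyze("a fund", STOP_WORDS_WITHOUT_AN));
        // "an" is not in the list, so it stays in the query as an extra term.
        System.out.println(analyze("an investment", STOP_WORDS_WITHOUT_AN));
    }
}
```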

        Activity

        otis@apache.org Otis Gospodnetic added a comment -

        You are right, that's not really consistent. However, why are you relying on
        that stop list? If you want to use English stop words, you should really have
        your own, more comprehensive list, and set that list.

        Adding the 'an' will break backwards compatibility for those who rely on the
        built-in stop word list.

        ats37@hotmail.com Andrew Stevens added a comment -

        >Adding the 'an' will break backwards compatibility for those who rely on the
        built-in stop word list.

        In other words, it's not a bug, it's a feature?
        You could make the same argument about any bug in any system, "Oh no, we can't
        fix it, that would annoy anyone who relies on the broken behaviour..."

        daniel.naber@t-online.de Daniel Naber added a comment -

        I agree that this bug should be fixed, i.e. "an" should be added to the
        stopword list and that change should be documented in CHANGES.txt. Anyone
        who uses Lucene seriously will need to read that file anyway when he updates.

        otis@apache.org Otis Gospodnetic added a comment -

        Re-opening...

        otis@apache.org Otis Gospodnetic added a comment -

        Fixed.

        cutting@apache.org cutting@apache.org added a comment -

        This is a can of worms I'm hesitant to open. If we add "an" then we'll be asked
        to add "its", and if we add "its" we'll be asked to add "do", and so on. This
        stop list was originally generated by looking at the most frequent terms in a
        collection. I guess "an" was less frequent than "a" or any other word in that
        collection. There are other, better, ways to define stop lists, but I don't
        think the Lucene project should be in the business of providing high-quality stop
        lists. The Snowball project is a much better place for that sort of activity.

        If you want a good, big, English stop list, grab:

        http://snowball.tartarus.org/english/stop.txt

        I think the best long-term fix for this is to extend the Snowball library in the
        sandbox (http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/) so that
        it provides StopFilters for each of the stop lists provided by Snowball. Once
        we do this, we can deprecate uses of StopFilter and StopAnalyzer that do not
        specify a custom stop list. The deprecation documentation can point folks to
        the Snowball stop filters. How does that sound?

        Any volunteers to implement Snowball-based StopFilters? I think this could just
        be a static method, something like:
        public static StopFilter getStopFilter(String language);
        The implementation could use ClassLoader.getResource() to find a stop list file
        packaged in the jar file, then parse the file and construct a StopFilter from
        it. It should probably also cache these, so that every call doesn't re-parse
        the file.
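
        A rough sketch of the load, parse, and cache steps proposed above,
        under several assumptions: it returns a plain Set<String> rather than a
        Lucene StopFilter so that it stays self-contained; the class name, the
        getStopWords() and parse() methods, and the resource path
        "stopwords/<language>.txt" are hypothetical; and the file is assumed to
        be UTF-8 with one word per line, with "|" starting a comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SnowballStopWords {
    // One parsed stop list per language, so repeated calls do not re-parse the file.
    private static final Map<String, Set<String>> CACHE = new ConcurrentHashMap<>();

    // Returns the cached stop list for a language, loading it on first use.
    public static Set<String> getStopWords(String language) {
        return CACHE.computeIfAbsent(language, SnowballStopWords::load);
    }

    // Looks up the stop list on the classpath (e.g. packaged in the jar).
    private static Set<String> load(String language) {
        String path = "stopwords/" + language + ".txt"; // hypothetical layout
        InputStream in = SnowballStopWords.class.getClassLoader().getResourceAsStream(path);
        if (in == null) {
            throw new IllegalArgumentException("No stop list for language: " + language);
        }
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses one word per line, ignoring blank lines and anything after '|'.
    static Set<String> parse(Reader reader) {
        Set<String> words = new LinkedHashSet<>();
        new BufferedReader(reader).lines().forEach(line -> {
            int bar = line.indexOf('|');
            String word = (bar >= 0 ? line.substring(0, bar) : line).trim();
            if (!word.isEmpty()) {
                words.add(word.toLowerCase(Locale.ROOT));
            }
        });
        return Collections.unmodifiableSet(words);
    }
}
```

        In this sketch a failed lookup throws and is not cached, so a stop list
        added to the classpath later would still be picked up on the next call.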

        daniel.naber@t-online.de Daniel Naber added a comment -

        "an" was added, so I'm not sure why this report is still open. Feel free to
        re-open it again if I overlooked something.


          People

          • Assignee:
            java-dev@lucene.apache.org Lucene Developers
          • Reporter:
            ats37@hotmail.com Andrew Stevens
          • Votes:
            0
          • Watchers:
            1