Lucene - Core / LUCENE-494

Analyzer for preventing overload of search service by queries with common terms in large indexes

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels: None

      Description

      An analyzer, used primarily at query time, that wraps another analyzer and adds a layer of protection
      by preventing very common words from being passed into queries. For very large indexes the cost
      of reading TermDocs for a very common word can be high. This analyzer was created after experience with
      a 38 million document index in which one term appeared in around 50% of docs, causing TermQueries for
      that term to take 2 seconds.

      Use the various "addStopWords" methods in this class to automate the identification and addition of
      stop words found in an already existing index.
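
      The idea of deriving stop words from an existing index can be sketched without Lucene at all. The
      block below is a minimal illustration, not the committed class: `CommonTermGuard`, its method
      signatures, and the in-memory model of an index (a list of per-document term sets) are all
      assumptions made for this example. It counts document frequencies, marks every term that occurs in
      more than a given fraction of documents as a stop word, and then drops those terms from query tokens.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in, not the Lucene class: derives stop words from document frequencies. */
final class CommonTermGuard {
    private final Set<String> stopWords = new HashSet<>();

    /**
     * Marks as a stop word every term found in more than maxPercentDocs of the documents.
     * Each document is modeled as the set of its distinct terms (an assumption of this sketch).
     * Returns the number of stop words newly added.
     */
    int addStopWords(List<Set<String>> docs, float maxPercentDocs) {
        // Count in how many documents each term appears.
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int threshold = (int) (docs.size() * maxPercentDocs);
        int added = 0;
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > threshold && stopWords.add(e.getKey())) {
                added++;
            }
        }
        return added;
    }

    /** Drops stop words from tokens that a wrapped analysis step has already produced. */
    List<String> filter(List<String> queryTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : queryTokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

      In this sketch the threshold is a fraction of the document count, so the same setting keeps working
      as the index grows; an absolute document-frequency cutoff would be an equally plausible choice.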

        Activity

        gsingers Grant Ingersoll added a comment -

        This seems generally useful and could go in contrib/analysis I suppose. Any thoughts on it, Mark, in hindsight? Do you still use it from time to time or do you now think there are better ways of doing it?

        gsingers Grant Ingersoll added a comment -

        I think it makes sense to add this in after the 2.3 release.

        markh Mark Harwood added a comment -

        I personally don't use this but others may. It was easier to solve my particular problem by adding stop words to my XSL query templates (I added support to the XMLQueryParser for the "FuzzyLikeThisQuery" tag to take stop words). This was more about ease of configuration in my particular app.

        I know Nutch has something similar implemented elsewhere - maybe in the query parser.

        I also had the notion that wrapping IndexReader to auto-cache TermDocs for super-popular terms in a BitSet would be a good way to avoid the IO overhead. This BitSet wouldn't help resolve positional queries (e.g. phrase/span queries), which need a TermPositions implementation, but would work for straight TermQueries.
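
        The BitSet caching notion might look roughly like the following. This is only a sketch under stated assumptions: `PopularTermCache` is a hypothetical class, not a Lucene API, and the `postingsLoader` function stands in for whatever reads a term's document ids from the index. Terms whose document count reaches a chosen fraction of the index are kept as a `java.util.BitSet`, so later lookups skip the load entirely; rarer terms are loaded every time.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch, not a Lucene API: caches doc ids of very common terms as BitSets. */
final class PopularTermCache {
    private final int numDocs;
    private final float minFraction;
    // Assumption: stands in for the expensive read of a term's doc ids from disk.
    private final Function<String, int[]> postingsLoader;
    private final Map<String, BitSet> cache = new HashMap<>();

    PopularTermCache(int numDocs, float minFraction, Function<String, int[]> postingsLoader) {
        this.numDocs = numDocs;
        this.minFraction = minFraction;
        this.postingsLoader = postingsLoader;
    }

    /**
     * Returns the set of matching doc ids for a term.
     * Cached instances are shared, so callers must treat the result as read-only.
     */
    BitSet docsFor(String term) {
        BitSet cached = cache.get(term);
        if (cached != null) {
            return cached; // popular term: no IO needed
        }
        int[] docIds = postingsLoader.apply(term);
        BitSet bits = new BitSet(numDocs);
        for (int id : docIds) {
            bits.set(id);
        }
        // Only terms matching a large share of the index are worth the memory.
        if (docIds.length >= numDocs * minFraction) {
            cache.put(term, bits);
        }
        return bits;
    }

    boolean isCached(String term) {
        return cache.containsKey(term);
    }
}
```

        As noted above, a bit per document records only whether the term occurs, not where, so a structure like this could serve plain term matching but not phrase or span queries.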

        gsingers Grant Ingersoll added a comment -

        Committed, thanks Mark!


          People

          • Assignee: gsingers Grant Ingersoll
          • Reporter: markh Mark Harwood