Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-4628

Add common terms query to gracefully handle very high frequent terms dynamically

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I had this problem quite a couple of times the last couple of month that searches very often contained super high frequent terms and disjunction queries became way too slow. The main problem was that stopword filtering wasn't really an option since in the domain those high-freq terms where not really stopwords though. So for instance searching for a song title "this is it" or for a band "A" didn't really fly with stopwords. I thought about that for a while and came up with a query based solution that decides based on a threshold if something is considered a stopword or not and if so it moves the term in two boolean queries one for high-frequent and one for low-frequent such that those high frequent terms are only matched if the low-frequent sub-query produces a match. Yet if all terms are high frequent it makes the entire thing a Conjunction which gave me reasonable results as well as performance.

        Attachments

        1. LUCENE-4628.patch
          27 kB
          Simon Willnauer
        2. LUCENE-4628.patch
          24 kB
          Simon Willnauer

          Activity

            People

            • Assignee:
              simonw Simon Willnauer
              Reporter:
              simonw Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: