  1. Lucene - Core
  2. LUCENE-1853

SubPhraseQuery for matching and scoring sub phrase matches.


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Environment: Lucene/Java
    • Lucene Fields: New, Patch Available

    Description

      The goal is to give more control, via configuration, when searching with user-entered queries against multiple fields where sub phrases have special significance.

      For a query like "homes in new york with swimming pool", if a document's field matches only "new york", it should be scored, and scored higher than two separate matches on "new" and "york". Also, a 3-word sub phrase match should be scored considerably higher than a 2-word sub phrase match. (The boost factor should be configurable.)

      Using shingles for this use case means that each field of each document, as well as the query, needs to be indexed as shingles of all (1..N)-grams. (Please correct me if I am wrong.)
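
      To illustrate the cost of the shingle approach mentioned above, here is a minimal sketch in plain Java that expands a token list into all word n-grams between a minimum and maximum size. It does not use Lucene's own shingle analysis classes; the class and method names (ShingleSketch, shingles) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or of the attached patch.
public class ShingleSketch {

    // Returns every contiguous word n-gram with minSize <= n <= maxSize,
    // in order of starting position.
    public static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 1; n <= maxSize && start + n <= tokens.size(); n++) {
                if (n > 1) {
                    sb.append(' ');
                }
                sb.append(tokens.get(start + n - 1));
                if (n >= minSize) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }
}
```

      For the four tokens "homes in new york" with sizes 1 to 3, this produces nine entries, which suggests how quickly the number of indexed terms grows with N.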

      The query could also support

      • ignoring idf and/or field norms (so that factors outside the document don't influence scoring)
      • considering only the longest match (for example, a match on "new york" is scored rather than separate matches on "new" furniture and "york" city)
      • ignoring duplicates ("new york" appearing two or three times makes no difference)
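
      The behaviour described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the scoring used by the attached patch: the class name SubPhraseScoreSketch and the formula len * phraseBoost^(len - 1) are hypothetical, phraseBoost is assumed to be at least 1, and idf and field norms are simply left out of the computation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the real SubPhraseQuery scoring may differ.
public class SubPhraseScoreSketch {

    public static double score(List<String> query, List<String> field,
                               double phraseBoost, boolean matchOnlyLongest,
                               boolean ignoreDuplicates) {
        Set<String> seen = new HashSet<>();
        double sum = 0.0;   // total over all maximal sub phrase matches
        double best = 0.0;  // score of the longest sub phrase match
        for (int i = 0; i < query.size(); i++) {
            for (int j = 0; j < field.size(); j++) {
                if (!query.get(i).equals(field.get(j))) {
                    continue;
                }
                // Skip positions that only continue a run started earlier.
                if (i > 0 && j > 0 && query.get(i - 1).equals(field.get(j - 1))) {
                    continue;
                }
                // Extend the run as far as query and field agree.
                int len = 0;
                while (i + len < query.size() && j + len < field.size()
                        && query.get(i + len).equals(field.get(j + len))) {
                    len++;
                }
                String phrase = String.join(" ", query.subList(i, i + len));
                if (ignoreDuplicates && !seen.add(phrase)) {
                    continue; // this sub phrase was already counted
                }
                // Assumed formula: longer runs score disproportionately higher.
                double s = len * Math.pow(phraseBoost, len - 1);
                sum += s;
                best = Math.max(best, s);
            }
        }
        return matchOnlyLongest ? best : sum;
    }
}
```

      With a boost of 2 and the query "homes in new york with swimming pool", a field containing "new york apartments" scores 4 under this formula, while "new furniture from york city" scores 2 when all matches are summed and 1 when only the longest match is considered.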

      This kind of query could be combined with a DisMax query. For example, something like Solr's dismax request handler could be made to use this query: the user query is run as is against all fields, and each field is configured with the options above.
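
      As a rough sketch of how per-field scores might be combined in a DisMax style, the plain Java below takes the best field score and adds a tie-breaker fraction of the remaining field scores. It mirrors the general disjunction-max idea only; the class name DisMaxSketch and the sample values are hypothetical, and it does not call Lucene or Solr.

```java
// Hypothetical sketch of disjunction-max style score combination.
public class DisMaxSketch {

    // Returns the maximum field score plus tieBreaker times the sum
    // of all other field scores.
    public static double combine(double[] fieldScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }
}
```

      With a tie-breaker of 0, only the best-matching field counts, so a strong sub phrase match in one field is not diluted by weak matches elsewhere.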

      I have also attached a patch with comments and test cases, in case my description is not clear enough. I would appreciate alternatives or feedback.

      Example Usage:

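      <code>
      // Sub phrase configuration
      SubPhraseQuery.SubPhraseConfig conf = new SubPhraseQuery.SubPhraseConfig();
      conf.ignoreIdf = true;          // leave idf out of the score
      conf.ignoreFieldNorms = true;   // leave field norms out of the score
      conf.matchOnlyLongest = true;   // score only the longest sub phrase match
      conf.ignoreDuplicates = true;   // repeated matches of a sub phrase count once
      conf.phraseBoost = 2;           // configurable boost for sub phrase matches
      // Build the query like a regular phrase query, one term per query word
      // (the words "new" and "york" are used here as an example)
      SubPhraseQuery pq = new SubPhraseQuery();
      pq.add(new Term("f", "new"));
      pq.add(new Term("f", "york"));
      pq.setSubPhraseConf(conf);
      Hits hits = searcher.search(pq);
      </code>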

      Attachments

        1. LUCENE-1853.patch
          39 kB
          Preetam Rao
        2. LUCENE-1853.patch
          30 kB
          Preetam Rao

          People

            Assignee: Unassigned
            Reporter: Preetam Rao
            Votes: 0
            Watchers: 4
