Lucene - Core
LUCENE-1853

SubPhraseQuery for matching and scoring sub phrase matches.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Lucene/Java

    • Lucene Fields:
      New, Patch Available

      Description

      The goal is to give more control, via configuration, over searches that run user-entered queries against multiple fields where sub phrases have special significance.

      For a query like "homes in new york with swimming pool", a document whose field matches only "new york" should still be scored, and it should score higher than a document with two separate matches on "new" and "york". Likewise, a 3-word sub phrase match must score considerably higher than a 2-word sub phrase match. (The boost factor should be configurable.)

      Using shingles for this use case means that each field of each document, as well as the query, has to be turned into shingles of all (1..N)-grams. (Please correct me if I am wrong.)
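
      To make the cost of the shingle approach concrete, here is a minimal sketch in plain Java that enumerates every 1..N-word shingle of a whitespace-tokenized text. The names `ShingleSketch` and `shingles` are hypothetical, and this is not Lucene's shingle filter, only an illustration of how many extra terms each field value would produce.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {
    /** Returns all contiguous word n-grams of length 1..maxSize, ordered by start position. */
    public static List<String> shingles(String text, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to shingle
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the shingle one word at a time, up to maxSize words.
            for (int len = 1; len <= maxSize && start + len <= words.length; len++) {
                if (len > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + len - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints every 1-, 2- and 3-word shingle of the text.
        System.out.println(shingles("homes in new york", 3));
    }
}
```

      A 4-word text with N = 3 already yields 9 shingles, which is the index growth the description is concerned about.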

      The query could also support

      • ignoring idf and/or field norms (so that factors outside the document don't influence scoring)
      • considering only the longest match (for example, a match on "new york" is scored rather than separate matches on "new" furniture and "york" city)
      • ignoring duplicates ("new york" appearing two or three times makes no difference)
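
      The options above can be illustrated with a small sketch in plain Java. Everything here is an assumption made for illustration: the class `SubPhraseScoreSketch`, its `Config` fields, and in particular the formula (an n-word match contributes n * phraseBoost^(n-1)) are not taken from the attached patch. No idf or field norms are involved, which corresponds to ignoring both.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SubPhraseScoreSketch {
    /** Hypothetical options mirroring the proposal; not the patch's actual classes. */
    public static class Config {
        public boolean matchOnlyLongest = false;
        public boolean ignoreDuplicates = false;
        public double phraseBoost = 2.0;
    }

    /** Finds maximal runs of consecutive query words that also appear consecutively in the field. */
    public static List<String> maximalMatches(String[] query, String[] field) {
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < field.length; i++) {
            for (int j = 0; j < query.length; j++) {
                if (!field[i].equals(query[j])) {
                    continue;
                }
                // Skip pairs that merely continue a run that started one word earlier.
                if (i > 0 && j > 0 && field[i - 1].equals(query[j - 1])) {
                    continue;
                }
                int n = 0;
                StringBuilder sb = new StringBuilder();
                while (i + n < field.length && j + n < query.length
                        && field[i + n].equals(query[j + n])) {
                    if (n > 0) {
                        sb.append(' ');
                    }
                    sb.append(field[i + n]);
                    n++;
                }
                runs.add(sb.toString());
            }
        }
        return runs;
    }

    /** Assumed formula: an n-word match contributes n * phraseBoost^(n-1). */
    public static double score(String[] query, String[] field, Config conf) {
        Set<String> seen = new HashSet<>();
        double sum = 0;
        double best = 0;
        for (String run : maximalMatches(query, field)) {
            if (conf.ignoreDuplicates && !seen.add(run)) {
                continue; // the same sub phrase was already counted
            }
            int words = run.split(" ").length;
            double s = words * Math.pow(conf.phraseBoost, words - 1);
            sum += s;
            best = Math.max(best, s);
        }
        return conf.matchOnlyLongest ? best : sum;
    }

    public static void main(String[] args) {
        String[] q = "homes in new york with swimming pool".split(" ");
        Config conf = new Config();
        System.out.println(score(q, "new york".split(" "), conf));
        System.out.println(score(q, "new furniture from york".split(" "), conf));
    }
}
```

      With a boost greater than 1, the assumed formula scores a contiguous "new york" above separate "new" and "york" matches, and a 3-word match well above a 2-word one, which is the ordering the description asks for.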

      This kind of query could be combined with a DisMax query. For example, something like Solr's dismax request handler could be made to use this query: the user query is run as is against all fields, and each field is configured with the options above.
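
      As a rough sketch of how the DisMax combination would behave, the plain Java below combines per-field scores the way a disjunction-max query does: the best field score, plus a tie-breaker fraction of the other fields' scores. The class name, field names, and score values are hypothetical, and scores are assumed to be non-negative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DisMaxSketch {
    /** Returns max(fieldScores) + tieBreaker * (sum of the remaining field scores). */
    public static double disMax(Map<String, Double> fieldScores, double tieBreaker) {
        double max = 0;
        double sum = 0;
        for (double s : fieldScores.values()) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical per-field sub phrase scores for one document.
        Map<String, Double> perField = new LinkedHashMap<>();
        perField.put("city", 4.0);
        perField.put("description", 2.0);
        perField.put("amenities", 1.0);
        System.out.println(disMax(perField, 0.5));
    }
}
```

      With a tie breaker of 0, only the best-matching field counts; larger values let matches in the other fields contribute as well.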

      I have also attached a patch with comments and test cases, in case my description is not clear enough. I would appreciate alternatives or feedback.

      Example Usage:

      <code>
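      // Sub phrase config (SubPhraseQuery and SubPhraseConfig come from the attached patch)
      SubPhraseQuery.SubPhraseConfig conf = new SubPhraseQuery.SubPhraseConfig();
      conf.ignoreIdf = true;          // leave idf out of the score
      conf.ignoreFieldNorms = true;   // leave field norms out of the score
      conf.matchOnlyLongest = true;   // consider only the longest sub phrase match
      conf.ignoreDuplicates = true;   // repeated matches make no difference
      conf.phraseBoost = 2;           // configurable boost for sub phrase matches
      // Build the query as usual for a phrase query, one add() per query term
      SubPhraseQuery pq = new SubPhraseQuery();
      pq.add(new Term("f", "new"));
      pq.add(new Term("f", "york"));
      pq.setSubPhraseConf(conf);
      // Fetch the top 10 matches
      TopDocs hits = searcher.search(pq, 10);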
      </code>

      1. LUCENE-1853.patch
        30 kB
        Preetam Rao
      2. LUCENE-1853.patch
        39 kB
        Preetam Rao

        Activity

        Erick Erickson made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 JIRA.

        OK, we'll close this given Shalin's comment.

        Shalin Shekhar Mangar added a comment -

        Erick, SubPhraseQuery was written by Preetam for AOL Real Estate search. AFAIK, no one is working actively on it.

        Erick Erickson added a comment -

        SPRING_CLEANING_2013 JIRAS. Does anyone want to comment on whether this is still valid? Doubtless the patch is, at best, a guide.

        Preetam Rao made changes -
        Original Estimate 336h [ 1209600 ]
        Remaining Estimate 336h [ 1209600 ]
        Preetam Rao made changes -
        Summary: "PhraseQuery Scorer for scoring sub phrase matches" changed to "SubPhraseQuery for matching and scoring sub phrase matches."
        Description (previous version): For a query like "homes in new york with swimming pool", a document whose field matches only "new york" should still be scored, and it should score higher than a document with two separate matches on "new" and "york". Likewise, a 3-word sub phrase match must score considerably higher than a 2-word sub phrase match. (The boost factor should be configurable.)

        This kind of query is useful when a user query is taken as is, without parsing, and searched against multiple fields, where each sub-phrase can match against a different field.

        Using shingles for this use case means that each field of each document, as well as the query, has to be turned into shingles of all (1..N)-grams. (Please correct me if I am wrong.)

        The scorer could also support
        - ignoring idf and/or field norms (so that factors outside the document don't influence scoring)
        - considering only the longest match (for example, a match on "new york" is scored rather than separate matches on "new" furniture and "york" city)
        - ignoring duplicates ("new york" appearing two or three times makes no difference)

        This kind of query (Phrase Query with SubPhraseScorer) could be combined with a DisMax query. For example, something like Solr's dismax request handler could be made to use this query: the user query is run as is against all fields, and each field is configured with the options above.

        I have also attached a patch with comments and test cases, in case my description is not clear enough. I would appreciate alternatives or feedback. The goal is to give more control via configuration when searching with user-entered queries against multiple fields where sub phrases have special significance.

        Example Usage:

        <code>
            // sub phrase config
            PhraseQuery.SubPhraseConfig conf = new PhraseQuery.SubPhraseConfig();
            conf.ignoreIdf = true;
            conf.ignoreFieldNorms = true;
            conf.matchOnlyLongest = true;
            conf.ignoreDuplicates = true;
            conf.phraseBoost = 2;
            // phrase query as usual
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term("f", term));
            pq.add(new Term("f", term));
            pq.setSubPhraseConf(conf);
            Hits hits = searcher.search(pq);
        </code>
        Preetam Rao added a comment -

        Removed the dependency on PhraseQuery so that this can be reviewed and used independently. Made it a separate query with configurations specific to sub phrase matches. The new patch makes no changes to any existing files. Please let me know your thoughts.

        Preetam Rao made changes -
        Attachment LUCENE-1853.patch [ 12418119 ]
        Preetam Rao added a comment -

        Removed the dependency on PhraseQuery. Created a new Query called "SubPhraseQuery". Created a new patch with separate new source files, without any changes to existing files.

        Preetam Rao made changes -
        Fix Version/s 2.9 [ 12312682 ]
        Preetam Rao made changes -
        Attachment LUCENE-1853.patch [ 12417595 ]
        Preetam Rao added a comment -

        Attached a patch with test cases. Position increments and offsets are always assumed to increase by 1. May not work with increments other than

        Preetam Rao created issue -

          People

          • Assignee: Unassigned
          • Reporter: Preetam Rao
          • Votes: 0
          • Watchers: 4
