Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-7539

Add a QueryAutofilteringComponent for query introspection using indexed metadata

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None

      Description

      The Query Autofiltering Component provides a method of inferring user intent by matching noun phrases that are typically used for faceted-navigation into Solr filter or boost queries (depending on configuration settings) so that more precise user queries can be met with more precise results.

      The algorithm uses a "longest contiguous phrase match" strategy which allows it to disambiguate queries where single terms are ambiguous but phrases are not. It will work when there is structured information in the form of String fields that are normally used for faceted navigation. It works across fields by building a map of search term to index field using the Lucene FieldCache (UninvertingReader). This enables users to create free text, multi-term queries that combine attributes across facet fields - as if they had searched and then navigated through several facet layers. To address the problem of exact-match only semantics of String fields, support for synonyms (including multi-term synonyms) and stemming was added.

      1. SOLR-7539.patch
        110 kB
        Ted Sullivan
      2. SOLR-7539.patch
        84 kB
        Ted Sullivan
      3. SOLR-7539.patch
        80 kB
        Ted Sullivan
      4. SOLR-7539.patch
        77 kB
        Ted Sullivan

        Activity

        Hide
        tedsullivan Ted Sullivan added a comment -

        Initial patch uploaded. I have published a blog article explaining the rationale of this component, etc at

        http://lucidworks.com/blog/query-autofiltering-revisited-can-precise/

        Show
        tedsullivan Ted Sullivan added a comment - Initial patch uploaded. I have published a blog article explaining the rationale of this component, etc at http://lucidworks.com/blog/query-autofiltering-revisited-can-precise/
        Hide
        tedsullivan Ted Sullivan added a comment -

        Added logic that will determine if a field is single or multi-valued enabling the filter to "do the right thing" when users use logic phrases ("and" and "or") within queries. It turns out that knowing whether a field is single or multi-value can enable the query autofiltering component to disambiguate user entered boolean phrases. For example, the query "blue or red cars" means the same as "blue and red cars" since the color field is single valued - a single car can only be blue or red. Given that a property can only have one value for a given record, for a set of cars, with "or" we mean "either" and with "and" we mean "both", which both translate to a set UNION operation. In other words, for single valued fields when talking about a group of things, "and" and "or" are synonyms. The QueryAutofilteringComponent will detect that "red" and "blue" are both values of the "color" field, determine that "color" is not a multi-valued field and then translate this into an fq color:(blue OR red) no matter what logical term was entered in the query phrase. Using AND in this fq would of course yield 0 results because by definition, no item can have a value of both "red" and "blue".

        However, for multi-valued fields, "and" and "or" are antonyms in common parlance, as in "cars with GPS, voice-activated bluetooth and heated seats" - which means that I only want cars that have all three of these options (if i say "or" in this phrase instead of "and", it means something different - I want to see cars with any of these options). If "GPS", "voice-activated bluetooth" and "heated seats" are all indexed as values in an "options" field, the query autofiltering component will recognize that these are values of the same multi-valued field and will create the fq options:(GPS AND "voice-activated bluetooth" AND "heated seats") for the first case and options:(GPS OR "voice-activated bluetooth" OR "heated seats") for the second. And" is the default as in "show me inexpensive, fuel-efficient, safe cars" which implies that I want to see cars with all of these attributes.

        So, the vernacular usage of "and" and "or" is dependent on the context. For single value fields, they are synonyms but for multi-value fields they are antonyms. The context also depends on whether we are talking about one thing or a group of things - "red and blue cars" has different meaning from "a red and blue car". The latter must have a custom paint job so that it has both red and blue. This problem is not solved currently because it requires a more semantically intelligent stemming operation. For now, I assume that the user is looking for a set of things as is most often the case with search.

        Show
        tedsullivan Ted Sullivan added a comment - Added logic that will determine if a field is single or multi-valued enabling the filter to "do the right thing" when users use logic phrases ("and" and "or") within queries. It turns out that knowing whether a field is single or multi-value can enable the query autofiltering component to disambiguate user entered boolean phrases. For example, the query "blue or red cars" means the same as "blue and red cars" since the color field is single valued - a single car can only be blue or red. Given that a property can only have one value for a given record, for a set of cars, with "or" we mean "either" and with "and" we mean "both", which both translate to a set UNION operation. In other words, for single valued fields when talking about a group of things, "and" and "or" are synonyms. The QueryAutofilteringComponent will detect that "red" and "blue" are both values of the "color" field, determine that "color" is not a multi-valued field and then translate this into an fq color:(blue OR red) no matter what logical term was entered in the query phrase. Using AND in this fq would of course yield 0 results because by definition, no item can have a value of both "red" and "blue". However, for multi-valued fields, "and" and "or" are antonyms in common parlance, as in "cars with GPS, voice-activated bluetooth and heated seats" - which means that I only want cars that have all three of these options (if i say "or" in this phrase instead of "and", it means something different - I want to see cars with any of these options). If "GPS", "voice-activated bluetooth" and "heated seats" are all indexed as values in an "options" field, the query autofiltering component will recognize that these are values of the same multi-valued field and will create the fq options:(GPS AND "voice-activated bluetooth" AND "heated seats") for the first case and options:(GPS OR "voice-activated bluetooth" OR "heated seats") for the second. And" is the default as in "show me inexpensive, fuel-efficient, safe cars" which implies that I want to see cars with all of these attributes. So, the vernacular usage of "and" and "or" is dependent on the context. For single value fields, they are synonyms but for multi-value fields they are antonyms. The context also depends on whether we are talking about one thing or a group of things - "red and blue cars" has different meaning from "a red and blue car". The latter must have a custom paint job so that it has both red and blue. This problem is not solved currently because it requires a more semantically intelligent stemming operation. For now, I assume that the user is looking for a set of things as is most often the case with search.
        Hide
        tedsullivan Ted Sullivan added a comment -

        Added logic to handle the case where a phrase match on field values can also have a set of single term matches if that is valid. So for the use case where "White Linen" is a brand, "perfume" is a product type, "white" is a color, "linen" is a material and "shirts" is a product type. The query "White Linen perfume" will be parsed to (brand:"White Linen") OR (color:white AND material:linen)) AND product_type:perfume. This ensures that the correct match will be returned for either this query or "white linen shirts". Without this fix, the brand match would overrule the single term match as it has more terms. This secondary rule only happens if a phrase match is encountered AND there is a complete set of single term matches for that phrase.

        Also made the internal field value delimiter configurable and set it to '|' by default. The original code used ',' which can probably occur in String field values.

        Show
        tedsullivan Ted Sullivan added a comment - Added logic to handle the case where a phrase match on field values can also have a set of single term matches if that is valid. So for the use case where "White Linen" is a brand, "perfume" is a product type, "white" is a color, "linen" is a material and "shirts" is a product type. The query "White Linen perfume" will be parsed to (brand:"White Linen") OR (color:white AND material:linen)) AND product_type:perfume. This ensures that the correct match will be returned for either this query or "white linen shirts". Without this fix, the brand match would overrule the single term match as it has more terms. This secondary rule only happens if a phrase match is encountered AND there is a complete set of single term matches for that phrase. Also made the internal field value delimiter configurable and set it to '|' by default. The original code used ',' which can probably occur in String field values.
        Hide
        billnbell Bill Bell added a comment -

        +1

        Show
        billnbell Bill Bell added a comment - +1
        Hide
        jmlucjav jmlucjav added a comment -

        This would be a great addition, I have used some custom handler to deal with these situations, wrote about it here
        If I might suggest something, I would add some config to allow certain values to be excluded from being used as autofilters, this is useful when for example a brand name is also a common word that can occur naturally in other places etc.

        Show
        jmlucjav jmlucjav added a comment - This would be a great addition, I have used some custom handler to deal with these situations, wrote about it here If I might suggest something, I would add some config to allow certain values to be excluded from being used as autofilters, this is useful when for example a brand name is also a common word that can occur naturally in other places etc.
        Hide
        tedsullivan Ted Sullivan added a comment -

        Thanks jmlucjav - we seem to be on the same page. Thanks for the reference link to your blog. You make some very interesting and useful points. To answer your question, the query autofilter has a configuration to exclude fields that you don't want to be autofiltered. This may also be useful to protect the FST lookup in cases where the field has a very large number of values.

        Show
        tedsullivan Ted Sullivan added a comment - Thanks jmlucjav - we seem to be on the same page. Thanks for the reference link to your blog. You make some very interesting and useful points. To answer your question, the query autofilter has a configuration to exclude fields that you don't want to be autofiltered. This may also be useful to protect the FST lookup in cases where the field has a very large number of values.
        Hide
        tedsullivan Ted Sullivan added a comment -

        I have just uploaded a new patch that adds what I call "verb support" to the autofilter. This enables you to specify terms that will constrain the autofilter field choices. The example I have been using for this is a Music Ontology that I have been using to demonstrate the features of the query autofilter. So if I have records of musicians, songwriters, songs etc. with fields like performer_ss, composer_ss, composition_type_s and so on. If I search for

        songs written by Johnny Cash
        vs
        songs performed by Johnny Cash

        Without the verb support, the autofilter would pick up composition_type_s:Song and performer_ss:"Johnny Cash" OR composer_ss:"Johnny Cash" because this artist has documents in which he is either (or both) a performer and a songwriter. That is, both of these queries would return the same results because neither 'written' or 'performed' is a value in any document field.

        By adding configurations like this and some supporting code

        <searchComponent name="autofilter" class="org.apache.solr.handler.component.QueryAutoFilteringComponent" >

        <arr name="verbModifiers">
        <str>written,wrote,composed:composer_ss</str>
        <str>performed,played,sang,recorded:performer_ss</str>
        </arr>
        </searchComponent>

        The above queries work as expected. The code detects the presence of the modifier in proximity to a term that occurs in the search field (for 'written' that would be 'composer_ss') and then collapses the choices to that field alone so

        (composer_ss:"Johnny Cash" OR performer_ss:"Johnny Cash") becomes just composer_ss:"Johnny Cash" when the verb is 'written' and performer_ss:"Johnny Cash" when the verb is 'performed'.

        In addition, noun phrases that are composed of two different nouns in which one acts as a qualifier of the other as in "Beatles Songs" are handled with this configuration:

        <str>covered,covers:performer_ss|version_s:Cover|original_performer_s:ENTITY,recording_type_ss:Song=>original_performer_s:ENTITY</str>

        In this case, "Beatles Songs" is a single noun phrase that refers to songs written by one or more of the Beatles. With this configuration and supporting code, we can now disambiguate queries like:

        "Beatles Songs covered" - which are covers of Beatles songs by other artists from "songs Beatles covered" - which are songs performed by the Beatles that were written by other songwriters. Two test cases have been added to the patch to demonstrate these new features.

        Show
        tedsullivan Ted Sullivan added a comment - I have just uploaded a new patch that adds what I call "verb support" to the autofilter. This enables you to specify terms that will constrain the autofilter field choices. The example I have been using for this is a Music Ontology that I have been using to demonstrate the features of the query autofilter. So if I have records of musicians, songwriters, songs etc. with fields like performer_ss, composer_ss, composition_type_s and so on. If I search for songs written by Johnny Cash vs songs performed by Johnny Cash Without the verb support, the autofilter would pick up composition_type_s:Song and performer_ss:"Johnny Cash" OR composer_ss:"Johnny Cash" because this artist has documents in which he is either (or both) a performer and a songwriter. That is, both of these queries would return the same results because neither 'written' or 'performed' is a value in any document field. By adding configurations like this and some supporting code <searchComponent name="autofilter" class="org.apache.solr.handler.component.QueryAutoFilteringComponent" > <arr name="verbModifiers"> <str>written,wrote,composed:composer_ss</str> <str>performed,played,sang,recorded:performer_ss</str> </arr> </searchComponent> The above queries work as expected. The code detects the presence of the modifier in proximity to a term that occurs in the search field (for 'written' that would be 'composer_ss') and then collapses the choices to that field alone so (composer_ss:"Johnny Cash" OR performer_ss:"Johnny Cash") becomes just composer_ss:"Johnny Cash" when the verb is 'written' and performer_ss:"Johnny Cash" when the verb is 'performed'. In addition, noun phrases that are composed of two different nouns in which one acts as a qualifier of the other as in "Beatles Songs" are handled with this configuration: <str>covered,covers:performer_ss|version_s:Cover|original_performer_s: ENTITY ,recording_type_ss:Song=>original_performer_s: ENTITY </str> In this case, "Beatles Songs" is a single noun phrase that refers to songs written by one or more of the Beatles. With this configuration and supporting code, we can now disambiguate queries like: "Beatles Songs covered" - which are covers of Beatles songs by other artists from "songs Beatles covered" - which are songs performed by the Beatles that were written by other songwriters. Two test cases have been added to the patch to demonstrate these new features.
        Hide
        markus17 Markus Jelsma added a comment - - edited

        Hi Ted - i've read the code and your posts about this awesome feature but never really knew how to apply it in real world apps without a.o. the problem pointed out by jmlucjav. So regarding the current solution on that very problem; i feel it would introduce a cumbersome maintenance hazard to any shop or catalog site that has non trivial data, so probably most

        This solution would require very frequent maintenance for any index that is fueled by users or automatically, as new examples of said exceptions come and go and are not easy to spot.

        Is it not a simpler idea to detect these ambiguities and not filter, but emit the ambiguities to the ResponseBuilder so applications can deal with it? You have the control of the SearchComponent so you can let app developers ask the question to users, do you want the performer, or the writer?

        In any case, filtering is bad here, boosting both writer and performer may be another (additional) solution to deal with ambiguities. I fear labour intense maintenance yields this cool feature unusable.

        What do you think?
        M.

        Show
        markus17 Markus Jelsma added a comment - - edited Hi Ted - i've read the code and your posts about this awesome feature but never really knew how to apply it in real world apps without a.o. the problem pointed out by jmlucjav. So regarding the current solution on that very problem; i feel it would introduce a cumbersome maintenance hazard to any shop or catalog site that has non trivial data, so probably most This solution would require very frequent maintenance for any index that is fueled by users or automatically, as new examples of said exceptions come and go and are not easy to spot. Is it not a simpler idea to detect these ambiguities and not filter, but emit the ambiguities to the ResponseBuilder so applications can deal with it? You have the control of the SearchComponent so you can let app developers ask the question to users, do you want the performer, or the writer? In any case, filtering is bad here, boosting both writer and performer may be another (additional) solution to deal with ambiguities. I fear labour intense maintenance yields this cool feature unusable. What do you think? M.
        Hide
        tedsullivan Ted Sullivan added a comment - - edited

        Thanks Markus Jelsma - I agree that there is some maintenance involved maybe even a lot as you say but IMO, data curation is an important activity that is - or should be - ongoing - i.e. to avoid the "garbage in garbage out" problem. So while I agree with you that maintenance is needed to improve precision, I would argue that this should be done anyway - and that any gains in precision are good things and should be approached in an incremental fashion. Search can expose data quality issues, so this maintenance would be ongoing as you say, but maybe where we disagree is in whether this is necessary or egregious.

        Note that the solution can be used in "filter" or "boost" mode - so although I call it query autofiltering - you can also use it as query autoboosting - the name I chose obviously indicates my bias however I prefer to make this a choice rather than to hard code it, so boosting is a configurable option that may be more satisfying to users that have data problems like you describe. In this case, any improvement in relevance is a win. Filter mode is the default mode as fits my bias but it is not set in stone

        Emitting the ambiguities may be a possibility too - are you thinking that this would be like a "did you mean" where the component would suggest more precise query contexts - i.e. give the user the chance to agree to what the autofilter detects? This could work but may suffer from the "too many clicks" objection. With clean data, I don't see the need for this but as you say, data is rarely clean. The question is can some of the necessary cleanup be done automatically during indexing or does it require manual interventions which would be difficult to scale?

        As you say, the SearchComponent has access to the ResponseBuilder so that is definitely an option that we could add to the code - add the re-written query to the response as a suggestion but don't execute it. That would give the user more flexibility in applying the solution.

        Show
        tedsullivan Ted Sullivan added a comment - - edited Thanks Markus Jelsma - I agree that there is some maintenance involved maybe even a lot as you say but IMO, data curation is an important activity that is - or should be - ongoing - i.e. to avoid the "garbage in garbage out" problem. So while I agree with you that maintenance is needed to improve precision, I would argue that this should be done anyway - and that any gains in precision are good things and should be approached in an incremental fashion. Search can expose data quality issues, so this maintenance would be ongoing as you say, but maybe where we disagree is in whether this is necessary or egregious. Note that the solution can be used in "filter" or "boost" mode - so although I call it query autofiltering - you can also use it as query autoboosting - the name I chose obviously indicates my bias however I prefer to make this a choice rather than to hard code it, so boosting is a configurable option that may be more satisfying to users that have data problems like you describe. In this case, any improvement in relevance is a win. Filter mode is the default mode as fits my bias but it is not set in stone Emitting the ambiguities may be a possibility too - are you thinking that this would be like a "did you mean" where the component would suggest more precise query contexts - i.e. give the user the chance to agree to what the autofilter detects? This could work but may suffer from the "too many clicks" objection. With clean data, I don't see the need for this but as you say, data is rarely clean. The question is can some of the necessary cleanup be done automatically during indexing or does it require manual interventions which would be difficult to scale? As you say, the SearchComponent has access to the ResponseBuilder so that is definitely an option that we could add to the code - add the re-written query to the response as a suggestion but don't execute it. That would give the user more flexibility in applying the solution.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1c32cc9a49d8e692949f92a8a517300e498e1a55 in lucene-solr's branch refs/heads/branch_6x from Dawid Weiss
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1c32cc9 ]

        SOLR-7539: Upgrade the clustering plugin to Carrot2 3.15.0.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1c32cc9a49d8e692949f92a8a517300e498e1a55 in lucene-solr's branch refs/heads/branch_6x from Dawid Weiss [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1c32cc9 ] SOLR-7539 : Upgrade the clustering plugin to Carrot2 3.15.0.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 401d77485d2b0759c85ea537f545fd02c7b9b11e in lucene-solr's branch refs/heads/master from Dawid Weiss
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=401d774 ]

        SOLR-7539: Upgrade the clustering plugin to Carrot2 3.15.0.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 401d77485d2b0759c85ea537f545fd02c7b9b11e in lucene-solr's branch refs/heads/master from Dawid Weiss [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=401d774 ] SOLR-7539 : Upgrade the clustering plugin to Carrot2 3.15.0.

          People

          • Assignee:
            Unassigned
            Reporter:
            tedsullivan Ted Sullivan
          • Votes:
            10 Vote for this issue
            Watchers:
            17 Start watching this issue

            Dates

            • Created:
              Updated:

              Development