Solr
  1. Solr
  2. SOLR-247

Allow facet.field=* to facet on all fields (without knowing what they are)

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I don't know if this is a good idea to include – it is potentially a bad idea to use it, but that can be ok.

      This came out of trying to use faceting for the LukeRequestHandler top term collecting.
      http://www.nabble.com/Luke-request-handler-issue-tf3762155.html

      1. SOLR-247.patch
        3 kB
        Lars Kotthoff
      2. SOLR-247.patch
        3 kB
        Lars Kotthoff
      3. SOLR-247.patch
        31 kB
        Lars Kotthoff
      4. SOLR-247-FacetAllFields.patch
        1.0 kB
        Ryan McKinley

        Issue Links

          Activity

          Ryan McKinley created issue -
          Ryan McKinley made changes -
          Field Original Value New Value
          Attachment SOLR-247-FacetAllFields.patch [ 12358009 ]
          Hide
          Erik Hatcher added a comment -

          I can see value in supporting the dynamicField wildcard syntax, so *_facet would work. In fact, maybe that'd be a good syntax to support for all fl-like parameters too.

          • scares me, and it'd certainly be discouraged for anything but small indexes! But of course I don't have to use it.
          Show
          Erik Hatcher added a comment - I can see value in supporting the dynamicField wildcard syntax, so *_facet would work. In fact, maybe that'd be a good syntax to support for all fl-like parameters too. scares me, and it'd certainly be discouraged for anything but small indexes! But of course I don't have to use it.
          Hide
          Hoss Man added a comment - - edited

          I have a really hard time imagining anything but the most trivial use cases for facet.field=* ... it doesn't really sime like a problem in need of a solution.

          with somehting like fl=*, we're only talking about stored fields ... storing a field makes no sense unless you plan on returning it in the field list some of the time, so fl=* makes sense as a "return all of hte fields that are possible to return" option.

          There are lots of reasons why a field might be indexed though, so faceting on every indexed field doesn't seem like it would ever make sense.

          in my opinion a "best practice" is not to use fl=* unless you are debugging anyway, otherwise you find yourself getting slammed with large amounts of data you don't want as the index evolves over time ... something like facet.field=* would be worse because it's not just the amount of data getting returned that would increase, but the amount of computation (and time and poor cache performance) that would spike as well.

          if we do this, i would think it only makes sense to generalize the use of "*" in both fl and facet.field into a true glob style syntax, so we can at least encourage people who want this type of syntax to use a naming convention to help limit how much they hurt themselves.

          (i have no problem giving people enough rope to hang themselves, but we shouldn't tie a noose in the rope before we give it to them)

          Show
          Hoss Man added a comment - - edited I have a really hard time imagining anything but the most trivial use cases for facet.field=* ... it doesn't really sime like a problem in need of a solution. with somehting like fl=* , we're only talking about stored fields ... storing a field makes no sense unless you plan on returning it in the field list some of the time, so fl=* makes sense as a "return all of hte fields that are possible to return" option. There are lots of reasons why a field might be indexed though, so faceting on every indexed field doesn't seem like it would ever make sense. in my opinion a "best practice" is not to use fl=* unless you are debugging anyway, otherwise you find yourself getting slammed with large amounts of data you don't want as the index evolves over time ... something like facet.field=* would be worse because it's not just the amount of data getting returned that would increase, but the amount of computation (and time and poor cache performance) that would spike as well. if we do this, i would think it only makes sense to generalize the use of "*" in both fl and facet.field into a true glob style syntax, so we can at least encourage people who want this type of syntax to use a naming convention to help limit how much they hurt themselves. (i have no problem giving people enough rope to hang themselves, but we shouldn't tie a noose in the rope before we give it to them)
          Hide
          Ryan McKinley added a comment -

          >
          > There are lots of reasons why a field might be indexed though, so faceting on every indexed field doesn't seem like it would ever make sense.
          >

          agreed, but *_facet would be useful

          >
          > if we do this, i would think it only makes sense to generalize the use of "*" in both fl and facet.field into a true glob style syntax

          One issue is that fl=XXX is typically a field list separated with "," or "|", facet.field expects each field as a separate parameter.

          Show
          Ryan McKinley added a comment - > > There are lots of reasons why a field might be indexed though, so faceting on every indexed field doesn't seem like it would ever make sense. > agreed, but *_facet would be useful > > if we do this, i would think it only makes sense to generalize the use of "*" in both fl and facet.field into a true glob style syntax One issue is that fl=XXX is typically a field list separated with "," or "|", facet.field expects each field as a separate parameter.
          Hide
          Hoss Man added a comment -

          see some follow up comments in the mailing lists...

          http://www.nabble.com/forum/Search.jtp?forum=14479&local=y&query=SOLR-247

          in a nut shell, i think this issue can be resolved won't fix ... but i'm not opposed to leaving open if someone wants to work on it. there are ways for people to configure solr so that all the fields they want to facet on are faceted on by defualt (when configuring the requestHanlder) which is safer then wild carding.

          Show
          Hoss Man added a comment - see some follow up comments in the mailing lists... http://www.nabble.com/forum/Search.jtp?forum=14479&local=y&query=SOLR-247 in a nut shell, i think this issue can be resolved won't fix ... but i'm not opposed to leaving open if someone wants to work on it. there are ways for people to configure solr so that all the fields they want to facet on are faceted on by defualt (when configuring the requestHanlder) which is safer then wild carding.
          Hide
          Pieter Berkel added a comment -

          Some recent discussion on this topic:

          http://www.nabble.com/Structured-Lucene-documents-tf4234661.html

          I get the impression that general wildcard syntax support for field listing parameters (i.e. the reverse of dynamic fields) as described in the above thread would be far more useful than a simple '*' match-anything syntax (not only in faceting but other cases like hl.fl and perhaps even mlt.fl).

          I haven't really considered the performance issues of this approach however, as it would involve checking each field supplied in the parameter for '*' before expanding it into full field names for every query.

          Given the above, the fact that it could be used across multiple response handlers and subhandlers like SimpleFacets & Highlighting, and that it would require access to IndexReader to getFieldNames(), where might be the most sensible place to put this code?

          Show
          Pieter Berkel added a comment - Some recent discussion on this topic: http://www.nabble.com/Structured-Lucene-documents-tf4234661.html I get the impression that general wildcard syntax support for field listing parameters (i.e. the reverse of dynamic fields) as described in the above thread would be far more useful than a simple '*' match-anything syntax (not only in faceting but other cases like hl.fl and perhaps even mlt.fl). I haven't really considered the performance issues of this approach however, as it would involve checking each field supplied in the parameter for '*' before expanding it into full field names for every query. Given the above, the fact that it could be used across multiple response handlers and subhandlers like SimpleFacets & Highlighting, and that it would require access to IndexReader to getFieldNames(), where might be the most sensible place to put this code?
          Hide
          Matthew Runo added a comment - - edited

          http://www.nabble.com/Dynamic-fields---Facets-to14739422.html

          also provides a use case for this to be fixed. While I'd never do a facet on the wildcard, I'd love to be able to do one on attribute_<wildcard>. It just makes using the dynamic fields so much easier.

          Show
          Matthew Runo added a comment - - edited http://www.nabble.com/Dynamic-fields---Facets-to14739422.html also provides a use case for this to be fixed. While I'd never do a facet on the wildcard, I'd love to be able to do one on attribute_<wildcard>. It just makes using the dynamic fields so much easier.
          Hide
          Hoss Man added a comment -

          i've put soem thoughts on the broader issues of having solr admin control over how field names are dealt with (globs, regexes, aliasing, etc...) in various contexts on the wiki...

          http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams

          ...it might be best to use that as a whiteboard for a design discussion since the ultimate issues are a little bigger then this issue originally set out to tackle.

          Show
          Hoss Man added a comment - i've put soem thoughts on the broader issues of having solr admin control over how field names are dealt with (globs, regexes, aliasing, etc...) in various contexts on the wiki... http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams ...it might be best to use that as a whiteboard for a design discussion since the ultimate issues are a little bigger then this issue originally set out to tackle.
          Lars Kotthoff made changes -
          Link This issue relates to SOLR-540 [ SOLR-540 ]
          Hide
          Lars Kotthoff added a comment -

          Attaching patch which implements support for wildcards in facet field specifications similar to SOLR-540. If the facet field specification contains an asterisk, every indexed field the reader knows about is matched against the corresponding regular expression.

          Note that the unit tests part of the patch sort of depends on SOLR-645. When applied to the current trunk it will create the new facets test file with all the old tests plus the new ones. This doesn't cause anything to not work anymore, but duplicates the old tests. I can provide a new patch either against the current trunk or against the trunk with SOLR-645 committed, whichever is required.

          Show
          Lars Kotthoff added a comment - Attaching patch which implements support for wildcards in facet field specifications similar to SOLR-540 . If the facet field specification contains an asterisk, every indexed field the reader knows about is matched against the corresponding regular expression. Note that the unit tests part of the patch sort of depends on SOLR-645 . When applied to the current trunk it will create the new facets test file with all the old tests plus the new ones. This doesn't cause anything to not work anymore, but duplicates the old tests. I can provide a new patch either against the current trunk or against the trunk with SOLR-645 committed, whichever is required.
          Lars Kotthoff made changes -
          Attachment SOLR-247.patch [ 12386605 ]
          Lars Kotthoff made changes -
          Link This issue is related to SOLR-645 [ SOLR-645 ]
          Lars Kotthoff made changes -
          Link This issue relates to SOLR-650 [ SOLR-650 ]
          Hide
          Lars Kotthoff added a comment -

          Attaching new patch which applies to current TRUNK.

          Show
          Lars Kotthoff added a comment - Attaching new patch which applies to current TRUNK.
          Lars Kotthoff made changes -
          Attachment SOLR-247.patch [ 12386846 ]
          Hide
          Lars Kotthoff added a comment -

          Syncing patch with trunk.

          Show
          Lars Kotthoff added a comment - Syncing patch with trunk.
          Lars Kotthoff made changes -
          Attachment SOLR-247.patch [ 12397553 ]
          Hide
          Shalin Shekhar Mangar added a comment -

          Lars, I see you have been updating the patches to trunk diligently. However, I'm not sure if there is a consensus on adding this without having a glob like feature in place.

          Do you have a use-case in mind which can be solved only with the current patch?

          Show
          Shalin Shekhar Mangar added a comment - Lars, I see you have been updating the patches to trunk diligently. However, I'm not sure if there is a consensus on adding this without having a glob like feature in place. Do you have a use-case in mind which can be solved only with the current patch?
          Hide
          Lars Kotthoff added a comment -

          Off the top of my head, having an automated feed parser which adds fields and facet_field to facet on. I agree that all this should be part of a global glob-like thing, but that would probably only apply to the part which parses the parameters anyway. How a glob is matched depends on the type of glob (i.e. whether the field is indexed/stored/... and we want to facet/highlight/...).

          If people start using it and it turns out to be important, it can always be refactored into something more general. If nobody uses globbing, there'd be no need to invest the effort of making it general

          Show
          Lars Kotthoff added a comment - Off the top of my head, having an automated feed parser which adds fields and facet_field to facet on. I agree that all this should be part of a global glob-like thing, but that would probably only apply to the part which parses the parameters anyway. How a glob is matched depends on the type of glob (i.e. whether the field is indexed/stored/... and we want to facet/highlight/...). If people start using it and it turns out to be important, it can always be refactored into something more general. If nobody uses globbing, there'd be no need to invest the effort of making it general
          Hide
          Avlesh Singh added a comment -

          I haven't tested this patch yet. But my belief is that the primary objective should be to support dynamic fields than pure wildcard field names. Dynamic fields offer wide range of capabilities with w.r.t key-value(s) kind of data. Most of the times people use such fields because the keys are not known upfront.

          If nothing more, this patch should at least cater to that audience.

          Show
          Avlesh Singh added a comment - I haven't tested this patch yet. But my belief is that the primary objective should be to support dynamic fields than pure wildcard field names. Dynamic fields offer wide range of capabilities with w.r.t key-value(s) kind of data. Most of the times people use such fields because the keys are not known upfront. If nothing more, this patch should at least cater to that audience.
          Hide
          Jan Høydahl added a comment -

          Seems like there has not been much demand for this the last 4 years Could this not be a nice task to do at the same time as SOLR-650 ?

          SPRING_CLEANING_2013

          Show
          Jan Høydahl added a comment - Seems like there has not been much demand for this the last 4 years Could this not be a nice task to do at the same time as SOLR-650 ? SPRING_CLEANING_2013
          Jan Høydahl made changes -
          Labels beginners
          Jan Høydahl made changes -
          Labels beginners beginners newdev
          Hide
          Erick Erickson added a comment -

          My first reaction to this is that while it might have some limited use-cases with small indexes, as soon as one went to any decent size corpus it'd blow memory up. Not sure it's worth the effort, but I could be convinced otherwise...

          SOLR-650 seems something of a separate issue, it's much more controlled. That said, they're both really about now to specify the list of fields for faceting, so you're right in that they're part of the same concept....

          Show
          Erick Erickson added a comment - My first reaction to this is that while it might have some limited use-cases with small indexes, as soon as one went to any decent size corpus it'd blow memory up. Not sure it's worth the effort, but I could be convinced otherwise... SOLR-650 seems something of a separate issue, it's much more controlled. That said, they're both really about now to specify the list of fields for faceting, so you're right in that they're part of the same concept....
          Hide
          Jan Høydahl added a comment -

          I agree it's a terrible idea for anything production, but for discovery it could be nice. I often throw "unknown" data into an index with a catch-all <dynamicField name="*" type="string"/> kind of config, and then find myself specifying a lot of facet.field's to introspect what's in the various fields. For pure dev purposes it'd be a nice shortcut. So for me it can live as a newdev issue for still some time...

          Show
          Jan Høydahl added a comment - I agree it's a terrible idea for anything production, but for discovery it could be nice. I often throw "unknown" data into an index with a catch-all <dynamicField name="*" type="string"/> kind of config, and then find myself specifying a lot of facet.field's to introspect what's in the various fields. For pure dev purposes it'd be a nice shortcut. So for me it can live as a newdev issue for still some time...
          Hide
          Gowtham Gutha added a comment -

          Why doesn't it accept wildcards. So, that when creating the schema.xml, I will be including the faceted fields with a suffix to identify them as facet fields.

          This would be great and even can be fixed.

          http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=*_facet&facet.mincount=1

          Show
          Gowtham Gutha added a comment - Why doesn't it accept wildcards. So, that when creating the schema.xml , I will be including the faceted fields with a suffix to identify them as facet fields. This would be great and even can be fixed. http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=*_facet&facet.mincount=1
          Hide
          Jack Krupansky added a comment -

          The earlier commentary clearly lays out that the primary concern is that it would be a performance nightmare, but... that does depend on your particular use case.

          Personally, I would say to go forward with adding this feature, but with a clear documentation caveat that this feature "should be use with great care since it is likely to be extremely memory and performance intensive and more of a development testing tool than a production feature, although it could have value when wildcard patterns are crafted with care for a very limited number of fields."

          Show
          Jack Krupansky added a comment - The earlier commentary clearly lays out that the primary concern is that it would be a performance nightmare, but... that does depend on your particular use case. Personally, I would say to go forward with adding this feature, but with a clear documentation caveat that this feature "should be use with great care since it is likely to be extremely memory and performance intensive and more of a development testing tool than a production feature, although it could have value when wildcard patterns are crafted with care for a very limited number of fields."
          Hide
          Philip Willoughby added a comment -

          We have a concrete use-case for which this facility is required. We have a requirement to add arbitrary tags in arbitrary groups to products, and to be able to filter by those tags in the same way as you can filter our documents by more-structured attributes (e.g. price, discount, size, designer, etc). The semantics we want are to ignore the filter on property X when faceting property X.

          With our known-in-advance fields this is easy: taking the example of designers we add an fq=

          {!tag=did}

          designer_id:## for filtering and add facet.field=

          {!ex=did}

          designer_id when looking for designer facets.

          With these unknown-in-advance fields it is hard: what we had hoped to do was use facet.field=arbitrary_tag_* to generate the tag group facets and then if someone filters to group X=Y we'd add fq=

          {!tag=atX}

          arbitrary_tag_X:Y for the filter and pass facet.field=

          {!ex=atX}

          arbitrary_tag_X to get the facets. Of course in this case we would also want to pass facet.field=arbitrary_tag_* to get the facets over the other tags which means faceting arbitrary_tag_X twice, and creates a precedence problem.

          We want, I think, facet.field=arbitrary_tag_* to work, but to be disregarded for any field it would otherwise match which is explicitly named as a facet.field

          The other model we have considered is to combine every group and tag into a string like "group\u001Ftag", put them all into a field named tags and facet over that. But this means that we can't disregard the filters over group X when faceting group X while respecting them while faceting group Y etc without making multiple queries.

          Show
          Philip Willoughby added a comment - We have a concrete use-case for which this facility is required. We have a requirement to add arbitrary tags in arbitrary groups to products, and to be able to filter by those tags in the same way as you can filter our documents by more-structured attributes (e.g. price, discount, size, designer, etc). The semantics we want are to ignore the filter on property X when faceting property X. With our known-in-advance fields this is easy: taking the example of designers we add an fq= {!tag=did} designer_id:## for filtering and add facet.field= {!ex=did} designer_id when looking for designer facets. With these unknown-in-advance fields it is hard: what we had hoped to do was use facet.field=arbitrary_tag_* to generate the tag group facets and then if someone filters to group X=Y we'd add fq= {!tag=atX} arbitrary_tag_X:Y for the filter and pass facet.field= {!ex=atX} arbitrary_tag_X to get the facets. Of course in this case we would also want to pass facet.field=arbitrary_tag_* to get the facets over the other tags which means faceting arbitrary_tag_X twice, and creates a precedence problem. We want, I think, facet.field=arbitrary_tag_* to work, but to be disregarded for any field it would otherwise match which is explicitly named as a facet.field The other model we have considered is to combine every group and tag into a string like "group\u001Ftag", put them all into a field named tags and facet over that. But this means that we can't disregard the filters over group X when faceting group X while respecting them while faceting group Y etc without making multiple queries.
          Hide
          Erick Erickson added a comment -

          If you had all the arbitrary_tag_* fields, could you construct the proper query programmatically? Because you can get the list of all fields that are actually used by any indexed document as opposed to the fields defined in the schema. That's what allows the admin/schema browser to display its drop-down.

          It's probably unlikely that this functionality will be incorporated in Solr as per this JIRA based on the fact that no real action has happened on it for 6 years.

          Show
          Erick Erickson added a comment - If you had all the arbitrary_tag_* fields, could you construct the proper query programmatically? Because you can get the list of all fields that are actually used by any indexed document as opposed to the fields defined in the schema. That's what allows the admin/schema browser to display its drop-down. It's probably unlikely that this functionality will be incorporated in Solr as per this JIRA based on the fact that no real action has happened on it for 6 years.
          Hide
          Philip Willoughby added a comment -

          Erick Erickson Yes, we could do that. We don't use the schema browser on this core because it crashes or locks up the browser. The underlying /admin/luke endpoint takes over 12 seconds to respond (with 20280 known fields already this is not surprising) so we wouldn't be able to meet our <100ms SLA without re-architecting our application so that it's no longer stateless, which is a big step we aren't willing to take.

          We are working around this by using both indexing approaches I outlined above and mixing the facets together correctly in application logic.

          Show
          Philip Willoughby added a comment - Erick Erickson Yes, we could do that. We don't use the schema browser on this core because it crashes or locks up the browser. The underlying /admin/luke endpoint takes over 12 seconds to respond (with 20280 known fields already this is not surprising) so we wouldn't be able to meet our <100ms SLA without re-architecting our application so that it's no longer stateless, which is a big step we aren't willing to take. We are working around this by using both indexing approaches I outlined above and mixing the facets together correctly in application logic.
          Hide
          Erick Erickson added a comment -

          < 100 ms SLAs would be hard to meet if you wind up faceting on very many fields in the first place, so I'm not quite sure how this JIRA would solve your problem. Generally having that many fields indicates some design alternatives should be explored...

          FWIW

          Show
          Erick Erickson added a comment - < 100 ms SLAs would be hard to meet if you wind up faceting on very many fields in the first place, so I'm not quite sure how this JIRA would solve your problem. Generally having that many fields indicates some design alternatives should be explored... FWIW

            People

            • Assignee:
              Unassigned
              Reporter:
              Ryan McKinley
            • Votes:
              10 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:

                Development