Solr
  1. Solr
  2. SOLR-106

new facet params: facet.sort, facet.mincount, facet.offset

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: search
    • Labels:
      None

      Description

      a couple of new facet params:
      facet lists become pageable with facet.offset, facet.limit (idea from Erik)
      facet.sort explicitly specifies sort order (true for count descending, false for natural index order)
      facet.mincount: minimum count for facets included in response (idea from JJ, deprecate zeros)

        Activity

        Hide
        Yonik Seeley added a comment -

        Attached untested code... I don't want to bother testing + adding test code w/o feedback on direction.

        Changes included in this code:

        • facet.offset, facet.sort, facet.mincount support (detailed above)
        • avoid temporary HashMap when sort=false, add directly to response
        • for sorted case, keep track of min element in the queue to avoid needlessly adding pairs
        Show
        Yonik Seeley added a comment - Attached untested code... I don't want to bother testing + adding test code w/o feedback on direction. Changes included in this code: facet.offset, facet.sort, facet.mincount support (detailed above) avoid temporary HashMap when sort=false, add directly to response for sorted case, keep track of min element in the queue to avoid needlessly adding pairs
        Hide
        Andreas Hochsteger added a comment -

        What about a facet filter query to only return facets which match a certain query?

        In our case we use categories which are organized hierarchically via a path syntax.
        Documets have different categories attached which is indexed by a solr field.

        Now certain apps (which deal with these documets) are interested in a specific category subpath.
        If you use faceted searching you get all categories back - not only those in which the application is interested in.

        Currently we use a workaround which uses an additional solr search field that just contains the categories for the application, but this doesn't scale very well.
        Would it be possible to add an additional filter query for facets to limit the facets which are actually returned in the faceted search?

        Thoughts?

        Show
        Andreas Hochsteger added a comment - What about a facet filter query to only return facets which match a certain query? In our case we use categories which are organized hierarchically via a path syntax. Documets have different categories attached which is indexed by a solr field. Now certain apps (which deal with these documets) are interested in a specific category subpath. If you use faceted searching you get all categories back - not only those in which the application is interested in. Currently we use a workaround which uses an additional solr search field that just contains the categories for the application, but this doesn't scale very well. Would it be possible to add an additional filter query for facets to limit the facets which are actually returned in the faceted search? Thoughts?
        Hide
        Hoss Man added a comment -

        Yonik: i think all of these params make sense. i only skimmed the code cahnges breifly but they look sound to me.

        Andreas: there was some discussion on the list about being able to specify a "prefix" that facet field values must match to be listed (which would trivial and efficient for TermEnum based facets, and doable for FieldCache facets) ... that makes a lot of sense to me in a "type ahead" word completion type of application (like google suggest) but the situation you describe doesn't relaly make sense to me – if your client is only interested in documents in certain categories, then don't you want to filter the documents by do just those categories (at which point the facets will also be filtered) ... can you start a thread on solr-user describing your situation/scaling issue?

        Show
        Hoss Man added a comment - Yonik: i think all of these params make sense. i only skimmed the code cahnges breifly but they look sound to me. Andreas: there was some discussion on the list about being able to specify a "prefix" that facet field values must match to be listed (which would trivial and efficient for TermEnum based facets, and doable for FieldCache facets) ... that makes a lot of sense to me in a "type ahead" word completion type of application (like google suggest) but the situation you describe doesn't relaly make sense to me – if your client is only interested in documents in certain categories, then don't you want to filter the documents by do just those categories (at which point the facets will also be filtered) ... can you start a thread on solr-user describing your situation/scaling issue?
        Hide
        J.J. Larrea added a comment -

        +1 on direction, and based on a quick scan of the patch.

        It could be sort=asc|desc|none rather than true|false, but I'm not sure whether anyone would ever have a use for asc so it's probably not worth implementing.

        Of course extending to caching the facet tallies would dramatically speed paging. Perhaps both getFieldCacheCounts and getFacetTermEnumCounts should return a Collection, which could be a BoundedTreeSet when sorting or a List implementation when not, holding all the counts >= facet.mincount; then getTermCounts could centralize the paging and response creation and provide an object to (someday) cache. It would lose the mincount=0 optimization for getFieldCacheCounts, but how many users are really going to want mincounts=0 unless the list is small and non-sparse in which case the optimization isn't a big win anyway.

        Show
        J.J. Larrea added a comment - +1 on direction, and based on a quick scan of the patch. It could be sort=asc|desc|none rather than true|false, but I'm not sure whether anyone would ever have a use for asc so it's probably not worth implementing. Of course extending to caching the facet tallies would dramatically speed paging. Perhaps both getFieldCacheCounts and getFacetTermEnumCounts should return a Collection, which could be a BoundedTreeSet when sorting or a List implementation when not, holding all the counts >= facet.mincount; then getTermCounts could centralize the paging and response creation and provide an object to (someday) cache. It would lose the mincount=0 optimization for getFieldCacheCounts, but how many users are really going to want mincounts=0 unless the list is small and non-sparse in which case the optimization isn't a big win anyway.
        Hide
        Yonik Seeley added a comment -

        I did have a reservation about always guaranteeing a sort order... I wasn't sure if it would always be easy to maintain term-sort order in future implementations. If that were the case, maybe it should be sort=count|term|none... might be more future-flexible, but harder to remember than a boolean.

        Just curious, what are your usecases for facet paging, and what percent of facet queries would have an offset other than 0?

        Show
        Yonik Seeley added a comment - I did have a reservation about always guaranteeing a sort order... I wasn't sure if it would always be easy to maintain term-sort order in future implementations. If that were the case, maybe it should be sort=count|term|none... might be more future-flexible, but harder to remember than a boolean. Just curious, what are your usecases for facet paging, and what percent of facet queries would have an offset other than 0?
        Hide
        J.J. Larrea added a comment -

        Case for Facet Count Caching: Paging through the hitlist (as well as paging through the facet list). In some cases it appears that generating the facet counts takes much longer than generating the hitlist. And that's certainly the case when the hitlist is retrieved from cache.

        Case or Facet Paging: The UI design I'm doing back-end for has a list of facets with 5 top values each, and a "More..." link when there are indeed more than 5 facet values. Traversing that link is supposed to show a page with all facet values which fit, and Prev and Next paging buttons to access those which don't. This browser shows counts and can be sorted by count but by default is sorted alphabetically by term. Next to each term is a checkbox; after browsing and checking, a button returns to the hitlist but adds a big OR of the checked terms as an fq. So for example if a user searches and gets 437 hits with rutabaga in the title, having 264 unique author names, they might want to browse the list looking for friends. Then after browsing and checking they can see a hitlist of all articles written by friends with rutabage in the title.

        I don't have any idea what the proportion of facet queries would have offset > 0 e.g. where the user has moved to the next page, but I assume it's non-rare.

        It occurs to me that facet.limit should NOT do double-duty for paging: In a world where facet counts are cached, facet.limit should continue to play its current role, and limit the number of ranked values that make it into the BoundedTreeSet and thus the cache. Then facet.offset and facet.count could be used to return a subset. facet.limit==0 --> no limit, but can still be paged.

        Case for pulling response generation out of getFieldCacheCounts and getFacetTermEnumCounts: I (truly) have a 37 million document index which I need to facet on Author, of which there are millions. The TermEnum algorithm is clearly unsuited, and the FieldCache algorithm requires an inordinate amount of memory; I had to disable it. So rather than tell management "can't be done", I think I need to plug in at least one more algorithm, e.g. using TermFreqVectors, to SimpleFacets. Would love not to have to replicate the response generation code.

        Or the sorting code. Just had an idea: It would be even nicer if the counting logic could be passed some object, say an implementation of TermCountRecorder, which has an add(String term, int count) method.

        • That object would encapsulate and isolate the generation of CountPair objects, the filtering for mincount, and whatever varieties of sorting are supported.
        • Rather than have one object with multiple pathways e.g. for term vs. count vs. no sorting, a static factory method could take the field, sort, and mincount arguments and return an anonymous implementation based on a List or a TreeSet or whatever.
        • The factory could also be told whether the counting logic guarantees adding terms in term (index) order, and if not but if term order were requested it could return an implementation which sorts by term text, otherwise a simple List.
        • It could be the object that gets cached for that query for that field.
        • It could have a generateResponse(offset, count) method which generates the <list name="<facetfield>">
        • It could optimize memory when multiple TermCountRecorders corresponding to different queries are cached for a field, by maintaining a single WeakHashMap of term strings for the field, so each TermCountRecorder with the same term has a pointer to the same String object – essentially like String.intern() but the scope is the field and the master value would disappear once all cached TermCountRecorders referencing it disappear.
        • It would make life much easier for a faceting approache where rather than iterating field->document it might be more efficient to iterate document->field (e.g. TermFreqVectors?): A TermCountRecorder could be allocated for each faceting field using that algorithm and have add(...) called in a round-robin fashion as documents are iterated. At the end all could be added to the cache and, whether added or retrieved, would have generateResponse called.
        Show
        J.J. Larrea added a comment - Case for Facet Count Caching: Paging through the hitlist (as well as paging through the facet list). In some cases it appears that generating the facet counts takes much longer than generating the hitlist. And that's certainly the case when the hitlist is retrieved from cache. Case or Facet Paging: The UI design I'm doing back-end for has a list of facets with 5 top values each, and a "More..." link when there are indeed more than 5 facet values. Traversing that link is supposed to show a page with all facet values which fit, and Prev and Next paging buttons to access those which don't. This browser shows counts and can be sorted by count but by default is sorted alphabetically by term. Next to each term is a checkbox; after browsing and checking, a button returns to the hitlist but adds a big OR of the checked terms as an fq. So for example if a user searches and gets 437 hits with rutabaga in the title, having 264 unique author names, they might want to browse the list looking for friends. Then after browsing and checking they can see a hitlist of all articles written by friends with rutabage in the title. I don't have any idea what the proportion of facet queries would have offset > 0 e.g. where the user has moved to the next page, but I assume it's non-rare. It occurs to me that facet.limit should NOT do double-duty for paging: In a world where facet counts are cached, facet.limit should continue to play its current role, and limit the number of ranked values that make it into the BoundedTreeSet and thus the cache. Then facet.offset and facet.count could be used to return a subset. facet.limit==0 --> no limit, but can still be paged. Case for pulling response generation out of getFieldCacheCounts and getFacetTermEnumCounts: I (truly) have a 37 million document index which I need to facet on Author, of which there are millions. The TermEnum algorithm is clearly unsuited, and the FieldCache algorithm requires an inordinate amount of memory; I had to disable it. So rather than tell management "can't be done", I think I need to plug in at least one more algorithm, e.g. using TermFreqVectors, to SimpleFacets. Would love not to have to replicate the response generation code. Or the sorting code. Just had an idea: It would be even nicer if the counting logic could be passed some object, say an implementation of TermCountRecorder, which has an add(String term, int count) method. That object would encapsulate and isolate the generation of CountPair objects, the filtering for mincount, and whatever varieties of sorting are supported. Rather than have one object with multiple pathways e.g. for term vs. count vs. no sorting, a static factory method could take the field, sort, and mincount arguments and return an anonymous implementation based on a List or a TreeSet or whatever. The factory could also be told whether the counting logic guarantees adding terms in term (index) order, and if not but if term order were requested it could return an implementation which sorts by term text, otherwise a simple List. It could be the object that gets cached for that query for that field. It could have a generateResponse(offset, count) method which generates the <list name="<facetfield>"> It could optimize memory when multiple TermCountRecorders corresponding to different queries are cached for a field, by maintaining a single WeakHashMap of term strings for the field, so each TermCountRecorder with the same term has a pointer to the same String object – essentially like String.intern() but the scope is the field and the master value would disappear once all cached TermCountRecorders referencing it disappear. It would make life much easier for a faceting approache where rather than iterating field->document it might be more efficient to iterate document->field (e.g. TermFreqVectors?): A TermCountRecorder could be allocated for each faceting field using that algorithm and have add(...) called in a round-robin fashion as documents are iterated. At the end all could be added to the cache and, whether added or retrieved, would have generateResponse called.
        Hide
        Yonik Seeley added a comment -

        committed, with tests and other changes.
        I'll work on updating docs shortly.

        Show
        Yonik Seeley added a comment - committed, with tests and other changes. I'll work on updating docs shortly.
        Hide
        Yonik Seeley added a comment -

        Thanks for the info JJ... didn't see your update untill after I committed this (I'm running a bit behind all the solr traffic

        > Case for Facet Count Caching: Paging through the hitlist

        Hmmm, yes that would be good for a more stateless client. Even more efficient would be to recognize in the client that since you are only changing a page in the hitlist, the facets won't change (and hence don't re-query).

        > It occurs to me that facet.limit should NOT do double-duty for paging:

        Or, it should only be used for paging, specifying the number to be returned. The BoundedTreeSet size and caching are an implementation detail and shouldn't be in the API unless really necessary. If it matters in the future, we could add a hint specifying how much "extra" should be computed.

        > Case for pulling response generation out of getFieldCacheCounts and getFacetTermEnumCounts

        Sure, makes sense. Don't view the current facet code as "done"... I have a lot of little ideas on how to make it better, esp for cases like faceting on author.

        > TermFreqVectors
        Regarding this, do you have any performance data on it... my assumption was that it would be too slow for a large number of hits. Perhaps still a good option to have if the number of hits are small and the fieldcache isn't an option though.

        > Just had an idea: It would be even nicer if the counting logic could be passed some object,
        Yup, separating those things was on my todo list.

        Show
        Yonik Seeley added a comment - Thanks for the info JJ... didn't see your update untill after I committed this (I'm running a bit behind all the solr traffic > Case for Facet Count Caching: Paging through the hitlist Hmmm, yes that would be good for a more stateless client. Even more efficient would be to recognize in the client that since you are only changing a page in the hitlist, the facets won't change (and hence don't re-query). > It occurs to me that facet.limit should NOT do double-duty for paging: Or, it should only be used for paging, specifying the number to be returned. The BoundedTreeSet size and caching are an implementation detail and shouldn't be in the API unless really necessary. If it matters in the future, we could add a hint specifying how much "extra" should be computed. > Case for pulling response generation out of getFieldCacheCounts and getFacetTermEnumCounts Sure, makes sense. Don't view the current facet code as "done"... I have a lot of little ideas on how to make it better, esp for cases like faceting on author. > TermFreqVectors Regarding this, do you have any performance data on it... my assumption was that it would be too slow for a large number of hits. Perhaps still a good option to have if the number of hits are small and the fieldcache isn't an option though. > Just had an idea: It would be even nicer if the counting logic could be passed some object, Yup, separating those things was on my todo list.
        Hide
        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.2

        The Fix Version for all 39 issues found was set to 1.2, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman2

        Show
        Hoss Man added a comment - This bug was modified as part of a bulk update using the criteria... Marked ("Resolved" or "Closed") and "Fixed" Had no "Fix Version" versions Was listed in the CHANGES.txt for 1.2 The Fix Version for all 39 issues found was set to 1.2, email notification was suppressed to prevent excessive email. For a list of all the issues modified, search jira comments for this (hopefully) unique string: 20080415hossman2

          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development