Solr
  1. Solr
  2. SOLR-764

Support facet.sort=false (index order) with distributed search

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: None
    • Labels:
      None

      Description

      Distributed search does not currently support sorting facets by index order (facet.sort=false).

      1. SOLR-764.patch
        12 kB
        Yonik Seeley
      2. SOLR-764.patch
        3 kB
        Yonik Seeley
      3. SOLR-764.patch
        6 kB
        Lars Kotthoff
      4. SOLR-764.patch
        6 kB
        Lars Kotthoff

        Activity

        Hide
        Yonik Seeley added a comment -

        committed.

        Show
        Yonik Seeley added a comment - committed.
        Hide
        Yonik Seeley added a comment -

        Revised patch that switches "lex" to "index" to avoid confusion about which lex (the invisible indexed token, or the human readable label).

        TODO: update wiki when this patch is committed

        Show
        Yonik Seeley added a comment - Revised patch that switches "lex" to "index" to avoid confusion about which lex (the invisible indexed token, or the human readable label). TODO: update wiki when this patch is committed
        Hide
        Lars Kotthoff added a comment -

        When I said "do the right thing" I basically meant what your patch does already – sort according to the field type. I agree that "index" is a better name for it though.

        Show
        Lars Kotthoff added a comment - When I said "do the right thing" I basically meant what your patch does already – sort according to the field type. I agree that "index" is a better name for it though.
        Hide
        Yonik Seeley added a comment -

        "lex" originally made sense to me because it is index order, which is lexicographic for the indexed tokens (but not for the human-readable values).
        So an "integer" field will sort the terms as 0,100,2
        while an "sint" field will show 0,2,100

        It's not as simple to just "do the right thing" since it would carry a significant performance penalty (and that's something we shouldn't cover up or just automatically do).

        Perhaps a better name for "lex" would be "index"?

        Show
        Yonik Seeley added a comment - "lex" originally made sense to me because it is index order, which is lexicographic for the indexed tokens (but not for the human-readable values). So an "integer" field will sort the terms as 0,100,2 while an "sint" field will show 0,2,100 It's not as simple to just "do the right thing" since it would carry a significant performance penalty (and that's something we shouldn't cover up or just automatically do). Perhaps a better name for "lex" would be "index"?
        Hide
        Lars Kotthoff added a comment -

        Looks good. I still think that the (semantically) better solution would be to make it explicit with a new parameter – which was one of the reasons for changing facet.sort from true/false to count/lex.

        That said, we could also simply change "lex" to "field", meaning "sort it by field value" and Solr just does the right thing for the different field types.

        Show
        Lars Kotthoff added a comment - Looks good. I still think that the (semantically) better solution would be to make it explicit with a new parameter – which was one of the reasons for changing facet.sort from true/false to count/lex. That said, we could also simply change "lex" to "field", meaning "sort it by field value" and Solr just does the right thing for the different field types.
        Hide
        Yonik Seeley added a comment -

        How about this patch to fix the sort order ? It should make distributed search lex sort equal to what one would get for non-distributed search.

        Show
        Yonik Seeley added a comment - How about this patch to fix the sort order ? It should make distributed search lex sort equal to what one would get for non-distributed search.
        Hide
        Yonik Seeley added a comment -

        I would expect it to return something like 1, 10, 11, 2, 20, 3, 4, 5 etc. as this is the lexicographic order

        Not for a field like sint, which is manipulated so that the lexicographic order is equal to the numeric order.
        Looks like we need to do FieldType.toInternal() to go back to the indexed form before doing string compares.

        Show
        Yonik Seeley added a comment - I would expect it to return something like 1, 10, 11, 2, 20, 3, 4, 5 etc. as this is the lexicographic order Not for a field like sint, which is manipulated so that the lexicographic order is equal to the numeric order. Looks like we need to do FieldType.toInternal() to go back to the indexed form before doing string compares.
        Hide
        Wojtek Piaseczny added a comment -

        In the new implementation, sort=false and sort=lex are the same. In the previous implementation, specifying sort=false would sort sortable-numbers (like sint) numerically, rather lexicographically. So there's a slight inconsistency between the new and old implementation. I like the idea of adding a numeric sort order.

        Show
        Wojtek Piaseczny added a comment - In the new implementation, sort=false and sort=lex are the same. In the previous implementation, specifying sort=false would sort sortable-numbers (like sint) numerically, rather lexicographically. So there's a slight inconsistency between the new and old implementation. I like the idea of adding a numeric sort order.
        Hide
        Lars Kotthoff added a comment -

        What do you mean by wrong order? When sorting lexicographically, I would expect it to return something like 1, 10, 11, 2, 20, 3, 4, 5 etc. as this is the lexicographic order. Ordering those by numeric value should probably be a new parameter, like "facet.sort=numeric".

        Show
        Lars Kotthoff added a comment - What do you mean by wrong order? When sorting lexicographically, I would expect it to return something like 1, 10, 11, 2, 20, 3, 4, 5 etc. as this is the lexicographic order. Ordering those by numeric value should probably be a new parameter, like "facet.sort=numeric".
        Hide
        Wojtek Piaseczny added a comment -

        I found one small issue: sorting lexicographically gives the wrong order when the field type is a sortable number (e.g. sint). I think the fix is to change DistribFieldFacet.getLexSorted() to compare on term number instead of name.

        (maybe this should be a comment on SOLR 781 instead).

        Show
        Wojtek Piaseczny added a comment - I found one small issue: sorting lexicographically gives the wrong order when the field type is a sortable number (e.g. sint). I think the fix is to change DistribFieldFacet.getLexSorted() to compare on term number instead of name. (maybe this should be a comment on SOLR 781 instead).
        Hide
        Lars Kotthoff added a comment -

        I think this is fixed with SOLR-781 and can be closed. Can anybody confirm?

        Show
        Lars Kotthoff added a comment - I think this is fixed with SOLR-781 and can be closed. Can anybody confirm?
        Hide
        Lars Kotthoff added a comment -

        Attaching new patch which, as per Yonik's suggestion, doesn't refine facet counts when minCount <=1 and sort == false.

        Regarding the term order problem, I think it might be best to redesign the whole thing. Term order as such probably doesn't make a lot of sense, it's really used to get the terms in alphabetical order. So why not drop term order altogether and introduce lexicographic order instead. This probably also means changing the facet.sort parameter from a boolean to a string – "count" or "lex".

        This would mean that the same query will return the same results for both distributed and non-distributed setups. The downside is that the facets always need to be sorted (i.e. more computation), but according to Wojtek's measurements this seems to be neglible.

        If this seems like the right direction to go to, we should probably open another issue for this.

        Show
        Lars Kotthoff added a comment - Attaching new patch which, as per Yonik's suggestion, doesn't refine facet counts when minCount <=1 and sort == false. Regarding the term order problem, I think it might be best to redesign the whole thing. Term order as such probably doesn't make a lot of sense, it's really used to get the terms in alphabetical order. So why not drop term order altogether and introduce lexicographic order instead. This probably also means changing the facet.sort parameter from a boolean to a string – "count" or "lex". This would mean that the same query will return the same results for both distributed and non-distributed setups. The downside is that the facets always need to be sorted (i.e. more computation), but according to Wojtek's measurements this seems to be neglible. If this seems like the right direction to go to, we should probably open another issue for this.
        Hide
        Yonik Seeley added a comment -

        I think we still need the over-requesting, regardless of the sort order.

        Good point Lars. There are some special cases that we probably want to optimize for, esp if they are the usual cases. if mincount<=1 and sort==false it seems we can safely avoid any over-request.

        Show
        Yonik Seeley added a comment - I think we still need the over-requesting, regardless of the sort order. Good point Lars. There are some special cases that we probably want to optimize for, esp if they are the usual cases. if mincount<=1 and sort==false it seems we can safely avoid any over-request.
        Hide
        Lars Kotthoff added a comment -

        I think we still need the over-requesting, regardless of the sort order. Through the facet.mincount parameter the count influences the window to be returned even if we're sorting by term order. We do need to get the exact counts. We can only relax this if facet.mincount isn't specified.

        Show
        Lars Kotthoff added a comment - I think we still need the over-requesting, regardless of the sort order. Through the facet.mincount parameter the count influences the window to be returned even if we're sorting by term order. We do need to get the exact counts. We can only relax this if facet.mincount isn't specified.
        Hide
        Yonik Seeley added a comment -

        How is the refinement different?

        The lists are merged by sorting, then a window is selected to return.

        For facets sorted by count, the counts (and thus the order) may not be correct. Facet terms outside the window may actually be inside the window (or above it, changing the order). The code calculates what terms (out of all those returned) could possibly change what the window looks like, taking into account the maximum value it could have for each shard based on what it saw from each.

        For facets sorted by index order, the window into the list will not change based on any facet refinements. Actually, we should remove any amount of "over-requesting" when sorting by index order. The counts could be off though... we may not have the counts from certain shards. So we only have to request counts when:

        • the term is in the window to be returned
        • the term is missing from a shard, and that term comes after the last term the shard returned.
        Show
        Yonik Seeley added a comment - How is the refinement different? The lists are merged by sorting, then a window is selected to return. For facets sorted by count, the counts (and thus the order) may not be correct. Facet terms outside the window may actually be inside the window (or above it, changing the order). The code calculates what terms (out of all those returned) could possibly change what the window looks like, taking into account the maximum value it could have for each shard based on what it saw from each. For facets sorted by index order, the window into the list will not change based on any facet refinements. Actually, we should remove any amount of "over-requesting" when sorting by index order. The counts could be off though... we may not have the counts from certain shards. So we only have to request counts when: the term is in the window to be returned the term is missing from a shard, and that term comes after the last term the shard returned.
        Hide
        Wojtek Piaseczny added a comment -

        I measured how long the sort takes in my case: for collections of up to about 1100 items it took less than 1 ms. Compared with my average response times (~1000 ms), it truly is insignificant.

        Show
        Wojtek Piaseczny added a comment - I measured how long the sort takes in my case: for collections of up to about 1100 items it took less than 1 ms. Compared with my average response times (~1000 ms), it truly is insignificant.
        Hide
        Lars Kotthoff added a comment -

        How is the refinement different? It seems to me that both methods operate on the same list, which is just sorted according to different criteria. Facets with less than the minimum count etc. still need to be filtered out the same way.

        Show
        Lars Kotthoff added a comment - How is the refinement different? It seems to me that both methods operate on the same list, which is just sorted according to different criteria. Facets with less than the minimum count etc. still need to be filtered out the same way.
        Hide
        Yonik Seeley added a comment -

        I separated out topFacets from listFacets because I assumed the merging logic would be different .
        list facet refinement seems pretty different than top facet refinement.

        Show
        Yonik Seeley added a comment - I separated out topFacets from listFacets because I assumed the merging logic would be different . list facet refinement seems pretty different than top facet refinement.
        Hide
        Lars Kotthoff added a comment -

        My gut feeling is that it's probably not worth the effort. Usually the number of facets will be very small and sorting the list will be very quick. For large numbers of facets, everything else will slow down as well. I don't think that sorting the list will become a bottleneck.

        Of course, you can make some real measurements and prove me wrong
        Unless the gain is significant, I'd prefer keeping the code as simple as possible.

        Show
        Lars Kotthoff added a comment - My gut feeling is that it's probably not worth the effort. Usually the number of facets will be very small and sorting the list will be very quick. For large numbers of facets, everything else will slow down as well. I don't think that sorting the list will become a bottleneck. Of course, you can make some real measurements and prove me wrong Unless the gain is significant, I'd prefer keeping the code as simple as possible.
        Hide
        Wojtek Piaseczny added a comment -

        Do you think it's worth keeping lexicographically sorted facets separate (i.e. in listFacets) and optimizing the combining code specifically for that case?

        I misunderstood what term number meant, thanks for clarifying.

        Show
        Wojtek Piaseczny added a comment - Do you think it's worth keeping lexicographically sorted facets separate (i.e. in listFacets) and optimizing the combining code specifically for that case? I misunderstood what term number meant, thanks for clarifying.
        Hide
        Lars Kotthoff added a comment -

        As far as I could see, "listFacets" wasn't used anywhere in the code – everything was put into "topFacets" (which behaved exactly the same as "listFacets") and sorted in the final stage. I guess the idea was to have a data structure which automatically sorts entries as new ones are inserted. Optimising the combining of sorted data from the shards can't be done in the general case, as e.g. for sort by count the data returned affects the order – merging the results from 2 shards can change the order completely compared because you're adding up the counts.

        If you specify facet.sort=false in a non-distributed setup, the facet values are ordered by term number as well – which happens to be the lexicographic order. Also, sorting by term name only really makes sense for string or text fields.

        Show
        Lars Kotthoff added a comment - As far as I could see, "listFacets" wasn't used anywhere in the code – everything was put into "topFacets" (which behaved exactly the same as "listFacets") and sorted in the final stage. I guess the idea was to have a data structure which automatically sorts entries as new ones are inserted. Optimising the combining of sorted data from the shards can't be done in the general case, as e.g. for sort by count the data returned affects the order – merging the results from 2 shards can change the order completely compared because you're adding up the counts. If you specify facet.sort=false in a non-distributed setup, the facet values are ordered by term number as well – which happens to be the lexicographic order. Also, sorting by term name only really makes sense for string or text fields.
        Hide
        Wojtek Piaseczny added a comment -

        I notice this patch consolidates the FacetInfo class' 'topFacets' & 'listFacets' into a single collection. Do you know why these were ever separate? I had guessed it was because when 'listFacets' was being populated (only when facet.sort=false), it was being populated with data that was already ordered correctly per shard, and that combining the data (while maintaining sort order) from each shard could be optimized beyond calling Arrays.sort().

        Shouldn't the getTermSorted method always use name instead of term number?

        Show
        Wojtek Piaseczny added a comment - I notice this patch consolidates the FacetInfo class' 'topFacets' & 'listFacets' into a single collection. Do you know why these were ever separate? I had guessed it was because when 'listFacets' was being populated (only when facet.sort=false), it was being populated with data that was already ordered correctly per shard, and that combining the data (while maintaining sort order) from each shard could be optimized beyond calling Arrays.sort(). Shouldn't the getTermSorted method always use name instead of term number?
        Hide
        Lars Kotthoff added a comment -

        Patch implementing support for facet.sort=false with distributed search.

        Simplified the existing code by removing things that weren't used and added an additional method to sort facets by term number. Note that this implementation is flawed – there are no distributed term numbers, therefore it's possible that several facet values have the same term number because they come from different shards. In that case the lexicographic order of the value will be used to break the tie. It is possible that facet values are returned in a different order for distributed and local setups.

        The patch also adds a unit test, commented out because of the issues described above.

        Show
        Lars Kotthoff added a comment - Patch implementing support for facet.sort=false with distributed search. Simplified the existing code by removing things that weren't used and added an additional method to sort facets by term number. Note that this implementation is flawed – there are no distributed term numbers, therefore it's possible that several facet values have the same term number because they come from different shards. In that case the lexicographic order of the value will be used to break the tie. It is possible that facet values are returned in a different order for distributed and local setups. The patch also adds a unit test, commented out because of the issues described above.
        Hide
        Wojtek Piaseczny added a comment -

        I want to show dynamic ranges for numeric facets. My (and probably most??) implementation for building dynamic ranges requires the numbers to be sorted by facet name rather than by occurrences.

        Show
        Wojtek Piaseczny added a comment - I want to show dynamic ranges for numeric facets. My (and probably most??) implementation for building dynamic ranges requires the numbers to be sorted by facet name rather than by occurrences.
        Hide
        Lars Kotthoff added a comment -

        Ah, right. Sorry, I missed that part. Out of interest, what is your use case that requires facet.sort=false?

        Show
        Lars Kotthoff added a comment - Ah, right. Sorry, I missed that part. Out of interest, what is your use case that requires facet.sort=false?
        Hide
        Wojtek Piaseczny added a comment -

        I'm trying to get all facets from all shards, ordered alphabetically. I'm not trying to disable facets.

        SOLR-755 solves the case where facet.limit=-1 and facet.sort=true, but not where facet.sort=false. A comment in SOLR-755 states:

        "Note that one must add facet.sorted=true in conjunction with facet.limit=-1 since it defaults to unsorted (or sorted-by-term), and this is currently unsupported by distributed search."

        Show
        Wojtek Piaseczny added a comment - I'm trying to get all facets from all shards, ordered alphabetically. I'm not trying to disable facets. SOLR-755 solves the case where facet.limit=-1 and facet.sort=true, but not where facet.sort=false. A comment in SOLR-755 states: "Note that one must add facet.sorted=true in conjunction with facet.limit=-1 since it defaults to unsorted (or sorted-by-term), and this is currently unsupported by distributed search."
        Hide
        Lars Kotthoff added a comment -

        Have you tried a version which incorporates SOLR-755? This should have been fixed in that issue.

        Show
        Lars Kotthoff added a comment - Have you tried a version which incorporates SOLR-755 ? This should have been fixed in that issue.
        Hide
        Amit Nithian added a comment -

        Are you purposely trying to disable facet results? Why not simply pass in facet=false parameter?

        Show
        Amit Nithian added a comment - Are you purposely trying to disable facet results? Why not simply pass in facet=false parameter?

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Wojtek Piaseczny
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development