[SOLR-12839] add a 'prelim_sort' option to JSON faceting - ASF JIRA

Chris M. Hostetter added a comment - 08/Oct/18 03:23

I've started looking into this, and i think it's a very viable idea.

Ironically this is something that's fairly trivial to implement in the SolrCloud situation because of existing division in labor between the FacetFieldProcessor handling the shard requests (which only needs to worry about sort), and the FacetFieldMerger (which looks at sort during phase#1 and then the resort after refinement). In the single node Solr situation, things are actaully much more complex because of how the SlotAcc(umulators) are used and collection of (non-sort) stats is defered until final buckets have been determined.

The attached patch includes a "complete" cloud solution with lots of tests, and a straw man implemention for single node solr that isn't very good – it completely fails in the case of resorting on non-trivial stats (like relatedness() or percentiles() , which defeats the whole point – but was a good starting point for writing tests and to help me understand why i was initially struggling to have a simple impl using the SlotAcc's directly.

The syntax is pretty simple, just a new resort option that uses the same syntax as the sort param...

json.facet={
  categories : {
    type : terms,
    field : cat,
    limit : 5,
    sort : "count desc",
    resort : "consumer_value desc",
    facet : { consumer_value : "sum(div(popularity,price))" }
  }
}

...the semantics are conceptually straightforward:

overrequest + (effective) limit is used to pick the initial buckets to consier acording to sort
- patch modifies the single-node solr case to respect explicit overrequest if resort is used
- still no implicit default overrequest for single-node even if resort is used
once initial buckets are selected (and refined), resort is used to pick the final top buckets

My plan is to try and move forward w/fixing the single node model per the rough game plan noted in the comments, unless people have any concerns about the overall concept/API/approach.

Chris M. Hostetter added a comment - 08/Oct/18 03:23 I've started looking into this, and i think it's a very viable idea. Ironically this is something that's fairly trivial to implement in the SolrCloud situation because of existing division in labor between the FacetFieldProcessor handling the shard requests (which only needs to worry about sort ), and the FacetFieldMerger (which looks at sort during phase#1 and then the resort after refinement). In the single node Solr situation, things are actaully much more complex because of how the SlotAcc(umulators) are used and collection of (non-sort) stats is defered until final buckets have been determined. The attached patch includes a "complete" cloud solution with lots of tests, and a straw man implemention for single node solr that isn't very good – it completely fails in the case of resorting on non-trivial stats (like relatedness() or percentiles() , which defeats the whole point – but was a good starting point for writing tests and to help me understand why i was initially struggling to have a simple impl using the SlotAcc's directly. The syntax is pretty simple, just a new resort option that uses the same syntax as the sort param... json.facet={ categories : { type : terms, field : cat, limit : 5, sort : "count desc", resort : "consumer_value desc", facet : { consumer_value : "sum(div(popularity,price))" } } } ...the semantics are conceptually straightforward: overrequest + (effective) limit is used to pick the initial buckets to consier acording to sort patch modifies the single-node solr case to respect explicit overrequest if resort is used still no implicit default overrequest for single-node even if resort is used once initial buckets are selected (and refined), resort is used to pick the final top buckets My plan is to try and move forward w/fixing the single node model per the rough game plan noted in the comments, unless people have any concerns about the overall concept/API/approach.

Chris M. Hostetter added a comment - 12/Oct/18 17:08

Updated patch with a "real" implementation of resort for single node solr instances, and some more tests i wrote along the way.

still some nocommits – mostly involving additional tests we need, and a few places where i think the code can be cleaned up – but i wanted to get this posted ASAP in case people want to try it out ... i probably won't have time to work on this again for at least a week.

Chris M. Hostetter added a comment - 12/Oct/18 17:08 Updated patch with a "real" implementation of resort for single node solr instances, and some more tests i wrote along the way. still some nocommits – mostly involving additional tests we need, and a few places where i think the code can be cleaned up – but i wanted to get this posted ASAP in case people want to try it out ... i probably won't have time to work on this again for at least a week.

Lucene/Solr QA added a comment - 13/Oct/18 06:30

-1 overall

Vote	Subsystem	Runtime	Comment
			Prechecks
+1	test4tests	0m 0s	The patch appears to include 4 new or modified test files.
			master Compile Tests
+1	compile	2m 21s	master passed
			Patch Compile Tests
+1	compile	3m 5s	the patch passed
+1	javac	3m 5s	the patch passed
+1	Release audit (RAT)	3m 5s	the patch passed
+1	Check forbidden APIs	3m 5s	the patch passed
-1	Validate source patterns	3m 5s	Validate source patterns validate-source-patterns failed
			Other Tests
+1	unit	85m 0s	core in the patch passed.
		95m 35s

Subsystem	Report/Notes
JIRA Issue	~~SOLR-12839~~
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12943677/SOLR-12839.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 9c8ffab
ant	version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java	1.8.0_172
Validate source patterns	https://builds.apache.org/job/PreCommit-SOLR-Build/198/artifact/out/patch-validate-source-patterns-root.txt
Test Results	https://builds.apache.org/job/PreCommit-SOLR-Build/198/testReport/
modules	C: solr/core U: solr/core
Console output	https://builds.apache.org/job/PreCommit-SOLR-Build/198/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

Lucene/Solr QA added a comment - 13/Oct/18 06:30 -1 overall Vote Subsystem Runtime Comment Prechecks +1 test4tests 0m 0s The patch appears to include 4 new or modified test files. master Compile Tests +1 compile 2m 21s master passed Patch Compile Tests +1 compile 3m 5s the patch passed +1 javac 3m 5s the patch passed +1 Release audit (RAT) 3m 5s the patch passed +1 Check forbidden APIs 3m 5s the patch passed -1 Validate source patterns 3m 5s Validate source patterns validate-source-patterns failed Other Tests +1 unit 85m 0s core in the patch passed. 95m 35s Subsystem Report/Notes JIRA Issue SOLR-12839 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12943677/SOLR-12839.patch Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux Build tool ant Personality /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh git revision master / 9c8ffab ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 Default Java 1.8.0_172 Validate source patterns https://builds.apache.org/job/PreCommit-SOLR-Build/198/artifact/out/patch-validate-source-patterns-root.txt Test Results https://builds.apache.org/job/PreCommit-SOLR-Build/198/testReport/ modules C: solr/core U: solr/core Console output https://builds.apache.org/job/PreCommit-SOLR-Build/198/console Powered by Apache Yetus 0.7.0 http://yetus.apache.org This message was automatically generated.

Yonik Seeley added a comment - 14/Oct/18 04:00

We should perhaps think about how to extend to N sorts instead of 2.
Also keeping in mind that sort should be able to have tiebreaks someday.

Brainstorming syntax:
Maybe just append a number to our existing sort syntax, so we would get something like "foo desc, bar asc 50" (bar would be a tiebreak in this case)
So two resorts in a row could be
"field1 asc 100; field2 desc 10" or a slightly more decomposed array ["field1 asc 100","field2 desc 10"]
Or given that this is just an extension of the sort syntax, it could even just go in the "sort" param itself and not bother with "resort"
sort:"count desc 5" could be a synonym for sort:"count desc",limit:5

It's late and my slides for Activate aren't done.... take it for what it's worth

Yonik Seeley added a comment - 14/Oct/18 04:00 We should perhaps think about how to extend to N sorts instead of 2. Also keeping in mind that sort should be able to have tiebreaks someday. Brainstorming syntax: Maybe just append a number to our existing sort syntax, so we would get something like "foo desc, bar asc 50" (bar would be a tiebreak in this case) So two resorts in a row could be "field1 asc 100; field2 desc 10" or a slightly more decomposed array ["field1 asc 100","field2 desc 10"] Or given that this is just an extension of the sort syntax, it could even just go in the "sort" param itself and not bother with "resort" sort:"count desc 5" could be a synonym for sort:"count desc",limit:5 It's late and my slides for Activate aren't done.... take it for what it's worth

Chris M. Hostetter added a comment - 14/Oct/18 05:20

We should perhaps think about how to extend to N sorts instead of 2.

I'm confused ... why/how/when would it make sense to resort multiple times? Do you have an example of what you have in mind?

My usecase/idea here is directly analogous to the "rerank" concept of queries and sorting (docs) by score: start with a "simple" query/sort when doing the first pass to find matches/buckets, then "rerank/resort only the topN on a more complex query/function that you don't want to compute against every doc/bucket.

I can't really think of a situation where it would would be useful to resort buckets more then once – or how it could work practically given the 2 phrase nature of distributed facet refinement. (as long as we only do 2 phrases, it seems like it would just be wasted cycles to do anything but the first and last sort?)

Maybe just append a number to our existing sort syntax, so we would get something like "foo desc, bar asc 50" (bar would be a tiebreak in this case)

I'm not really understanding what you're suggestion ... can you elaborate on what would happen in that example? if "foo desc" is the primary sort, and "bar asc" is the tiebreaker, then what is being resorted on? and where does the "50" in your example come into play,? (is that suppose to be how many of the top buckets we want to resort? the rerank parser has a reRankDocs local param but i didn't really see much point in having that concept for facet resorting since the shards will have already computed all the stat values for all the buckets once they've been refined, and so why bother artificially limiting the number of buckets being considered for resorting beyond that?)

Also keeping in mind that sort should be able to have tiebreaks someday.

...

Or given that this is just an extension of the sort syntax, it could even just go in the "sort" param itself and not bother with "resort"

I briefly considered adding resorting as a new option to the sort syntax, but quickly dismissed it for the precise reason that you mention: i expect the sort option to get more robust, and supporting secondary, tertiary, etc... tie breakers - so trying to shoe horn resorting into it as well seemed like a bad idea (plus i wanted to ensure that whatever features were added to the sort syntax down the road – tie breakers, etc... – would also be supported for resorting as well.)

Chris M. Hostetter added a comment - 14/Oct/18 05:20 We should perhaps think about how to extend to N sorts instead of 2. I'm confused ... why/how/when would it make sense to resort multiple times? Do you have an example of what you have in mind? My usecase/idea here is directly analogous to the "rerank" concept of queries and sorting (docs) by score: start with a "simple" query/sort when doing the first pass to find matches/buckets, then "rerank/resort only the topN on a more complex query/function that you don't want to compute against every doc/bucket. I can't really think of a situation where it would would be useful to resort buckets more then once – or how it could work practically given the 2 phrase nature of distributed facet refinement. (as long as we only do 2 phrases, it seems like it would just be wasted cycles to do anything but the first and last sort?) Maybe just append a number to our existing sort syntax, so we would get something like "foo desc, bar asc 50" (bar would be a tiebreak in this case) I'm not really understanding what you're suggestion ... can you elaborate on what would happen in that example? if "foo desc" is the primary sort, and "bar asc" is the tiebreaker, then what is being resorted on? and where does the "50" in your example come into play,? (is that suppose to be how many of the top buckets we want to resort? the rerank parser has a reRankDocs local param but i didn't really see much point in having that concept for facet resorting since the shards will have already computed all the stat values for all the buckets once they've been refined, and so why bother artificially limiting the number of buckets being considered for resorting beyond that?) Also keeping in mind that sort should be able to have tiebreaks someday. ... Or given that this is just an extension of the sort syntax, it could even just go in the "sort" param itself and not bother with "resort" I briefly considered adding resorting as a new option to the sort syntax, but quickly dismissed it for the precise reason that you mention: i expect the sort option to get more robust, and supporting secondary, tertiary, etc... tie breakers - so trying to shoe horn resorting into it as well seemed like a bad idea (plus i wanted to ensure that whatever features were added to the sort syntax down the road – tie breakers, etc... – would also be supported for resorting as well.)

Yonik Seeley added a comment - 14/Oct/18 13:44

if "foo desc" is the primary sort, and "bar asc" is the tiebreaker, then what is being resorted on?

"foo desc, bar asc 50" was an example of a single sort with tiebreak and a limit (no resort).
If one wanted a single string version ";" would be the divider. For example adding a resort with a tiebreak: "foo desc, bar asc 50; baz desc, qux asc 10"

why/how/when would it make sense to resort multiple times?

If there are use cases for starting with N sorted things and reducing that to K with a different sort, then it's just sort of recursive. Why would there be use cases for one resort and not two resorts?

One use case that comes to mind are stock screens I've seen that consist of multiple sorting and "take top N" steps.
Example: Sort by current dividend yield and take the top 100, then sort those by low PE and take the top 50, then sort those by total return 1 year and take the top 10.

or how it could work practically given the 2 phrase nature of distributed facet refinement.

Hmm, good point. Over the long term I'd always imagined the number of phases could be variable, so It's more of a current implementation detail (albeit a very major one). It would currently kill the usefulness in distributed though.
Anyway we don't have to worry about multiple resorts now as long as we can unambiguously upgrade if desired later (i.e. whatever the resort spec looks like, if we can unambiguously wrap an array around it later and specify multiple of them, then we're good)

Yonik Seeley added a comment - 14/Oct/18 13:44 if "foo desc" is the primary sort, and "bar asc" is the tiebreaker, then what is being resorted on? "foo desc, bar asc 50" was an example of a single sort with tiebreak and a limit (no resort). If one wanted a single string version ";" would be the divider. For example adding a resort with a tiebreak: "foo desc, bar asc 50; baz desc, qux asc 10" why/how/when would it make sense to resort multiple times? If there are use cases for starting with N sorted things and reducing that to K with a different sort, then it's just sort of recursive. Why would there be use cases for one resort and not two resorts? One use case that comes to mind are stock screens I've seen that consist of multiple sorting and "take top N" steps. Example: Sort by current dividend yield and take the top 100, then sort those by low PE and take the top 50, then sort those by total return 1 year and take the top 10. or how it could work practically given the 2 phrase nature of distributed facet refinement. Hmm, good point. Over the long term I'd always imagined the number of phases could be variable, so It's more of a current implementation detail (albeit a very major one). It would currently kill the usefulness in distributed though. Anyway we don't have to worry about multiple resorts now as long as we can unambiguously upgrade if desired later (i.e. whatever the resort spec looks like, if we can unambiguously wrap an array around it later and specify multiple of them, then we're good)

Chris M. Hostetter added a comment - 22/Oct/18 23:57

"foo desc, bar asc 50" was an example of a single sort with tiebreak and a limit (no resort). If one wanted a single string version ";" would be the divider. For example adding a resort with a tiebreak: "foo desc, bar asc 50; baz desc, qux asc 10"

Ok ... i realize now that you were discussing 2 diff ideas and giving 2 diff examples and i was conflating them – but i'm still not certain what you're saying the behavior of these examples would be, particulalry because (independent of the idea of resorting AND independent of the idea of supporting tiebreakers on sort/resort syntax) you ALSO seem to be suggesting a numeric "limit" that would be inlined as part of the sort/resort syntax – and this confuses me in 2 orthoginal ways:

are you suggesting this would be an alternative for the existing limit param on these facets?
- if so, what would be behavior if someone tired to do both? use the "inline limit" and use a "limit" param?
- if not, then what do you mean by "limit" in the above sentence?
assuming you did mean as a replacement/override of the existing limit param, i don't understand your example and what the value add of asking solr to resort the "top 10" by criteria "baz desc, qux asc" if we're already returning the "top 50"

If there are use cases for starting with N sorted things and reducing that to K with a different sort, then it's just sort of recursive. Why would there be use cases for one resort and not two resorts?
...
One use case that comes to mind are stock screens I've seen that consist of multiple sorting and "take top N" steps.

Example: Sort by current dividend yield and take the top 100, then sort those by low PE and take the top 50, then sort those by total return 1 year and take the top 10.

...again: if this is a situation where solr is returning the top 100 buckets, what's the value add in having solr resort the top 50 (and then the top 10 again) instead of just letting the client manipulate & re-order those same buckets?

I feel like maybe there is a disconnect in the principle of the ideas we are discussing?

As I mentioned when i created this issue, the overall goal i'm trying to address is to mirror the concept of the "reranking query" at a facet bucket level ... for addressing the performance cost of sorting by something complex/expensive.

Today you can ask solr:
- Compute expensive_function() for every bucket that exists, and sort all the buckets by that function – then return the top $limit buckets"
I want to be able to tell solr:
- "Compute cheaper_aproximation_of_expensive_function() for every bucket that exists, sort all the buckets by that function, and compute expensive_function() only for the top candidate buckets – then (once refinement/merging is complete) resort just the fully populated buckets by expensive_function()

...note in particular that I'm not even suggesting any sort of new resort_limit option or any hard and fast guarantees on the number of buckets that are "resorted" – just a way to tell solr "during the first pass, you can use this cheap function instead of the final expensive function i really care about" ... in essence just a "performance hint" or "save some CPU cycles" type feature

What you're describing on the other hand seems to be more akin to a "i want specific operations to be performed on my buckets" type feature ... the examples you're describing sound almost like a subset of a more robust scripting type functionality, or at the very least a multi stage "post processing" that might include filtering or collapsing of buckets?

...Lemme come back to this conceptual disconnect in a minute...

Anyway we don't have to worry about multiple resorts now as long as we can unambiguously upgrade if desired later (i.e. whatever the resort spec looks like, if we can unambiguously wrap an array around it later and specify multiple of them, then we're good)

right ... but if you're trying to future proof the API, there's also the question of "tiebreakers" look like when using the (existing) JSON object syntax for sorting instead of just the shorthand string syntax.

ie, if you completely ignore the concept of "resorting", today we support this...

json.facet={
  categories : {
    type : terms,
    field : cat,
    limit : 5,
    facet : { x : "sum(div(popularity,price))" },
    // can use short hand of "x desc"
    sort : { x : desc }, 
  }
}

...and if you assume you want to have multiple tiebreaker sorts then that would be something like...

json.facet={
  categories : {
    type : terms,
    field : cat,
    limit : 5,
    facet : { x : "sum(div(popularity,price))" },
    // can use short hand of "x desc, count desc"
    sort : [{ x : desc },
            { count : desc }], 
  }
}

...so in a hypothetical future world were you can have multiple "resort" options, each of which can be arrays of sort criteria JSON objects (which may or may not have their own "intermediate limits") then we have to imagine what we might want that to look like?

...revisiting the conceptual disconnect I mentioned above: Let's assume (since you've clearly already thought about it) that somewhere down the road we definitely do want more robust options for resorting/filtering/reducing buckets, then maybe the best way to more forward now with a short term improvement for the sorting/resorting on an expensive_function() vs cheaper_aproximation_of_expensive_function() performance hint/optimization – in a way that wouldn't hinder us down the road with more full featured bucket processing – would be to "invert" the API i proposed/implemented and renamed the option...

instead of adding a resort option, add an approximate_sort option and flip the meaning from what i was originally thinking...

json.facet={
  categories : {
    type : terms,
    field : cat,
    limit : 5,
    sort : "consumer_value desc",
    approximate_sort : "count desc",
    facet : { consumer_value : "sum(div(popularity,price))" }
  }
}

instead of using sort in phase#1, and resorting on the resort option after merging, we would use approximate_sort in phase#1, and sort in phase#2
from an end user perspective sort is the most important thing:
- the buckets to be returned will be in sort order
- approximate_sort is an expert level option only used as an inexpensive approximation when picking the buckets to be returned, where the exact number of buckets considered in this way is essentially an implementation detail
down the road, when we add "tiebreaker" support to "sort type options", it should be fairly trivial to support it for both options (with some generalization of the helper methods added in the patch)
farther down the road, if we want a more robust "sort, resort, process, filter the buckets returned" type option/syntax to replace the sort param – then we can add that independent of the approximate_sort option, and approximate_sort can still be used as a way to indicate an "inexpensive" way to initially rank the buckets to consider before any of that new functionality.

WDYT?

Chris M. Hostetter added a comment - 22/Oct/18 23:57 "foo desc, bar asc 50" was an example of a single sort with tiebreak and a limit (no resort). If one wanted a single string version ";" would be the divider. For example adding a resort with a tiebreak: "foo desc, bar asc 50; baz desc, qux asc 10" Ok ... i realize now that you were discussing 2 diff ideas and giving 2 diff examples and i was conflating them – but i'm still not certain what you're saying the behavior of these examples would be, particulalry because (independent of the idea of resorting AND independent of the idea of supporting tiebreakers on sort/resort syntax) you ALSO seem to be suggesting a numeric "limit" that would be inlined as part of the sort/resort syntax – and this confuses me in 2 orthoginal ways: are you suggesting this would be an alternative for the existing limit param on these facets? if so, what would be behavior if someone tired to do both? use the "inline limit" and use a "limit" param? if not, then what do you mean by "limit" in the above sentence? assuming you did mean as a replacement/override of the existing limit param, i don't understand your example and what the value add of asking solr to resort the "top 10" by criteria "baz desc, qux asc" if we're already returning the "top 50" If there are use cases for starting with N sorted things and reducing that to K with a different sort, then it's just sort of recursive. Why would there be use cases for one resort and not two resorts? ... One use case that comes to mind are stock screens I've seen that consist of multiple sorting and "take top N" steps. Example: Sort by current dividend yield and take the top 100, then sort those by low PE and take the top 50, then sort those by total return 1 year and take the top 10. ...again: if this is a situation where solr is returning the top 100 buckets, what's the value add in having solr resort the top 50 (and then the top 10 again) instead of just letting the client manipulate & re-order those same buckets? I feel like maybe there is a disconnect in the principle of the ideas we are discussing? As I mentioned when i created this issue, the overall goal i'm trying to address is to mirror the concept of the "reranking query" at a facet bucket level ... for addressing the performance cost of sorting by something complex/expensive. Today you can ask solr: Compute expensive_function() for every bucket that exists, and sort all the buckets by that function – then return the top $limit buckets" I want to be able to tell solr: "Compute cheaper_aproximation_of_expensive_function() for every bucket that exists, sort all the buckets by that function, and compute expensive_function() only for the top candidate buckets – then (once refinement/merging is complete) resort just the fully populated buckets by expensive_function() ...note in particular that I'm not even suggesting any sort of new resort_limit option or any hard and fast guarantees on the number of buckets that are "resorted" – just a way to tell solr "during the first pass, you can use this cheap function instead of the final expensive function i really care about" ... in essence just a "performance hint" or "save some CPU cycles" type feature What you're describing on the other hand seems to be more akin to a "i want specific operations to be performed on my buckets" type feature ... the examples you're describing sound almost like a subset of a more robust scripting type functionality, or at the very least a multi stage "post processing" that might include filtering or collapsing of buckets? ...Lemme come back to this conceptual disconnect in a minute... Anyway we don't have to worry about multiple resorts now as long as we can unambiguously upgrade if desired later (i.e. whatever the resort spec looks like, if we can unambiguously wrap an array around it later and specify multiple of them, then we're good) right ... but if you're trying to future proof the API, there's also the question of "tiebreakers" look like when using the (existing) JSON object syntax for sorting instead of just the shorthand string syntax. ie, if you completely ignore the concept of "resorting", today we support this... json.facet={ categories : { type : terms, field : cat, limit : 5, facet : { x : "sum(div(popularity,price))" }, // can use short hand of "x desc" sort : { x : desc }, } } ...and if you assume you want to have multiple tiebreaker sorts then that would be something like... json.facet={ categories : { type : terms, field : cat, limit : 5, facet : { x : "sum(div(popularity,price))" }, // can use short hand of "x desc, count desc" sort : [{ x : desc }, { count : desc }], } } ...so in a hypothetical future world were you can have multiple "resort" options, each of which can be arrays of sort criteria JSON objects (which may or may not have their own "intermediate limits") then we have to imagine what we might want that to look like? ...revisiting the conceptual disconnect I mentioned above: Let's assume (since you've clearly already thought about it) that somewhere down the road we definitely do want more robust options for resorting/filtering/reducing buckets, then maybe the best way to more forward now with a short term improvement for the sorting/resorting on an expensive_function() vs cheaper_aproximation_of_expensive_function() performance hint/optimization – in a way that wouldn't hinder us down the road with more full featured bucket processing – would be to "invert" the API i proposed/implemented and renamed the option... instead of adding a resort option, add an approximate_sort option and flip the meaning from what i was originally thinking... json.facet={ categories : { type : terms, field : cat, limit : 5, sort : "consumer_value desc", approximate_sort : "count desc", facet : { consumer_value : "sum(div(popularity,price))" } } } instead of using sort in phase#1, and resorting on the resort option after merging, we would use approximate_sort in phase#1, and sort in phase#2 from an end user perspective sort is the most important thing: the buckets to be returned will be in sort order approximate_sort is an expert level option only used as an inexpensive approximation when picking the buckets to be returned, where the exact number of buckets considered in this way is essentially an implementation detail down the road, when we add "tiebreaker" support to "sort type options", it should be fairly trivial to support it for both options (with some generalization of the helper methods added in the patch) farther down the road, if we want a more robust "sort, resort, process, filter the buckets returned" type option/syntax to replace the sort param – then we can add that independent of the approximate_sort option, and approximate_sort can still be used as a way to indicate an "inexpensive" way to initially rank the buckets to consider before any of that new functionality. WDYT?

Chris M. Hostetter added a comment - 24/Oct/18 01:49

patch updated to finish filling in the tests i wanted, and a fix forsome problems found with resorting when using 'allBuckets' that i found as a result ... this actaully simplifies the number of places in the code that need to care about resorting.

NOTE: this updated doesn't make any of the API/semantic changes i mentioned in my last comment (ie: still using "sort -> resort" not "approximate_sort -> sort") ... still waiting on feedback for that idea before diving in.

Chris M. Hostetter added a comment - 24/Oct/18 01:49 patch updated to finish filling in the tests i wanted, and a fix forsome problems found with resorting when using 'allBuckets' that i found as a result ... this actaully simplifies the number of places in the code that need to care about resorting. NOTE: this updated doesn't make any of the API/semantic changes i mentioned in my last comment (ie: still using "sort -> resort" not "approximate_sort -> sort") ... still waiting on feedback for that idea before diving in.

Lucene/Solr QA added a comment - 24/Oct/18 15:05

-1 overall

Vote	Subsystem	Runtime	Comment
			Prechecks
+1	test4tests	0m 0s	The patch appears to include 4 new or modified test files.
			master Compile Tests
+1	compile	3m 32s	master passed
			Patch Compile Tests
+1	compile	3m 46s	the patch passed
+1	javac	3m 46s	the patch passed
+1	Release audit (RAT)	3m 46s	the patch passed
+1	Check forbidden APIs	3m 46s	the patch passed
-1	Validate source patterns	3m 47s	Validate source patterns validate-source-patterns failed
			Other Tests
-1	unit	82m 56s	core in the patch failed.
		93m 44s

Reason	Tests
Failed junit tests	solr.search.QueryEqualityTest

Subsystem	Report/Notes
JIRA Issue	~~SOLR-12839~~
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12945318/SOLR-12839.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / e083b15
ant	version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java	1.8.0_172
Validate source patterns	https://builds.apache.org/job/PreCommit-SOLR-Build/211/artifact/out/patch-validate-source-patterns-root.txt
unit	https://builds.apache.org/job/PreCommit-SOLR-Build/211/artifact/out/patch-unit-solr_core.txt
Test Results	https://builds.apache.org/job/PreCommit-SOLR-Build/211/testReport/
modules	C: solr/core U: solr/core
Console output	https://builds.apache.org/job/PreCommit-SOLR-Build/211/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

Lucene/Solr QA added a comment - 24/Oct/18 15:05 -1 overall Vote Subsystem Runtime Comment Prechecks +1 test4tests 0m 0s The patch appears to include 4 new or modified test files. master Compile Tests +1 compile 3m 32s master passed Patch Compile Tests +1 compile 3m 46s the patch passed +1 javac 3m 46s the patch passed +1 Release audit (RAT) 3m 46s the patch passed +1 Check forbidden APIs 3m 46s the patch passed -1 Validate source patterns 3m 47s Validate source patterns validate-source-patterns failed Other Tests -1 unit 82m 56s core in the patch failed. 93m 44s Reason Tests Failed junit tests solr.search.QueryEqualityTest Subsystem Report/Notes JIRA Issue SOLR-12839 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12945318/SOLR-12839.patch Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux Build tool ant Personality /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh git revision master / e083b15 ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 Default Java 1.8.0_172 Validate source patterns https://builds.apache.org/job/PreCommit-SOLR-Build/211/artifact/out/patch-validate-source-patterns-root.txt unit https://builds.apache.org/job/PreCommit-SOLR-Build/211/artifact/out/patch-unit-solr_core.txt Test Results https://builds.apache.org/job/PreCommit-SOLR-Build/211/testReport/ modules C: solr/core U: solr/core Console output https://builds.apache.org/job/PreCommit-SOLR-Build/211/console Powered by Apache Yetus 0.7.0 http://yetus.apache.org This message was automatically generated.

Chris M. Hostetter added a comment - 26/Nov/18 16:37

Pinging yseeley@gmail.com – any thoughts/objections to the idea i mentioned on Oct 22 ...

... maybe the best way to more forward now with a short term improvement for the sorting/resorting on an expensive_function() vs cheaper_aproximation_of_expensive_function() performance hint/optimization – in a way that wouldn't hinder us down the road with more full featured bucket processing – would be to "invert" the API i proposed/implemented and renamed the option...
instead of adding a resort option, add an approximate_sort option and flip the meaning from what i was originally thinking...
json.facet={
  categories : {
    type : terms,
    field : cat,
    limit : 5,
    sort : "consumer_value desc",
    approximate_sort : "count desc",
    facet : { consumer_value : "sum(div(popularity,price))" }
  }
}
instead of using sort in phase#1, and resorting on the resort option after merging, we would use approximate_sort in phase#1, and sort in phase#2

from an end user perspective sort is the most important thing:

the buckets to be returned will be in sort order

approximate_sort is an expert level option only used as an inexpensive approximation when picking the buckets to be returned, where the exact number of buckets considered in this way is essentially an implementation detail

down the road, when we add "tiebreaker" support to "sort type options", it should be fairly trivial to support it for both options (with some generalization of the helper methods added in the patch)

farther down the road, if we want a more robust "sort, resort, process, filter the buckets returned" type option/syntax to replace the sort param – then we can add that independent of the approximate_sort option, and approximate_sort can still be used as a way to indicate an "inexpensive" way to initially rank the buckets to consider before any of that new functionality.

...this would be a fairly meaty re-working of the existing patch i've got in progress (sort->approx_sort, resort->sort, & flipping the syntax in all the test queries) so i'd rather not bite it off if it's something you have objections to in principle.

Chris M. Hostetter added a comment - 26/Nov/18 16:37 Pinging yseeley@gmail.com – any thoughts/objections to the idea i mentioned on Oct 22 ... ... maybe the best way to more forward now with a short term improvement for the sorting/resorting on an expensive_function() vs cheaper_aproximation_of_expensive_function() performance hint/optimization – in a way that wouldn't hinder us down the road with more full featured bucket processing – would be to "invert" the API i proposed/implemented and renamed the option... instead of adding a resort option, add an approximate_sort option and flip the meaning from what i was originally thinking... json.facet={ categories : { type : terms, field : cat, limit : 5, sort : "consumer_value desc", approximate_sort : "count desc", facet : { consumer_value : "sum(div(popularity,price))" } } } instead of using sort in phase#1, and resorting on the resort option after merging, we would use approximate_sort in phase#1, and sort in phase#2 from an end user perspective sort is the most important thing: the buckets to be returned will be in sort order approximate_sort is an expert level option only used as an inexpensive approximation when picking the buckets to be returned, where the exact number of buckets considered in this way is essentially an implementation detail down the road, when we add "tiebreaker" support to "sort type options", it should be fairly trivial to support it for both options (with some generalization of the helper methods added in the patch) farther down the road, if we want a more robust "sort, resort, process, filter the buckets returned" type option/syntax to replace the sort param – then we can add that independent of the approximate_sort option, and approximate_sort can still be used as a way to indicate an "inexpensive" way to initially rank the buckets to consider before any of that new functionality. ...this would be a fairly meaty re-working of the existing patch i've got in progress (sort->approx_sort, resort->sort, & flipping the syntax in all the test queries) so i'd rather not bite it off if it's something you have objections to in principle.

Chris M. Hostetter added a comment - 29/Nov/18 22:21

I've updated the patch to use prelim_sort as the new param name – not quite as unweildy as approximate_sort but still long enough to not be "common" and indicitive of something advanced/expert. (While also leaving the door open for using the concept of "resorting" and/or beefing up the sort syntax) for the future

Here's the syntax/example I used in the ref guide updates...

{
  categories:{
    type : terms,
    field : cat,
    refine: true,
    limit: 10,
    overrequest: 100,
    prelim_sort: "sales_rank desc",
    sort : "prod_quality desc",
    facet : {
      prod_quality : "avg(div(prod(rating,sales_rank),prod(num_returns,price)))"
      sales_rank : "sum(sales_rank)"
    }
  }
}

I think this is ready to commit unless anyone has any objections.

Chris M. Hostetter added a comment - 29/Nov/18 22:21 I've updated the patch to use prelim_sort as the new param name – not quite as unweildy as approximate_sort but still long enough to not be "common" and indicitive of something advanced/expert. (While also leaving the door open for using the concept of "resorting" and/or beefing up the sort syntax) for the future Here's the syntax/example I used in the ref guide updates... { categories:{ type : terms, field : cat, refine: true, limit: 10, overrequest: 100, prelim_sort: "sales_rank desc", sort : "prod_quality desc", facet : { prod_quality : "avg(div(prod(rating,sales_rank),prod(num_returns,price)))" sales_rank : "sum(sales_rank)" } } } I think this is ready to commit unless anyone has any objections.

Lucene/Solr QA added a comment - 30/Nov/18 16:53

+1 overall

Vote	Subsystem	Runtime	Comment
			Prechecks
+1	test4tests	0m 0s	The patch appears to include 4 new or modified test files.
			master Compile Tests
+1	compile	1m 40s	master passed
			Patch Compile Tests
+1	compile	1m 37s	the patch passed
+1	javac	1m 37s	the patch passed
+1	Release audit (RAT)	1m 43s	the patch passed
+1	Check forbidden APIs	1m 37s	the patch passed
+1	Validate source patterns	1m 37s	the patch passed
+1	Validate ref guide	1m 37s	the patch passed
			Other Tests
+1	unit	35m 57s	core in the patch passed.
		42m 56s

Subsystem	Report/Notes
JIRA Issue	~~SOLR-12839~~
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12950083/SOLR-12839.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide
uname	Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 0491623
ant	version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018
Default Java	1.8.0_191
Test Results	https://builds.apache.org/job/PreCommit-SOLR-Build/237/testReport/
modules	C: solr/core solr/solr-ref-guide U: solr
Console output	https://builds.apache.org/job/PreCommit-SOLR-Build/237/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

Lucene/Solr QA added a comment - 30/Nov/18 16:53 +1 overall Vote Subsystem Runtime Comment Prechecks +1 test4tests 0m 0s The patch appears to include 4 new or modified test files. master Compile Tests +1 compile 1m 40s master passed Patch Compile Tests +1 compile 1m 37s the patch passed +1 javac 1m 37s the patch passed +1 Release audit (RAT) 1m 43s the patch passed +1 Check forbidden APIs 1m 37s the patch passed +1 Validate source patterns 1m 37s the patch passed +1 Validate ref guide 1m 37s the patch passed Other Tests +1 unit 35m 57s core in the patch passed. 42m 56s Subsystem Report/Notes JIRA Issue SOLR-12839 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12950083/SOLR-12839.patch Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide uname Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux Build tool ant Personality /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh git revision master / 0491623 ant version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 Default Java 1.8.0_191 Test Results https://builds.apache.org/job/PreCommit-SOLR-Build/237/testReport/ modules C: solr/core solr/solr-ref-guide U: solr Console output https://builds.apache.org/job/PreCommit-SOLR-Build/237/console Powered by Apache Yetus 0.7.0 http://yetus.apache.org This message was automatically generated.

Yonik Seeley added a comment - 30/Nov/18 19:07

Yeah, I think this is OK - my main objection was going to be the name "approximate" which highly suggests that an estimate is fine. "prelim_sort" seems fine.

Yonik Seeley added a comment - 30/Nov/18 19:07 Yeah, I think this is OK - my main objection was going to be the name "approximate" which highly suggests that an estimate is fine. "prelim_sort" seems fine.

ASF subversion and git services added a comment - 30/Nov/18 22:58

Commit 5763d8b194009cdf602e983d56115be9480f69b8 in lucene-solr's branch refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5763d8b ]

~~SOLR-12839~~: JSON 'terms' Faceting now supports a 'prelim_sort' option to use when initially selecting the top ranking buckets, prior to the final 'sort' option used after refinement.

(cherry picked from commit 5dc988f5eeff78464d852f54ce7f06a801dcbfee)

Conflicts:
solr/CHANGES.txt

ASF subversion and git services added a comment - 30/Nov/18 22:58 Commit 5763d8b194009cdf602e983d56115be9480f69b8 in lucene-solr's branch refs/heads/branch_7x from Chris Hostetter [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5763d8b ] SOLR-12839 : JSON 'terms' Faceting now supports a 'prelim_sort' option to use when initially selecting the top ranking buckets, prior to the final 'sort' option used after refinement. (cherry picked from commit 5dc988f5eeff78464d852f54ce7f06a801dcbfee) Conflicts: solr/CHANGES.txt

ASF subversion and git services added a comment - 30/Nov/18 22:58

Commit 5dc988f5eeff78464d852f54ce7f06a801dcbfee in lucene-solr's branch refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5dc988f ]

~~SOLR-12839~~: JSON 'terms' Faceting now supports a 'prelim_sort' option to use when initially selecting the top ranking buckets, prior to the final 'sort' option used after refinement.

ASF subversion and git services added a comment - 30/Nov/18 22:58 Commit 5dc988f5eeff78464d852f54ce7f06a801dcbfee in lucene-solr's branch refs/heads/master from Chris Hostetter [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5dc988f ] SOLR-12839 : JSON 'terms' Faceting now supports a 'prelim_sort' option to use when initially selecting the top ranking buckets, prior to the final 'sort' option used after refinement.

Chris M. Hostetter added a comment - 30/Nov/18 23:01

my main objection was going to be the name "approximate" which highly suggests that an estimate is fine. "prelim_sort" seems fine.

+1 ... i hadn't considered that. I'm glad i disliked it for other reasons

Chris M. Hostetter added a comment - 30/Nov/18 23:01 my main objection was going to be the name "approximate" which highly suggests that an estimate is fine. "prelim_sort" seems fine. +1 ... i hadn't considered that. I'm glad i disliked it for other reasons

ASF subversion and git services added a comment - 01/Dec/18 17:09

Commit 5dc988f5eeff78464d852f54ce7f06a801dcbfee in lucene-solr's branch refs/heads/jira/http2 from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5dc988f ]

~~SOLR-12839~~: JSON 'terms' Faceting now supports a 'prelim_sort' option to use when initially selecting the top ranking buckets, prior to the final 'sort' option used after refinement.

ASF subversion and git services added a comment - 01/Dec/18 17:09 Commit 5dc988f5eeff78464d852f54ce7f06a801dcbfee in lucene-solr's branch refs/heads/jira/http2 from Chris Hostetter [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5dc988f ] SOLR-12839 : JSON 'terms' Faceting now supports a 'prelim_sort' option to use when initially selecting the top ranking buckets, prior to the final 'sort' option used after refinement.

Solr

add a 'prelim_sort' option to JSON faceting

Details

Description

Attachments

Attachments

Activity

People

Dates