Solr
  1. Solr
  2. SOLR-2522

add syntax for selecting the min or max of a multivalued field in value source functions

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Initial request...

      Switch max() and min() functions to work on multiValued fields so we can

      do sort=min(fieldname) asc and the sort would work on multiValued fields...

      ...this specific syntax has been spun off into SOLR-7853, but the underlying functionality s being implemented here using a new optional second argument to the field() function: field(multivalued_field_name,min) and field(multivalued_field_name,max).

      1. SOLR-2522.patch
        37 kB
        Hoss Man
      2. SOLR-2522.patch
        34 kB
        Hoss Man
      3. SOLR-2522.patch
        30 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          Chris Emerson added a comment -

          This would be incredibly helpful! Voting up.

          As a side note, I actually tried for a while to do this as it wasn't clear to me from the documentation on min() that it couldn't be done. As it stands I'm having to do sorting, and therefore pagination, on the client side which is a real drag, and necessitates bringing back the entire result set every time a new page is required.

          Show
          Chris Emerson added a comment - This would be incredibly helpful! Voting up. As a side note, I actually tried for a while to do this as it wasn't clear to me from the documentation on min() that it couldn't be done. As it stands I'm having to do sorting, and therefore pagination, on the client side which is a real drag, and necessitates bringing back the entire result set every time a new page is required.
          Hide
          Jan Høydahl added a comment -

          Ping() - anyone who have thought more on this?

          Show
          Jan Høydahl added a comment - Ping() - anyone who have thought more on this?
          Hide
          Flavio Pompermaier added a comment -

          Lucene has it (https://issues.apache.org/jira/browse/LUCENE-5454), elasticsearch too..why Solr cannot?

          Show
          Flavio Pompermaier added a comment - Lucene has it ( https://issues.apache.org/jira/browse/LUCENE-5454 ), elasticsearch too..why Solr cannot?
          Hide
          Ere Maijala added a comment -

          Voted too.

          Show
          Ere Maijala added a comment - Voted too.
          Hide
          Erick Erickson added a comment -

          bq: why Solr cannot?

          Patches welcome of course.....

          Show
          Erick Erickson added a comment - bq: why Solr cannot? Patches welcome of course.....
          Hide
          Hoss Man added a comment -

          I don't think this was possible back when the issue was first filed because of how the ValueSource API was tied so closely with the FieldCache API.

          But with DocValues API, this is in fact now possible – but only for fields that actually use DocValues. The UninvertedReader's implementation of SortedSetDocValues doesn't support RandomAccessOrds so we can't support those (yet - i'll file a related issue to see if it's possible to improve that later)

          The attached patch has a starting point for an implementation of this for all of the TrieFields and some basic tests (but i want to add some more randomized tests as well, in addition to tests of some of the error cases.)

          The main thing that's missing from this patch is the syntax – getting "min(multivalued_field_name)" to work is non trivial, so for now i just hacked in a "minf(fieldname)" and "maxf(fieldname)" function to get something i could write tests against – working on the ValueSourceParser changes to get the syntax we all want to work is the next big thing on the agenda.

          The other thing i want to look into is LUCENE-6609 – the same comments there about directly using the SortField for the underlying field should also correlate here: changing SortedSetFieldSource.getSortField() to return an instance of SortedSetSortField.

          Show
          Hoss Man added a comment - I don't think this was possible back when the issue was first filed because of how the ValueSource API was tied so closely with the FieldCache API. But with DocValues API, this is in fact now possible – but only for fields that actually use DocValues. The UninvertedReader's implementation of SortedSetDocValues doesn't support RandomAccessOrds so we can't support those (yet - i'll file a related issue to see if it's possible to improve that later) The attached patch has a starting point for an implementation of this for all of the TrieFields and some basic tests (but i want to add some more randomized tests as well, in addition to tests of some of the error cases.) The main thing that's missing from this patch is the syntax – getting "min(multivalued_field_name)" to work is non trivial, so for now i just hacked in a "minf(fieldname)" and "maxf(fieldname)" function to get something i could write tests against – working on the ValueSourceParser changes to get the syntax we all want to work is the next big thing on the agenda. The other thing i want to look into is LUCENE-6609 – the same comments there about directly using the SortField for the underlying field should also correlate here: changing SortedSetFieldSource.getSortField() to return an instance of SortedSetSortField.
          Hide
          Hoss Man added a comment -

          The main thing thats missing from this patch is the syntax – getting "min(multivalued_field_name)" to work is non trivial, so for now i just hacked in a "minf(fieldname)" and "maxf(fieldname)" function to get something i could write tests against – working on the ValueSourceParser changes to get the syntax we all want to work is the next big thing on the agenda.

          The more i look into this, the harder i realize it is.

          The root complexity is that the ValueSourceParsers all delegate to FunctionQParser.parseValueSource() (or parseValueSourceList()) when expecting an argument that can be an arbitrary (nested) ValueSource – this is how min & max work now. FQP.parseValueSource() handles the logic of figuring out what hte argument is (literal, nested function, $variable, field name, etc...) and if it's a field name, calls f.getType().getValueSource(f, this) on the associated SchemaField – which for multivalued fields throws an exception that gets propogated up.

          My initial thinking was that i could refactor parseValueSource() to support another flag for either specifying the MultiValueSelector, or indicating that we want a "defered evaluation" of the underlying SchemaFields (ie: return some mock FieldBasedValueSource that doesn't call FieldType.getValueSource until used, so the min/max functions can explicitly call getSingleValueSource() instead if they encounter a single argument) but then i realized that because of how FunctionQParser deals with $variable derefrencing – and the QParser abstraction (variables might refer to other queries, which get automatically unwrapped if they are FunctionQueries) then even that type of refacotring solution wouldn't work in simple use cases like this...

          ...
          & fl=id,min($my_f)
          & my_f=some_multi_valued_field_name
          

          So I'm going to punt on getting the min(some_multi_valued_field_name) (and max(some_multi_valued_field_name)) syntax working, and leave that as a usibillitiy improvement for the future.

          What I think is nice, clean, very feasible (so feasible i already implemented it), solution for now (that can still serve as a useful "advanced" syntax moving forward even if we add the simple syntax later) is to support field(some_multi_valued_field_name,'min') – ie: add a new optional 2nd arg to the field() function.


          This new patch has that new syntax supported, along with cleanup of most of the nocommits.

          I think it's pretty much good to go as is, but i want to still add some more tests (see remaining nocommits)

          Show
          Hoss Man added a comment - The main thing thats missing from this patch is the syntax – getting "min(multivalued_field_name)" to work is non trivial, so for now i just hacked in a "minf(fieldname)" and "maxf(fieldname)" function to get something i could write tests against – working on the ValueSourceParser changes to get the syntax we all want to work is the next big thing on the agenda. The more i look into this, the harder i realize it is. The root complexity is that the ValueSourceParsers all delegate to FunctionQParser.parseValueSource() (or parseValueSourceList()) when expecting an argument that can be an arbitrary (nested) ValueSource – this is how min & max work now. FQP.parseValueSource() handles the logic of figuring out what hte argument is (literal, nested function, $variable, field name, etc...) and if it's a field name, calls f.getType().getValueSource(f, this) on the associated SchemaField – which for multivalued fields throws an exception that gets propogated up. My initial thinking was that i could refactor parseValueSource() to support another flag for either specifying the MultiValueSelector, or indicating that we want a "defered evaluation" of the underlying SchemaFields (ie: return some mock FieldBasedValueSource that doesn't call FieldType.getValueSource until used, so the min/max functions can explicitly call getSingleValueSource() instead if they encounter a single argument) but then i realized that because of how FunctionQParser deals with $variable derefrencing – and the QParser abstraction (variables might refer to other queries, which get automatically unwrapped if they are FunctionQueries) then even that type of refacotring solution wouldn't work in simple use cases like this... ... & fl=id,min($my_f) & my_f=some_multi_valued_field_name So I'm going to punt on getting the min(some_multi_valued_field_name) (and max(some_multi_valued_field_name) ) syntax working, and leave that as a usibillitiy improvement for the future. What I think is nice, clean, very feasible (so feasible i already implemented it), solution for now (that can still serve as a useful "advanced" syntax moving forward even if we add the simple syntax later) is to support field(some_multi_valued_field_name,'min') – ie: add a new optional 2nd arg to the field() function. This new patch has that new syntax supported, along with cleanup of most of the nocommits. I think it's pretty much good to go as is, but i want to still add some more tests (see remaining nocommits)
          Hide
          Hoss Man added a comment -

          This patch cleans up the remaining nocommits in the test (ie: adding randomized tests and tests for good errors in unsupported situations).

          I plan on committing & backporting tommorow unless anyone has any concerns

          Show
          Hoss Man added a comment - This patch cleans up the remaining nocommits in the test (ie: adding randomized tests and tests for good errors in unsupported situations). I plan on committing & backporting tommorow unless anyone has any concerns
          Hide
          ASF subversion and git services added a comment -

          Commit 1693625 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1693625 ]

          SOLR-2522: new two argument option for the existing field() function; picks the min/max value of a docValues field to use as a ValueSource: "field(field_name,min)" and "field(field_name,max)"

          Show
          ASF subversion and git services added a comment - Commit 1693625 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1693625 ] SOLR-2522 : new two argument option for the existing field() function; picks the min/max value of a docValues field to use as a ValueSource: "field(field_name,min)" and "field(field_name,max)"
          Hide
          ASF subversion and git services added a comment -

          Commit 1693628 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1693628 ]

          SOLR-2522: new two argument option for the existing field() function; picks the min/max value of a docValues field to use as a ValueSource: "field(field_name,min)" and "field(field_name,max)" (merge r1693625)

          Show
          ASF subversion and git services added a comment - Commit 1693628 from hossman@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1693628 ] SOLR-2522 : new two argument option for the existing field() function; picks the min/max value of a docValues field to use as a ValueSource: "field(field_name,min)" and "field(field_name,max)" (merge r1693625)
          Hide
          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release

          Show
          Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release
          Hide
          David Smiley added a comment -

          BTW I noticed a performance improvement possibility in the implementations of the ValueFiller returned from code in TrieLongField and probably the other field types that were introduced by this issue. Look at the fillValue method. The code there is populating "exists" and is then populating the "value". According to a recent interaction I had with Adrien on some issue or another on fillValue, looking up "exists" is potentially one disk seek and looking up the "value" is another. By relying on the knowledge that the default value is '0' when exists is false, you can only bother checking for "exists" when the value is 0 to differentiate between the default 0 versus an explicit 0. For Example LongFieldSource's fillValue() looks like this:

          mval.value = arr.get(doc);
          mval.exists = mval.value != 0 || valid.get(doc);
          

          In TrieLongField's longVal() it should check if the bytes is 0 length and if so return 0 instead of attempting to decode, which will fail (I tried).

          Show
          David Smiley added a comment - BTW I noticed a performance improvement possibility in the implementations of the ValueFiller returned from code in TrieLongField and probably the other field types that were introduced by this issue. Look at the fillValue method. The code there is populating "exists" and is then populating the "value". According to a recent interaction I had with Adrien on some issue or another on fillValue, looking up "exists" is potentially one disk seek and looking up the "value" is another. By relying on the knowledge that the default value is '0' when exists is false, you can only bother checking for "exists" when the value is 0 to differentiate between the default 0 versus an explicit 0. For Example LongFieldSource's fillValue() looks like this: mval.value = arr.get(doc); mval.exists = mval.value != 0 || valid.get(doc); In TrieLongField's longVal() it should check if the bytes is 0 length and if so return 0 instead of attempting to decode, which will fail (I tried).
          Hide
          Hoss Man added a comment -

          The code there is populating "exists" and is then populating the "value". According to a recent interaction I had with Adrien on some issue or another on fillValue, looking up "exists" is potentially one disk seek and looking up the "value" is another. ...

          Maybe i'm missunderstanding what you're saying...

          I'm pretty sure what you describe is only true in case where NumericDocValues are used under the covers (such as in your example: LongFieldSource) – because the only way to know in that case if a value exists is to independently call DocValues.getDocsWithField to get a Bits instance.

          But that's not applicable here, where the underlying "on disk" representation is SortedSetDocValues. In this case the "exists" information comes directly from the ord value – values that don't exist have an ord of -1 (note the implementation of the exists() methods in the new code added by this issue. The new ValueFillers all follow the same pattern as DocTermsIndexDocValues, first using the ordinal value to determine if a value exists before trying to assign it. (the only diff here is delegating to the exists() method which already encapsulates the ordVal check.)

          In TrieLongField's longVal() it should check if the bytes is 0 length and if so return 0 instead of attempting to decode, which will fail (I tried).

          Hmmm... the start of your comment mentioned that you thought these changes would be a perf improvement, but the wording at the end your comment ("will fail (i tried)") sounds like you're saying you found/demonstrated a bug ... but it's not exactly clear to me what exactly the bug is or how to reproduce it?

          If there is a bug, can you please open a new jira (since this feature was already released in 5.3) with either a test case or an example of how to reproduce?

          (FWIW: i'm working on a blog post about this new feature with some benchmarks comparing it to sorting on single valued field. Even if i missunderstood about whether you found a bug, if you can whip up a patch demonstrating your perf improvement idea – even if it's just a single field type – i'm happy to test it as well, and flesh it out to all field types).

          Show
          Hoss Man added a comment - The code there is populating "exists" and is then populating the "value". According to a recent interaction I had with Adrien on some issue or another on fillValue, looking up "exists" is potentially one disk seek and looking up the "value" is another. ... Maybe i'm missunderstanding what you're saying... I'm pretty sure what you describe is only true in case where NumericDocValues are used under the covers (such as in your example: LongFieldSource) – because the only way to know in that case if a value exists is to independently call DocValues.getDocsWithField to get a Bits instance. But that's not applicable here, where the underlying "on disk" representation is SortedSetDocValues. In this case the "exists" information comes directly from the ord value – values that don't exist have an ord of -1 (note the implementation of the exists() methods in the new code added by this issue. The new ValueFillers all follow the same pattern as DocTermsIndexDocValues, first using the ordinal value to determine if a value exists before trying to assign it. (the only diff here is delegating to the exists() method which already encapsulates the ordVal check.) In TrieLongField's longVal() it should check if the bytes is 0 length and if so return 0 instead of attempting to decode, which will fail (I tried). Hmmm... the start of your comment mentioned that you thought these changes would be a perf improvement, but the wording at the end your comment ("will fail (i tried)") sounds like you're saying you found/demonstrated a bug ... but it's not exactly clear to me what exactly the bug is or how to reproduce it? If there is a bug, can you please open a new jira (since this feature was already released in 5.3) with either a test case or an example of how to reproduce? (FWIW: i'm working on a blog post about this new feature with some benchmarks comparing it to sorting on single valued field. Even if i missunderstood about whether you found a bug, if you can whip up a patch demonstrating your perf improvement idea – even if it's just a single field type – i'm happy to test it as well, and flesh it out to all field types).
          Hide
          David Smiley added a comment -

          You make good points; I'm glad we're having this interaction. I can see there shouldn't be an extra seek() now.

          One possible improvement is to avoid a double-lookup of the ordinal – first with exists(doc) and then by doc(0).

          Hmmm... the start of your comment mentioned that you thought these changes would be a perf improvement, but the wording at the end your comment ("will fail (i tried)") sounds like you're saying you found/demonstrated a bug ... but it's not exactly clear to me what exactly the bug is or how to reproduce it?

          Sorry, I could have been more clear. I mean that if (somehow) longVal(doc) gets called when the document in fact has no value then BytesRef will have 0 length and NumericUtils.prefixCodedToLong will throw an exception. I did test that NumericUtils.prefixCodedToLong throws an exception when given a 0 length BytesRef (executed via a Groovy shell). I will file a bug report and link to this issue to show how to trigger this error.

          Show
          David Smiley added a comment - You make good points; I'm glad we're having this interaction. I can see there shouldn't be an extra seek() now. One possible improvement is to avoid a double-lookup of the ordinal – first with exists(doc) and then by doc(0). Hmmm... the start of your comment mentioned that you thought these changes would be a perf improvement, but the wording at the end your comment ("will fail (i tried)") sounds like you're saying you found/demonstrated a bug ... but it's not exactly clear to me what exactly the bug is or how to reproduce it? Sorry, I could have been more clear. I mean that if (somehow) longVal(doc) gets called when the document in fact has no value then BytesRef will have 0 length and NumericUtils.prefixCodedToLong will throw an exception. I did test that NumericUtils.prefixCodedToLong throws an exception when given a 0 length BytesRef (executed via a Groovy shell). I will file a bug report and link to this issue to show how to trigger this error.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Bill Bell
            • Votes:
              9 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development