Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      The majority of the logic in StatsValuesFactory for dealing with stats over fields just uses the ValueSource API. There's very little reason we can't generalize this to support computing aggregate stats over any arbitrary function (or the scores from an arbitrary query).

      Example...

      stats.field={!func key=mean_rating mean=true}prod(user_rating,pow(editor_rating,2))
      

      ...would mean that we can compute a conceptual "rating" for each doc by multiplying the user_rating field by the square of the editor_rating field, and then we'd compute the mean of that "rating" across all docs in the set and return it as "mean_rating"

      1. SOLR-6354.patch
        55 kB
        Hoss Man
      2. SOLR-6354.patch
        49 kB
        Hoss Man
      3. SOLR-6354.patch
        15 kB
        Hoss Man
      4. TstStatsComponent.java
        19 kB
        Crawdaddy

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          Proposed implementation...

          • SimpleStats should check the local params of each stats.field for a "type" param
            • type is already treated special in local param parsing as as way to specify the parser type for the body of the string, ie: {!foo}... is just an alias for {!type=foo}...
          • if "type" param doesn't exist, or is the empty string, treat the param value as a regular field name and get it's value source (just like is done today)
          • if "type" does exist, do normal QParserPlugin lookup to parse the param value
            • if the resulting Query is instanceof FunctionQuery, cast it and pull out it's ValueSource
            • else: wrap the Query in a QueryValueSource
          • add a new subclass of NumericStatsValues that can be constructed directly with a ValueSource
          Show
          Hoss Man added a comment - Proposed implementation... SimpleStats should check the local params of each stats.field for a "type" param type is already treated special in local param parsing as as way to specify the parser type for the body of the string, ie: { !foo}... is just an alias for { !type=foo}... if "type" param doesn't exist, or is the empty string, treat the param value as a regular field name and get it's value source (just like is done today) if "type" does exist, do normal QParserPlugin lookup to parse the param value if the resulting Query is instanceof FunctionQuery , cast it and pull out it's ValueSource else: wrap the Query in a QueryValueSource add a new subclass of NumericStatsValues that can be constructed directly with a ValueSource
          Hide
          Crawdaddy added a comment -

          Hey Hoss, I think you mean StatsInfo should do the check you propose? At least, that's where I found I needed to start intercepting this.

          I have all but the last line in your proposal implemented in StatsInfo.parse in a custom copy of StatsComponent, but having a little trouble seeing how to go from ValueSource -> StatsValues. Can you provide a couple more pointers here?

          Show
          Crawdaddy added a comment - Hey Hoss, I think you mean StatsInfo should do the check you propose? At least, that's where I found I needed to start intercepting this. I have all but the last line in your proposal implemented in StatsInfo.parse in a custom copy of StatsComponent, but having a little trouble seeing how to go from ValueSource -> StatsValues. Can you provide a couple more pointers here?
          Hide
          Hoss Man added a comment -

          Hey Hoss, I think you mean StatsInfo should do the check you propose? At least, that's where I found I needed to start intercepting this.

          Hmmm... i guess there's two code paths here that have to be considered? I think i was looking at SimpleStats because that's where the individual stats.field values are parsed into the "localParams" variable – so we definitely need to check for a type there, and then do the right thing as far as dealing with the NumericStatsValues in terms of methods like SimpleStats.getStatsFields() and/or SimpleStats.getFieldCacheStats()

          But i think you're right about StatsInfo ... looks like we need to account for it there as well ... i'd need to look over this more closely to understand what's going on there and why....

          ...having a little trouble seeing how to go from ValueSource -> StatsValues. Can you provide a couple more pointers here?

          If you look at StatsValuesFactory.createStatsValues and the existing AbstractStatsValues you'll see that it maintains references to the SchemaField/FieldType of the associated field – but the meat of the logic is in asking the FieldType for it's ValueSource to then accumulate values from. So what i had in mind was refactoring "field" specific bits of AbstractStatsValues as needed so that a (new) subclass could be completely field agnostic, and just do the accumulation directly from a VlaueSource passed in (based on the FunctionQParser in most cases)

          I have all but the last line in your proposal implemented...

          FYI: no need to "hold back" changes until they are "done" ... yonik's law of patches...

          A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all.

          Show
          Hoss Man added a comment - Hey Hoss, I think you mean StatsInfo should do the check you propose? At least, that's where I found I needed to start intercepting this. Hmmm... i guess there's two code paths here that have to be considered? I think i was looking at SimpleStats because that's where the individual stats.field values are parsed into the "localParams" variable – so we definitely need to check for a type there, and then do the right thing as far as dealing with the NumericStatsValues in terms of methods like SimpleStats.getStatsFields() and/or SimpleStats.getFieldCacheStats() But i think you're right about StatsInfo ... looks like we need to account for it there as well ... i'd need to look over this more closely to understand what's going on there and why.... ...having a little trouble seeing how to go from ValueSource -> StatsValues. Can you provide a couple more pointers here? If you look at StatsValuesFactory.createStatsValues and the existing AbstractStatsValues you'll see that it maintains references to the SchemaField/FieldType of the associated field – but the meat of the logic is in asking the FieldType for it's ValueSource to then accumulate values from. So what i had in mind was refactoring "field" specific bits of AbstractStatsValues as needed so that a (new) subclass could be completely field agnostic, and just do the accumulation directly from a VlaueSource passed in (based on the FunctionQParser in most cases) I have all but the last line in your proposal implemented... FYI: no need to "hold back" changes until they are "done" ... yonik's law of patches... A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all.
          Hide
          Crawdaddy added a comment -

          I did manage to get something working, but it has some problems. I see what you mean on the StatsValuesFactory refactor. Because I was experimenting with this in a copy of StatsComponent (which by the way, is not easy to do!), I ended up not modifying StatsValuesFactory at all. Instead, I wrote a couple inner classes extending NumericStatsValues and FieldType that take a ValueSource as input. In SimpleStats.getStatsFields() and SimpleStats.getFieldCacheStats(), I catch the exception to schema.getField() that is thrown when trying to look up non-existent function fields, and return my custom NSV/FT-based classes - stored in rb._statsInfo - instead. This seems to have broken stat faceting, however, I think since other calls to StatsValuesFactory.createStatsValues outside StatsComponent don't use the same logic. No doubt yours is the better road to travel - I was shooting for quick-n-dirty to see if this was a useful approach to a stats problem.

          Also, regarding changing the output key: either this was either broken already, or I broke it somehow.

          Would it be useful for me to upload what I have as a reference point for you or someone else to implement more coherently? I'm not sure I have the bandwidth to pull down a virgin Solr and migrate the changes at this time.

          Show
          Crawdaddy added a comment - I did manage to get something working, but it has some problems. I see what you mean on the StatsValuesFactory refactor. Because I was experimenting with this in a copy of StatsComponent (which by the way, is not easy to do!), I ended up not modifying StatsValuesFactory at all. Instead, I wrote a couple inner classes extending NumericStatsValues and FieldType that take a ValueSource as input. In SimpleStats.getStatsFields() and SimpleStats.getFieldCacheStats(), I catch the exception to schema.getField() that is thrown when trying to look up non-existent function fields, and return my custom NSV/FT-based classes - stored in rb._statsInfo - instead. This seems to have broken stat faceting, however, I think since other calls to StatsValuesFactory.createStatsValues outside StatsComponent don't use the same logic. No doubt yours is the better road to travel - I was shooting for quick-n-dirty to see if this was a useful approach to a stats problem. Also, regarding changing the output key: either this was either broken already, or I broke it somehow. Would it be useful for me to upload what I have as a reference point for you or someone else to implement more coherently? I'm not sure I have the bandwidth to pull down a virgin Solr and migrate the changes at this time.
          Hide
          Hoss Man added a comment -

          Would it be useful for me to upload what I have as a reference point for you or someone else to implement more coherently?

          Yes, absolutely ... didn't you see my last comment?

          A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all.

          Show
          Hoss Man added a comment - Would it be useful for me to upload what I have as a reference point for you or someone else to implement more coherently? Yes, absolutely ... didn't you see my last comment? A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all.
          Hide
          Crawdaddy added a comment -

          Here's my start at this.

          If desired, this can be built as a replacement Stats Component by putting it's JAR in /webapp/WEB-INF/lib so it can access ResponseBuilder's package private _statsInfo.

          I added a few "// SOLR-6354" comments around my changes.

          Show
          Crawdaddy added a comment - Here's my start at this. If desired, this can be built as a replacement Stats Component by putting it's JAR in /webapp/WEB-INF/lib so it can access ResponseBuilder's package private _statsInfo. I added a few "// SOLR-6354 " comments around my changes.
          Hide
          Hoss Man added a comment -

          Also, regarding changing the output key: either this was either broken already, or I broke it somehow.

          Yeah, beyond the initial problem you pointed out about code duplication dealing with where/how the StatsValues instances are constructed, theres also inconsistencies in when/how/if the localparams are parsed. I'm tackling that in SOLR-6507 first.

          Show
          Hoss Man added a comment - Also, regarding changing the output key: either this was either broken already, or I broke it somehow. Yeah, beyond the initial problem you pointed out about code duplication dealing with where/how the StatsValues instances are constructed, theres also inconsistencies in when/how/if the localparams are parsed. I'm tackling that in SOLR-6507 first.
          Hide
          Crawdaddy added a comment -

          Excellent - thanks Hoss. Maybe crosstalk, but do you think some of this work would make it easier for us to do stats on scores? Scores mean something in my application and I want to use them in the Stats component.

          Show
          Crawdaddy added a comment - Excellent - thanks Hoss. Maybe crosstalk, but do you think some of this work would make it easier for us to do stats on scores? Scores mean something in my application and I want to use them in the Stats component.
          Hide
          Hoss Man added a comment -

          First start at a patch along the lines of the original idea i outlined (now that SOLR-6507 has cleaned up the duplicate/broken localparams code).

          In this patch, you can see the logic for figuring out if/when we're dealing with field faceting vs query/function faceting. In the test changes, you can see it being smart about requests to facet over a function that really just normalizes away to being a single field

          The next steps from here are labeled with nocommits – start propogating the StatsField instance directly to the various classes that deal with StatsValuesFactory so it can see if/when we need a schema based StatsValue instance, and when it should return a (new) StatsValue class that deals directly with a ValueSource


          do you think some of this work would make it easier for us to do stats on scores? Scores mean something in my application and I want to use them in the Stats component.

          in the same sense that you can use {!frange} to filter on the scores of an arbitrary query, we should ultimately be able to compute stats on the scores of an arbitrary query – but done in a second pass, regardless of wether or not hte specified query is the same as the "main" query.

          ie, something like this should work....

                    q = foo bar^34 baz
                stats = true
          stats.field = {!query key=score_stats v=$q}
          

          ...just as well as something like this...

                    q = foo bar^3.4 +baz
                stats = true
          stats.field = {!lucene key=some_other_query_score_stats}yak^1.2 +zazz
          

          ...but the first won't be doing anything special to compute the stats "on the fly" as documents are collected.

          Show
          Hoss Man added a comment - First start at a patch along the lines of the original idea i outlined (now that SOLR-6507 has cleaned up the duplicate/broken localparams code). In this patch, you can see the logic for figuring out if/when we're dealing with field faceting vs query/function faceting. In the test changes, you can see it being smart about requests to facet over a function that really just normalizes away to being a single field The next steps from here are labeled with nocommits – start propogating the StatsField instance directly to the various classes that deal with StatsValuesFactory so it can see if/when we need a schema based StatsValue instance, and when it should return a (new) StatsValue class that deals directly with a ValueSource do you think some of this work would make it easier for us to do stats on scores? Scores mean something in my application and I want to use them in the Stats component. in the same sense that you can use {!frange} to filter on the scores of an arbitrary query, we should ultimately be able to compute stats on the scores of an arbitrary query – but done in a second pass, regardless of wether or not hte specified query is the same as the "main" query. ie, something like this should work.... q = foo bar^34 baz stats = true stats.field = {!query key=score_stats v=$q} ...just as well as something like this... q = foo bar^3.4 +baz stats = true stats.field = {!lucene key=some_other_query_score_stats}yak^1.2 +zazz ...but the first won't be doing anything special to compute the stats "on the fly" as documents are collected.
          Hide
          Crawdaddy added a comment -

          Nice work Hoss - thank you very much for the patch and the reminder about frange. Will give this a try.

          Show
          Crawdaddy added a comment - Nice work Hoss - thank you very much for the patch and the reminder about frange. Will give this a try.
          Hide
          Hoss Man added a comment -

          Much progress – now actually supports stats over arbitrary functions.

          • promoted StatsField to a top level public class
            • added accessors for the SchemaField, ValueSource, and calcDistinct props
          • changed the API for StastValueFactory.createStatsValues to take in the StatsField directly
            • affected callers: DocValuesStats, UnInvertedField.getStats, FieldFacetStats
            • propogate StatsField all the way down to the StatsValues constructors
          • changed AbstractStatsValues.accumulate(BytesRef) & setNextReader to conditionally check wether we have a FieldType or not
            • this seemed more straight forwrad and less complicated to understand then my initial idea of refactoring out new base classes due to some of the code duplication that would be needed in both the concrete leaf level classes (ie: NumericValueSourceStatsValues & NumericSchemaFieldStatsValues) which would need to have differnet parents (AbstractSchemaFieldStatsValues vs AbstractStatsValues) but would still collect the same types of stats in largely the same way.
          • more progress on tests

          Still some more nocommits, but their mostly just around some more robust test coverage and adding javadocs to some simple methods

          might be ready to commit tomorow.


          Crawdaddy - would love to hear any feedback you have if you get a chance to try this out.

          Show
          Hoss Man added a comment - Much progress – now actually supports stats over arbitrary functions. promoted StatsField to a top level public class added accessors for the SchemaField, ValueSource, and calcDistinct props changed the API for StastValueFactory.createStatsValues to take in the StatsField directly affected callers: DocValuesStats, UnInvertedField.getStats, FieldFacetStats propogate StatsField all the way down to the StatsValues constructors changed AbstractStatsValues.accumulate(BytesRef) & setNextReader to conditionally check wether we have a FieldType or not this seemed more straight forwrad and less complicated to understand then my initial idea of refactoring out new base classes due to some of the code duplication that would be needed in both the concrete leaf level classes (ie: NumericValueSourceStatsValues & NumericSchemaFieldStatsValues) which would need to have differnet parents (AbstractSchemaFieldStatsValues vs AbstractStatsValues) but would still collect the same types of stats in largely the same way. more progress on tests Still some more nocommits, but their mostly just around some more robust test coverage and adding javadocs to some simple methods might be ready to commit tomorow. Crawdaddy - would love to hear any feedback you have if you get a chance to try this out.
          Hide
          Hoss Man added a comment -

          Updated patch, all tests & javadocs written - no more nocommits.

          Other then tests, there's really only one code change between this patch and the last one – and that was fixing AbstractStatsValues.setNextReader to call ValueSource.newContext() instead of using Collections.emptyMap() – it's never really been a problem before, but it was problematic now if you tried to do stats over a QueryValueSource.

          I'm hoping to get this committed on monday unless anyone sees any problems.

          Show
          Hoss Man added a comment - Updated patch, all tests & javadocs written - no more nocommits. Other then tests, there's really only one code change between this patch and the last one – and that was fixing AbstractStatsValues.setNextReader to call ValueSource.newContext() instead of using Collections.emptyMap() – it's never really been a problem before, but it was problematic now if you tried to do stats over a QueryValueSource. I'm hoping to get this committed on monday unless anyone sees any problems.
          Hide
          ASF subversion and git services added a comment -

          Commit 1626856 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1626856 ]

          SOLR-6354: stats.field can now be used to generate stats over the numeric results of arbitrary functions

          Show
          ASF subversion and git services added a comment - Commit 1626856 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1626856 ] SOLR-6354 : stats.field can now be used to generate stats over the numeric results of arbitrary functions
          Hide
          ASF subversion and git services added a comment -

          Commit 1626875 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1626875 ]

          SOLR-6354: stats.field can now be used to generate stats over the numeric results of arbitrary functions (merge r1626856)

          Show
          ASF subversion and git services added a comment - Commit 1626875 from hossman@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1626875 ] SOLR-6354 : stats.field can now be used to generate stats over the numeric results of arbitrary functions (merge r1626856)
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Hoss Man
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development