Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Stats component currently computes all stats (except for one) every time because they are relatively cheap, and in some cases dependent on eachother for distrib computation – but if we start layering stats on other things it becomes unnecessarily expensive to compute all the stats when they just want the "sum" (and it will definitely become excessively verbose in the responses).

      The plan here is to use local params to make this configurable. All of the existing stat options could be modeled as a simple boolean param, but future params (like percentiles) might take in a more complex param value...

      Example:

      stats.field={!min=true max=true percentiles='99,99.999'}price
      stats.field={!mean=true}weight
      
      1. make-data-and-queries.pl
        3 kB
        Hoss Man
      2. make-data-and-queries.pl
        2 kB
        Hoss Man
      3. make-data-and-queries.pl
        2 kB
        Hoss Man
      4. SOLR-6349___bad_idea_broken.patch
        19 kB
        Hoss Man
      5. SOLR-6349.patch
        135 kB
        Hoss Man
      6. SOLR-6349.patch
        93 kB
        Hoss Man
      7. SOLR-6349.patch
        87 kB
        Hoss Man
      8. SOLR-6349.patch
        83 kB
        Hoss Man
      9. SOLR-6349.patch
        77 kB
        Hoss Man
      10. SOLR-6349.patch
        63 kB
        Hoss Man
      11. SOLR-6349.patch
        59 kB
        Hoss Man
      12. SOLR-6349-tflobbe.patch
        21 kB
        Tomás Fernández Löbbe
      13. SOLR-6349-tflobbe.patch
        16 kB
        Tomás Fernández Löbbe
      14. SOLR-6349-tflobbe.patch
        16 kB
        Tomás Fernández Löbbe
      15. SOLR-6349-xu.patch
        55 kB
        Xu Zhang
      16. SOLR-6349-xu.patch
        37 kB
        Xu Zhang
      17. SOLR-6349-xu.patch
        30 kB
        Xu Zhang
      18. SOLR-6349-xu.patch
        23 kB
        Xu Zhang

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          Proposed implementation...

          • Change StatsValuesFactory.createStatsValues (and the constructors for the various StatsValues impls) to take in the local params from the stats.field
          • each StatsValue impl should validate the stat params it's asked to compute
          • all stats should default to disabled, but we need a special backcompat case that if no stats are specified in local param, all current default stats are computed
            • we can't be lazy and just check 0==localparams.size() - need to check the actuals stats params because of local params like "ex" and "key"
          • for the distributed logic where things get a bit more complex (ie: distrib mean needs sum+count from all shards; distrib stddev needs sum+count+sumOfSquares from each shard) we can go two possible routes:
            • A) StatsValue needs a new method to be asked by StatsComponent what local params it needs when sending shard requests
              • in this case the localparams of the shard requests have diff localparams and the processing of those shard stats requests can be ignorant of the fact that they are distributed
            • B) StatsValue (via the factory method) needs to be informed when it's computing stats for an "isShard" request, so it can internally decide what per-shard values to return based on the input
              • in this case, the localparams are not modified per shard, but since "isShard=true" the StatsValue may return diff metrics then the ones requested so that the coordinator gets what it needs to aggregate.
            • I think i'm leaning towards option "B" - particularly because it simplifies the idea of how to deal with situations like "percentiles" where the per-shard info isn't really a stat that should have it's own localparma folks migth ask for.
          • deprecate stats.calcdistinct but use it as a default for the new corresponding localparam(s)
          Show
          Hoss Man added a comment - Proposed implementation... Change StatsValuesFactory.createStatsValues (and the constructors for the various StatsValues impls) to take in the local params from the stats.field each StatsValue impl should validate the stat params it's asked to compute all stats should default to disabled, but we need a special backcompat case that if no stats are specified in local param, all current default stats are computed we can't be lazy and just check 0==localparams.size() - need to check the actuals stats params because of local params like "ex" and "key" for the distributed logic where things get a bit more complex (ie: distrib mean needs sum+count from all shards; distrib stddev needs sum+count+sumOfSquares from each shard) we can go two possible routes: A) StatsValue needs a new method to be asked by StatsComponent what local params it needs when sending shard requests in this case the localparams of the shard requests have diff localparams and the processing of those shard stats requests can be ignorant of the fact that they are distributed B) StatsValue (via the factory method) needs to be informed when it's computing stats for an "isShard" request, so it can internally decide what per-shard values to return based on the input in this case, the localparams are not modified per shard, but since "isShard=true" the StatsValue may return diff metrics then the ones requested so that the coordinator gets what it needs to aggregate. I think i'm leaning towards option "B" - particularly because it simplifies the idea of how to deal with situations like "percentiles" where the per-shard info isn't really a stat that should have it's own localparma folks migth ask for. deprecate stats.calcdistinct but use it as a default for the new corresponding localparam(s)
          Hide
          Hoss Man added a comment -

          I've been trying to work on this off and on over the psat week and i keep running into problems.

          the attached "SOLR-6349___bad_idea_broken.patch" shows the (lack of) "progress" made and should stand as a warning sign of the problems of going down this particular route.

          Ultimately i kept running into 2 problems with trying to modify the existing code in AbstractStatsValues & its subclasses...

          • parsing the localparams to know when the (legacy) default set of stats should be computed, vs specific individual stats
            • this got very hairy very fast because of the superclass/subclass relationship – each class needs properties to track what it's suppose to be computing (ie: "boolean needCount"), and the default initalization of those properties needs to be based on wether any specific propeties are requested (ie: "count" in localparams), but to do that properly it means the superclass can't init it's properties until giving the subclass a chance to check if any of it's specific properties have been requested.
          • stats depending on other stats
            • subclasses need to be able to override the init logic for the superclasses properties based on the subclass specific stats (ie: numeric "mean" and "stddev" stats both depends on the generic "count" stat)
            • there's a different between neding to compute a stat locally (ie: count->mean) and wether we should return that stat to the caller.
              • for single node "mean" computation, we depend on the local computation of "sum" and "count" but we don't want to return either "sum" or "count"
              • for distributed "mean": each shard needs to compute & return "count" and "sum" but we don't need a per-shard "mean"; the coordinator needs to collect & combine all the (per-shard) "count" and "sum" stats to produce a "mean" that will be returned to the client, but it shouldn't return the combined "count" and "sum"

          ...which is why i ultimately abandoned this current patch.


          I have a rough idea forming in the back of my head about a better way to solve this problem via a bit of an overhaul to the internals of AbstractStatsValues ... trying to outline what i'm thinking...

          • keep the basic contract of "StatsValues" intact
          • keep the AbstractStatsValues and subclasses
            • these should focus only on the differences in the data type of the source data (ie: Number vs Date vs String vs Enum)
          • refactor the meat of how each stat is computed into smaller "Stats" classes (ie: "StatsMean", "StatsSum", "StatsCount").
            • these should be construct based on looping over the local params looking up the keys in some map (SPI?)
            • each "Stat" can ask the StatsValues holding it to construct other dependent Stats as needed ("StatsMean" would ask for a StatsSum and StatsCount
              • if those already exist (because they were explicitly requested or because some other Stat also needed them) it would be given a ref to the existing instances, else StatsValues would create a new instance
            • each "Stat" will have some boolean state indicating if it should write data in the response
              • StatsValues would set that state to true on a Stats instance only if it was explicitly requested via local param
              • Stats can forcible set that state to true on other stats, notably when the request "isShard" (ie: StatsMean can choose not to write anything to the response on an "isShard" request, but can tell the StatsSum and StatsCount objects that they must)

          I'm going to sit on this for a few days, focus on other things, and then come back and revisit it with fresh eyes and see if i can think of anything better (or if anyone else comes along with a better suggestion)

          Show
          Hoss Man added a comment - I've been trying to work on this off and on over the psat week and i keep running into problems. the attached " SOLR-6349 ___bad_idea_broken.patch" shows the (lack of) "progress" made and should stand as a warning sign of the problems of going down this particular route. Ultimately i kept running into 2 problems with trying to modify the existing code in AbstractStatsValues & its subclasses... parsing the localparams to know when the (legacy) default set of stats should be computed, vs specific individual stats this got very hairy very fast because of the superclass/subclass relationship – each class needs properties to track what it's suppose to be computing (ie: "boolean needCount"), and the default initalization of those properties needs to be based on wether any specific propeties are requested (ie: "count" in localparams), but to do that properly it means the superclass can't init it's properties until giving the subclass a chance to check if any of it's specific properties have been requested. stats depending on other stats subclasses need to be able to override the init logic for the superclasses properties based on the subclass specific stats (ie: numeric "mean" and "stddev" stats both depends on the generic "count" stat) there's a different between neding to compute a stat locally (ie: count->mean) and wether we should return that stat to the caller. for single node "mean" computation, we depend on the local computation of "sum" and "count" but we don't want to return either "sum" or "count" for distributed "mean": each shard needs to compute & return "count" and "sum" but we don't need a per-shard "mean"; the coordinator needs to collect & combine all the (per-shard) "count" and "sum" stats to produce a "mean" that will be returned to the client, but it shouldn't return the combined "count" and "sum" ...which is why i ultimately abandoned this current patch. I have a rough idea forming in the back of my head about a better way to solve this problem via a bit of an overhaul to the internals of AbstractStatsValues ... trying to outline what i'm thinking... keep the basic contract of "StatsValues" intact keep the AbstractStatsValues and subclasses these should focus only on the differences in the data type of the source data (ie: Number vs Date vs String vs Enum) refactor the meat of how each stat is computed into smaller "Stats" classes (ie: "StatsMean", "StatsSum", "StatsCount"). these should be construct based on looping over the local params looking up the keys in some map (SPI?) each "Stat" can ask the StatsValues holding it to construct other dependent Stats as needed ("StatsMean" would ask for a StatsSum and StatsCount if those already exist (because they were explicitly requested or because some other Stat also needed them) it would be given a ref to the existing instances, else StatsValues would create a new instance each "Stat" will have some boolean state indicating if it should write data in the response StatsValues would set that state to true on a Stats instance only if it was explicitly requested via local param Stats can forcible set that state to true on other stats, notably when the request "isShard" (ie: StatsMean can choose not to write anything to the response on an "isShard" request, but can tell the StatsSum and StatsCount objects that they must) I'm going to sit on this for a few days, focus on other things, and then come back and revisit it with fresh eyes and see if i can think of anything better (or if anyone else comes along with a better suggestion)
          Hide
          Hoss Man added a comment -

          Side note...

          one good bit of discovery that came out of my failed patch was a realization of why the StatsComponent currently doesn't write out any stats values in some cases – as discussed in this comment in SOLR-6351...

          https://issues.apache.org/jira/browse/SOLR-6351?focusedCommentId=14160658&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160658

          ...
          my thinking, was that:

          • the behavior when hanging stats off pivots should mirror that of regular stats
          • if you ask for a stats block , you should always get that block, so the client doesn't have to conditionally check if it's there before looking at the values.
          • the included stat values matter even if no doc has the stats.field, becuase one of the stats is in fact "missing" and that if you ask for stats, you should be able to look at that missing count. (and it should match up with your doc set size if the field is completely missing, etc...)

          but looking at an example of this now, i see that for simple field stats (w/o pivots), that's not even how it currently works – consider this URL using hte example data...

          http://localhost:8983/solr/select?rows=0&q=name:utf&stats.field=foo_d&stats.field=popularity&stats=true
          ...

          I found that the source of this behavior comes from this code in StatsComponent...

          if (isShard == true || (Long) stv.get("count") > 0) {
            stats_fields.add(statsField.getOutputKey(), stv);
          } else {
            stats_fields.add(statsField.getOutputKey(), null);
          }
          

          ...obviously if we're going to start making stats optional, we can't make the response data conditional on wether "count" is greater then 0, because there might not be a count.

          So we'll need to revamp this, and i'm more and more convinced my comment in SOLR-6351 is the right way to go...

          I thought (and still think) that the "correct" behavior for this query would be to get a stats block back for those fields where things like min/max/mean are "null", count==0, and missing=1 ... but that's not how it currently works.

          Show
          Hoss Man added a comment - Side note... one good bit of discovery that came out of my failed patch was a realization of why the StatsComponent currently doesn't write out any stats values in some cases – as discussed in this comment in SOLR-6351 ... https://issues.apache.org/jira/browse/SOLR-6351?focusedCommentId=14160658&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160658 ... my thinking, was that: the behavior when hanging stats off pivots should mirror that of regular stats if you ask for a stats block , you should always get that block, so the client doesn't have to conditionally check if it's there before looking at the values. the included stat values matter even if no doc has the stats.field, becuase one of the stats is in fact "missing" and that if you ask for stats, you should be able to look at that missing count. (and it should match up with your doc set size if the field is completely missing, etc...) but looking at an example of this now, i see that for simple field stats (w/o pivots), that's not even how it currently works – consider this URL using hte example data... http://localhost:8983/solr/select?rows=0&q=name:utf&stats.field=foo_d&stats.field=popularity&stats=true ... I found that the source of this behavior comes from this code in StatsComponent... if (isShard == true || ( Long ) stv.get( "count" ) > 0) { stats_fields.add(statsField.getOutputKey(), stv); } else { stats_fields.add(statsField.getOutputKey(), null ); } ...obviously if we're going to start making stats optional, we can't make the response data conditional on wether "count" is greater then 0, because there might not be a count. So we'll need to revamp this, and i'm more and more convinced my comment in SOLR-6351 is the right way to go... I thought (and still think) that the "correct" behavior for this query would be to get a stats block back for those fields where things like min/max/mean are "null", count==0, and missing=1 ... but that's not how it currently works.
          Hide
          Tomás Fernández Löbbe added a comment -

          What about doing something like this:
          StatsFields knows about the stats to calculate and the stats include in response.
          For the basic “standalone” statistics, those sets are equal, but when a statistic requires others, the “calculate” set includes the depended stats.
          All stats in the “calculate” set are calculated.
          All stats in the “include in response” set are responded.
          In the case of a shard request, all stats in the “calculate” set are also included in the response.
          StatsField gets the local params from the request, the StatsValues ask the statsField instance before calculating a stat / including in the response.
          The bad part of it is that all the stats need the

          if (statsField.calculateStat(X)) {
                X = calculateX()
          }
          

          The good part is that we don’t need to create classes for every single stat, that may have to be generic in some way depending on the field type.
          I’m uploading a patch with the idea, it’s not even close to complete, just to show the idea (see only the NumericStatsValues implementation).
          There are tests failing in StatsComponent that I haven’t looked at yet.

          Show
          Tomás Fernández Löbbe added a comment - What about doing something like this: StatsFields knows about the stats to calculate and the stats include in response. For the basic “standalone” statistics, those sets are equal, but when a statistic requires others, the “calculate” set includes the depended stats. All stats in the “calculate” set are calculated. All stats in the “include in response” set are responded. In the case of a shard request, all stats in the “calculate” set are also included in the response. StatsField gets the local params from the request, the StatsValues ask the statsField instance before calculating a stat / including in the response. The bad part of it is that all the stats need the if (statsField.calculateStat(X)) { X = calculateX() } The good part is that we don’t need to create classes for every single stat, that may have to be generic in some way depending on the field type. I’m uploading a patch with the idea, it’s not even close to complete, just to show the idea (see only the NumericStatsValues implementation). There are tests failing in StatsComponent that I haven’t looked at yet.
          Hide
          Tomás Fernández Löbbe added a comment -

          Sorry, this is the correct version of the patch

          Show
          Tomás Fernández Löbbe added a comment - Sorry, this is the correct version of the patch
          Hide
          Tomás Fernández Löbbe added a comment -

          The failing tests were related to the fact that I now include stats even for count=0. For now I modified them to pass, but should be fixed based on what we decide to do in that case.
          I added some more tests, all for numeric stats for now.

          Show
          Tomás Fernández Löbbe added a comment - The failing tests were related to the fact that I now include stats even for count=0. For now I modified them to pass, but should be fixed based on what we decide to do in that case. I added some more tests, all for numeric stats for now.
          Hide
          Xu Zhang added a comment -

          +1 for Tomas's idea.

          I tried to improve it a little bit by using bit wise operation, but no big difference.
          And modified the Individual test in Tomas's patch, which basically test all combinations of all individual stats.

          Show
          Xu Zhang added a comment - +1 for Tomas's idea. I tried to improve it a little bit by using bit wise operation, but no big difference. And modified the Individual test in Tomas's patch, which basically test all combinations of all individual stats.
          Hide
          Tomás Fernández Löbbe added a comment -

          Thanks Xu

          I tried to improve it a little bit by using bit wise operation, but no big difference.

          Yes, my understanding is that there should not be much of a difference in this case. I think EnumSet is a bit more clear

          Chris Hostetter (Unused) Any thoughts on this path? I'll change the other stat types soon

          Show
          Tomás Fernández Löbbe added a comment - Thanks Xu I tried to improve it a little bit by using bit wise operation, but no big difference. Yes, my understanding is that there should not be much of a difference in this case. I think EnumSet is a bit more clear Chris Hostetter (Unused) Any thoughts on this path? I'll change the other stat types soon
          Hide
          Hoss Man added a comment -

          Tomas: thanks a lot for working on this.

          I have mixed feelings about your approach...

          • -1 to the way StatsField nows has to directly know about every possible stat for every possible data type, & the stat dependency logic is spread between the StatsValue classes and the new Enum in StatsField.
          • -0 to making it harder to add new types of stats in the future (one of the nice side effects of my hypothetical idea was that if we did go an SPI route or something like that it would have made it trivial for people to add new types of stats as plugins, even depending on other stats for distributed logic)
          • -1 for the need to have that ...
            if (statsField.calculateStat(X)) { 
              X = calculateX() 
            }
            

            ...pattern you mentioned in so much code – that's one of the reasons i abandomed my last patch (and before i abandoned it, i was focusingon trying to ensure that it was at least always a comarison with a final boolean in the hops that the JVM could optimize the if away)

          • +99 to the fact that you have something that actually works as opposed to my half-thought out idea that i never tried to implement.

          On balance, i think we should move forward with your idea – if nothing else, then as a straw man to help us flesh out the exact expected behavior & write more tests. If down the road we come up with a better internal implementation, then so be it.

          A few specific comments on what you've got so far...

          • kind of weird the way your patch's API uses a variety of Stat[], EnumSet<Stat> and Set<Stat> ... probably best just to use EnumSet everywhere?
          • be careful about your "statsInResponse.isEmpty() || statsInResponse.contains(stat)" logic ... we need to make sure we don't break existing behavior for things like stats.field=foo&stats.calcDistinct=true
            • speaking of calcDistinct, it needs to be accounted for in your new enum so we can start supporting it as a localparam and deprecate the top level stats.calcDistinct, maybe along the lines of...
              if (statsInResponse.isEmpty()) {
                statsInResponse.addAll(LEGACY_DEFAULT_STATS); // static final EnumSet
                statsToCalculate.addAll(LEGACY_DEFAULT_STATS_DEPENDS); // static {} built EnumSet looping over LEGACY_DEFAULT_STATS 
              }
              if (params.getFieldBool(f, STATS_CALCDISTINCT, false)) { // top level req params
                statsInResponse.add(Stat.calcDistinct);
                statsToCalculate.add(Stat.calcDistinct);
              }
              
          • i'm not positive - but i don't think your patch currently accounts for the idea that in a distributed request, we may need to calculate & return a stats dependencies, but not compute & return the stat itself (ie: for mean, we need each shard to compute a sum & count, but we don't wnat each shard to compute or return the per-shard mean)
            • we should be abl to easily test this behavior by "faking" an isShard request to a single node and asserting which keys we do/don't get back
          • rather then a static dependsOn(Stat) method with a case statement, why not make it a property of of the enum objects themselves?
            • maybe something like this, which also shows one idea for dealing with the "stat doesn't depend on itself in distributed calculations"...
                static enum Stat {
                  min(true),
                  max(true),
                  missing(true),
                  sum(true),
                  count(true),
                  mean(false, sum, count),
                  sumOfSquares(true),
                  stddev(false,sum,count,sumOfSquares);
                  
                  private final List<Stat> distribDeps;
              
                  Stat(boolean selfDep, Stat... deps) {
                    distribDeps = new ArrayList<Stat>(deps.length+1);
                    distribDeps.addAll(Arrays.asList(deps));
                    if (selfDep) { 
                      distribDeps.add(this);
                    }
                  }
                  public EnumSet<Stat> getDistribDeps() {
                    return EnumSet.copyOf(this.distribDeps);
                  }
                }
              
          • please help me fight against the trend of distributed tests that only do comparisons against single node w/o asserting specific results, ie...
            +    // only ask for "min" and "mean"
            +    query("q","*:*", "sort",i1+" desc", "stats", "true",
            +          "stats.field", "{!min=true mean=true}" + i1);
            +    
            +    // only ask for "min", "mean" and "stddev"
            +    query("q","*:*", "sort",i1+" desc", "stats", "true",
            +          "stats.field", "{!min=true mean=true stddev=true}" + i1);
            +    
            +    String[] stats = new String[]{"min", "max", "sum", "sumOfSquares", "stddev", "mean", "missing", "count"};
            +    
            +    for (String stat:stats) {
            +      for (String innerStat:stats) {
            +        query("q","*:*", "sort",i1+" desc", "stats", "true",
            +            "stats.field", "{!" + stat + "=true " + innerStat + "=true}" + i1);
            +      }
            +    }
            +    
            

            ...all of those can & should include xpaths asserting that the count() of keys in the stats is only N, and that the expected keys exist (and in the case of the first 2: we should be able to assert the expected value)


          The failing tests were related to the fact that I now include stats even for count=0. For now I modified them to pass, but should be fixed based on what we decide to do in that case.

          What's your opinion on this? mine hasn't really changed...

          ... the "correct" behavior for this query would be to get a stats block back for those fields where things like min/max/mean are "null", count==0, and missing=1 ...

          tht would be really easy to do – the only minor hitch is that i think the code right now (assuming count==0) winds up giving you things like -/+Infinity for min/max instead of null – so we'd have to start tracking a boolean of wether we'd collected any values at all.

          Show
          Hoss Man added a comment - Tomas: thanks a lot for working on this. I have mixed feelings about your approach... -1 to the way StatsField nows has to directly know about every possible stat for every possible data type, & the stat dependency logic is spread between the StatsValue classes and the new Enum in StatsField. -0 to making it harder to add new types of stats in the future (one of the nice side effects of my hypothetical idea was that if we did go an SPI route or something like that it would have made it trivial for people to add new types of stats as plugins, even depending on other stats for distributed logic) -1 for the need to have that ... if (statsField.calculateStat(X)) { X = calculateX() } ...pattern you mentioned in so much code – that's one of the reasons i abandomed my last patch (and before i abandoned it, i was focusingon trying to ensure that it was at least always a comarison with a final boolean in the hops that the JVM could optimize the if away) +99 to the fact that you have something that actually works as opposed to my half-thought out idea that i never tried to implement. On balance, i think we should move forward with your idea – if nothing else, then as a straw man to help us flesh out the exact expected behavior & write more tests. If down the road we come up with a better internal implementation, then so be it. A few specific comments on what you've got so far... kind of weird the way your patch's API uses a variety of Stat[], EnumSet<Stat> and Set<Stat> ... probably best just to use EnumSet everywhere? be careful about your " statsInResponse.isEmpty() || statsInResponse.contains(stat) " logic ... we need to make sure we don't break existing behavior for things like stats.field=foo&stats.calcDistinct=true speaking of calcDistinct, it needs to be accounted for in your new enum so we can start supporting it as a localparam and deprecate the top level stats.calcDistinct , maybe along the lines of... if (statsInResponse.isEmpty()) { statsInResponse.addAll(LEGACY_DEFAULT_STATS); // static final EnumSet statsToCalculate.addAll(LEGACY_DEFAULT_STATS_DEPENDS); // static {} built EnumSet looping over LEGACY_DEFAULT_STATS } if (params.getFieldBool(f, STATS_CALCDISTINCT, false )) { // top level req params statsInResponse.add(Stat.calcDistinct); statsToCalculate.add(Stat.calcDistinct); } i'm not positive - but i don't think your patch currently accounts for the idea that in a distributed request, we may need to calculate & return a stats dependencies, but not compute & return the stat itself (ie: for mean, we need each shard to compute a sum & count, but we don't wnat each shard to compute or return the per-shard mean) we should be abl to easily test this behavior by "faking" an isShard request to a single node and asserting which keys we do/don't get back rather then a static dependsOn(Stat) method with a case statement, why not make it a property of of the enum objects themselves? maybe something like this, which also shows one idea for dealing with the "stat doesn't depend on itself in distributed calculations"... static enum Stat { min( true ), max( true ), missing( true ), sum( true ), count( true ), mean( false , sum, count), sumOfSquares( true ), stddev( false ,sum,count,sumOfSquares); private final List<Stat> distribDeps; Stat( boolean selfDep, Stat... deps) { distribDeps = new ArrayList<Stat>(deps.length+1); distribDeps.addAll(Arrays.asList(deps)); if (selfDep) { distribDeps.add( this ); } } public EnumSet<Stat> getDistribDeps() { return EnumSet.copyOf( this .distribDeps); } } please help me fight against the trend of distributed tests that only do comparisons against single node w/o asserting specific results, ie... + // only ask for "min" and "mean" + query("q","*:*", "sort",i1+" desc", "stats", "true", + "stats.field", "{!min=true mean=true}" + i1); + + // only ask for "min", "mean" and "stddev" + query("q","*:*", "sort",i1+" desc", "stats", "true", + "stats.field", "{!min=true mean=true stddev=true}" + i1); + + String[] stats = new String[]{"min", "max", "sum", "sumOfSquares", "stddev", "mean", "missing", "count"}; + + for (String stat:stats) { + for (String innerStat:stats) { + query("q","*:*", "sort",i1+" desc", "stats", "true", + "stats.field", "{!" + stat + "=true " + innerStat + "=true}" + i1); + } + } + ...all of those can & should include xpaths asserting that the count() of keys in the stats is only N, and that the expected keys exist (and in the case of the first 2: we should be able to assert the expected value) The failing tests were related to the fact that I now include stats even for count=0. For now I modified them to pass, but should be fixed based on what we decide to do in that case. What's your opinion on this? mine hasn't really changed... ... the "correct" behavior for this query would be to get a stats block back for those fields where things like min/max/mean are "null", count==0, and missing=1 ... tht would be really easy to do – the only minor hitch is that i think the code right now (assuming count==0) winds up giving you things like -/+Infinity for min/max instead of null – so we'd have to start tracking a boolean of wether we'd collected any values at all.
          Hide
          Tomás Fernández Löbbe added a comment -

          I'll iterate this patch to improve what you are mentioning. I'll probably won't have time this week to work on this, so I'll take it next week, or if someone else wants to continue, please feel free.

          Show
          Tomás Fernández Löbbe added a comment - I'll iterate this patch to improve what you are mentioning. I'll probably won't have time this week to work on this, so I'll take it next week, or if someone else wants to continue, please feel free.
          Hide
          Xu Zhang added a comment -

          Made some adjustments based on Hoss's comments.

          1. Only use EnumSet now.
          2. Move calcDistinct into Enum.
          But the question is, do we consider calcDistinct is a local parameter like min/max, or it should be the top-level parameter. (Which keeps the existing behavior for things like stats.field=foo&stats.calcDistinct=true)

          3. Support distributed request, add faking test.
          4. Remove static dependsOn(Stat) methods, use Enum property instead.

          Show
          Xu Zhang added a comment - Made some adjustments based on Hoss's comments. 1. Only use EnumSet now. 2. Move calcDistinct into Enum. But the question is, do we consider calcDistinct is a local parameter like min/max, or it should be the top-level parameter. (Which keeps the existing behavior for things like stats.field=foo&stats.calcDistinct=true) 3. Support distributed request, add faking test. 4. Remove static dependsOn(Stat) methods, use Enum property instead.
          Hide
          Hoss Man added a comment -

          Xu:

          I havne't had a chance to review your patch, but it sounds like awesome progress – in respose to your specific question...

          do we consider calcDistinct is a local parameter like min/max, or it should be the top-level parameter. (Which keeps the existing behavior for things like stats.field=foo&stats.calcDistinct=true)

          ...the key is that we need to support both. as i mentioned previously...

          deprecate stats.calcdistinct but use it as a default for the new corresponding localparam(s)

          and note the psuedo code i postd in my last comment...

          if (statsInResponse.isEmpty()) {
            statsInResponse.addAll(LEGACY_DEFAULT_STATS); // static final EnumSet
            statsToCalculate.addAll(LEGACY_DEFAULT_STATS_DEPENDS); // static {} built EnumSet looping over LEGACY_DEFAULT_STATS 
          }
          if (params.getFieldBool(f, STATS_CALCDISTINCT, false)) { // top level req params
            statsInResponse.add(Stat.calcDistinct);
            statsToCalculate.add(Stat.calcDistinct);
          }
          

          to give some concrete examples, if we assume that "foo" is a numeric field, (where stats.field=foo currently returns min/max/missing/sum/count/mean/sumOfSquares/stddev) then these are the results you should get in various situations...

          1) stats.field=foo
          or stats.field=foo&stats.calcDistinct=false
          or stats.field=foo&f.foo.stats.calcDistinct=false
          or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true}foo
          or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true}foo&f.foo.stats.calcDistinct=false
          or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=false}foo&f.foo.stats.calcDistinct=true
          
          => min + max + missing + sum + count + mean + sumOfSquares + stddev
          
          ----
          
          2) stats.field=foo&stats.calcDistinct=true
          or stats.field=foo&f.foo.stats.calcDistinct=true
          or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=true}foo
          or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=true}foostats.calcDistinct=false
          or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=true}foo&f.foo.stats.calcDistinct=false
          
          => min + max + missing + sum + count + mean + sumOfSquares + stddev + calcDistinct
          
          ----
          
          3) stats.field={!calcDistinct=true}foo
          or stats.field={!calcDistinct=true}foo&stats.calcDistinct=false
          or stats.field={!calcDistinct=true}foo&f.foo.stats.calcDistinct=false
          
          => calcDistinct
          
          ----
          
          3) stats.field={!min=true}foo&stats.calcDistinct=true
          or stats.field={!min=true calcDistinct=true}foo&stats.calcDistinct=false
          or stats.field={!min=true calcDistinct=true}foo&f.foo.stats.calcDistinct=false
          
          => min + calcDistinct
          
          

          does that make sense?

          Show
          Hoss Man added a comment - Xu: I havne't had a chance to review your patch, but it sounds like awesome progress – in respose to your specific question... do we consider calcDistinct is a local parameter like min/max, or it should be the top-level parameter. (Which keeps the existing behavior for things like stats.field=foo&stats.calcDistinct=true) ...the key is that we need to support both. as i mentioned previously... deprecate stats.calcdistinct but use it as a default for the new corresponding localparam(s) and note the psuedo code i postd in my last comment... if (statsInResponse.isEmpty()) { statsInResponse.addAll(LEGACY_DEFAULT_STATS); // static final EnumSet statsToCalculate.addAll(LEGACY_DEFAULT_STATS_DEPENDS); // static {} built EnumSet looping over LEGACY_DEFAULT_STATS } if (params.getFieldBool(f, STATS_CALCDISTINCT, false )) { // top level req params statsInResponse.add(Stat.calcDistinct); statsToCalculate.add(Stat.calcDistinct); } to give some concrete examples, if we assume that "foo" is a numeric field, (where stats.field=foo currently returns min/max/missing/sum/count/mean/sumOfSquares/stddev) then these are the results you should get in various situations... 1) stats.field=foo or stats.field=foo&stats.calcDistinct=false or stats.field=foo&f.foo.stats.calcDistinct=false or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true}foo or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true}foo&f.foo.stats.calcDistinct=false or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=false}foo&f.foo.stats.calcDistinct=true => min + max + missing + sum + count + mean + sumOfSquares + stddev ---- 2) stats.field=foo&stats.calcDistinct=true or stats.field=foo&f.foo.stats.calcDistinct=true or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=true}foo or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=true}foostats.calcDistinct=false or stats.field={!min=true max=true missing=true sum=true count=true mean=true sumOfSquares=true stddev=true calcDistinct=true}foo&f.foo.stats.calcDistinct=false => min + max + missing + sum + count + mean + sumOfSquares + stddev + calcDistinct ---- 3) stats.field={!calcDistinct=true}foo or stats.field={!calcDistinct=true}foo&stats.calcDistinct=false or stats.field={!calcDistinct=true}foo&f.foo.stats.calcDistinct=false => calcDistinct ---- 3) stats.field={!min=true}foo&stats.calcDistinct=true or stats.field={!min=true calcDistinct=true}foo&stats.calcDistinct=false or stats.field={!min=true calcDistinct=true}foo&f.foo.stats.calcDistinct=false => min + calcDistinct does that make sense?
          Hide
          Xu Zhang added a comment -

          Thanks so much, totally agree.

          I will try to update Tomas's change and tests this evening.

          Show
          Xu Zhang added a comment - Thanks so much, totally agree. I will try to update Tomas's change and tests this evening.
          Hide
          Xu Zhang added a comment -

          Super naive implementation, but should work.

          Added more tests around calcDistinct, basically tests every case in the Hoss's comment.

          Show
          Xu Zhang added a comment - Super naive implementation, but should work. Added more tests around calcDistinct, basically tests every case in the Hoss's comment.
          Hide
          Xu Zhang added a comment -

          Updated:
          1. Use lower case calcdistinct in local parameters, as the same with the one in top level parameter.
          2. Clean unneeded methods in UnInvertedField class.

          Pass all tests.

          Show
          Xu Zhang added a comment - Updated: 1. Use lower case calcdistinct in local parameters, as the same with the one in top level parameter. 2. Clean unneeded methods in UnInvertedField class. Pass all tests.
          Hide
          Oliver Mannion added a comment - - edited

          Can this patch allow output of countDistinct but not distinctValues? We have this requirement, as the distinctValues field is not needed and its inclusion increases the response size dramatically (to the point where it becomes too slow to process).

          Show
          Oliver Mannion added a comment - - edited Can this patch allow output of countDistinct but not distinctValues? We have this requirement, as the distinctValues field is not needed and its inclusion increases the response size dramatically (to the point where it becomes too slow to process).
          Hide
          Hoss Man added a comment -

          Can this patch allow output of countDistinct but not distinctValues?

          i don't think we should tackle that as part of this issue - it's already fairly complicated w/o introducing new permutations of options.

          i think the best approach would be to leave "calcDistinct" alone as it is now but deprecate/discourage it andmove towards adding an entirely new stats option for computing an aproximated count using hyperloglog (i opened a new issue for this: SOLR-6968)

          Show
          Hoss Man added a comment - Can this patch allow output of countDistinct but not distinctValues? i don't think we should tackle that as part of this issue - it's already fairly complicated w/o introducing new permutations of options. i think the best approach would be to leave "calcDistinct" alone as it is now but deprecate/discourage it andmove towards adding an entirely new stats option for computing an aproximated count using hyperloglog (i opened a new issue for this: SOLR-6968 )
          Hide
          Hoss Man added a comment -

          Starting to get back into this, here's a quick checkpoint of some small progress

          Step #1: This new patch brings Xu's latest patch up to date with trunk using the minimal changes that seemed to work – in particular: I haven't started really digging into the code changes other then getting things to compile & tests to pass.

          Step #2...

          My main focus for now is making sure the tests are rock solid & all inclusive so we can then iterate on the code changes (see early comments about my cocerns with spreading hte logic arround). Only 2 noticable changes in this patch...

          • Fixed FacetPivotSmallTest.testPivotFacetStatsUnsortedTagged
            • was prematurely specifying 'mean=true' but then trying to assert that all stats were returned
            • beefed this up to also assert that it got an expected number of stats - if we add more stats in the future, this will be a canary that the test needs updated to assert the correct values for these new stats.
          • StatsComponentTest
            • added more asserts to the 3 testFieldStatisticsResults_TYPE_FieldAlwaysMissing to ensure expected values for all stats (when there is nothing to compute stats on)...
              // numerics & strings & dates
              min=null
              max=null
              // just numerics
              sum=0.0
              sumOfSquares=0.0
              stddev=0.0
              mean=NaN
              
              • these are based on the current behavior of the code ... my initial gut reaction was that they should all be null, but a quick bit of research says that in maths the "empty sum" is defined as "0" – if you start with that premise, then the values for the rest seems correct to me, but i'm definitely interested in knowing if there are contrary opinions (is NaN better?)
            • included "expected number of stats" asserts in these tests as well - more canary's if/when future stats are added.
          Show
          Hoss Man added a comment - Starting to get back into this, here's a quick checkpoint of some small progress Step #1: This new patch brings Xu's latest patch up to date with trunk using the minimal changes that seemed to work – in particular: I haven't started really digging into the code changes other then getting things to compile & tests to pass. Step #2... My main focus for now is making sure the tests are rock solid & all inclusive so we can then iterate on the code changes (see early comments about my cocerns with spreading hte logic arround). Only 2 noticable changes in this patch... Fixed FacetPivotSmallTest.testPivotFacetStatsUnsortedTagged was prematurely specifying 'mean=true' but then trying to assert that all stats were returned beefed this up to also assert that it got an expected number of stats - if we add more stats in the future, this will be a canary that the test needs updated to assert the correct values for these new stats. StatsComponentTest added more asserts to the 3 testFieldStatisticsResults_TYPE_FieldAlwaysMissing to ensure expected values for all stats (when there is nothing to compute stats on)... // numerics & strings & dates min=null max=null // just numerics sum=0.0 sumOfSquares=0.0 stddev=0.0 mean=NaN these are based on the current behavior of the code ... my initial gut reaction was that they should all be null, but a quick bit of research says that in maths the "empty sum" is defined as "0" – if you start with that premise, then the values for the rest seems correct to me, but i'm definitely interested in knowing if there are contrary opinions (is NaN better?) included "expected number of stats" asserts in these tests as well - more canary's if/when future stats are added.
          Hide
          Hoss Man added a comment -

          some mor progress auditing & enhancing the test changes in this patch...

          • SolrTestCaseJ4
            • revert unneccssary whitespace change
          • TestDistributedSearch
            • make new test queries include hard asserts on the values returned
              • even when looping over permutations of multiple stats, assert we don't get too many diff stats back
              • this also asserts that SolrJ is well behalved for these partial stat requests (ie: 'null' is returned for the stats we didn't ask for)
            • added a request + asserts of specific stats over a field that doesn't exist
          Show
          Hoss Man added a comment - some mor progress auditing & enhancing the test changes in this patch... SolrTestCaseJ4 revert unneccssary whitespace change TestDistributedSearch make new test queries include hard asserts on the values returned even when looping over permutations of multiple stats, assert we don't get too many diff stats back this also asserts that SolrJ is well behalved for these partial stat requests (ie: 'null' is returned for the stats we didn't ask for) added a request + asserts of specific stats over a field that doesn't exist
          Hide
          Hoss Man added a comment -

          some more udpates to the patch...

          • StatsComponentTest
            • undid an odd calcDistinct param change in testFieldStatisticsDocValuesAndMultiValuedIntegerFacetStats that shouldn't affect the test goal
              • want to ensure the behavior in this test isn't broken by changes
            • fixed testFieldStatisticsDocValuesAndMultiValuedDouble
              • was doing stats.field twice in same request diff ways, but only checking one
              • changed to do 2 explicit requests and assert results are the same
              • added in canary assert for future numeric stats
            • testIndividualStatLocalParams
              • added a canary assert to protect us against new stats in the future w/o updating the test
                • canary helped catch that we weren't testing calcdistinct in these permutations
              • added some sanity checks of localparams with 'false' values inspired by bug i found in testCalcDistinctStats
                • see question & nocommit (below)
              • added comment explaining point of isShard queries as best i understanding, see question & nocommit (below)
              • fixed asserts to play nice with calcdistinct excentricities
            • iterateParaCombination
              • kind of hard to understand what this is doing and how it works because of recursive nature
              • definitely need to replace magic number "8" since that is brittle against future stats and already doesn't jive with num of legal Stat params (missing calcdistinct)
              • in general, i want to refactor this to replace it with commons-math's Combinations class - i'll look into that tomorow
            • testCalcDistinctStats
              • part of the importance here is in the equivilence relationships - so i refactored each of the equivilent asesrt conditions to be a single assert inside a loop over the params.
                • this helps protect us against someone later thinking it's okay to change one assert w/o changing all of the equivilences
              • also simplified asserts to be less brittle: assert if expected stat keys are in response or not – not number of stat keys returned (which might change in future if more default stats added – in the case of these asserts, that isn't really relevant, what we care about here is the behavior of interaction of the various ways to request calcdistinct)
              • added some more permutations to each of the equivilences sets
                • this lead to discovring a semantics question so far undiscussed as far as what if only one stat is specified in a local param, but it's value is 'false' (see below)
              • fixed some mistakes in formatting params (eg: "f.a_i.stats.calcdistinct=false)"="true")
              • these changes let me eliminate the unused expectedStats and expectedType maps from this method (there were virtually unused cut/paste cruft prior to these changes anyway)
          • TestDistributedSearch
            • added some asserts that the (distrib) handling of calcdistinct works a variety of ways
          • DistributedFacetPivotSmallTest
            • small update to assert that when stats hang off pivots the local params correctly control which stats are returned.

          New Questions...

          • see nocommit in StatsComponentTest.testIndividualStatLocalParams - why the double loop here?
               // whitebox test: explicitly ask for isShard=true with an individual stat
               //
               // :nocommit: why loop over every stat to get it's deps and then loop over them? ...
               // ... isn't it enough to loop over the set of all known stats?
               for (Stat statAsk:expectedStats.keySet()) {
                 EnumSet <Stat> statAskEvaluate = statAsk.getDistribDeps();
            
                 for (Stat stat : statAskEvaluate){
            
          • how should these two queries behave...
            a) stats = true & stats.field = {!key=k min='false'}a_i
            
            b) stats = true & stats.field = {!key=k min='false'}a_i & stats.calcdistinct = true
            

            ...my gut says that for (a) the stats result set should be completley empty; and for (b) only countDistinct and distinctValues should be returned; – because in both cases because the use of localparams regardless of values should indicate that the implicit set of default stats should be ignored, and only explicitly requested stats should be returned – but the only explicity mentioned stat via localparams is deliberately disabled with 'false'. at the moment however, both of these cases returns all of the implicit default stats (see nocommits in tests) – we'll need to fix this

          Show
          Hoss Man added a comment - some more udpates to the patch... StatsComponentTest undid an odd calcDistinct param change in testFieldStatisticsDocValuesAndMultiValuedIntegerFacetStats that shouldn't affect the test goal want to ensure the behavior in this test isn't broken by changes fixed testFieldStatisticsDocValuesAndMultiValuedDouble was doing stats.field twice in same request diff ways, but only checking one changed to do 2 explicit requests and assert results are the same added in canary assert for future numeric stats testIndividualStatLocalParams added a canary assert to protect us against new stats in the future w/o updating the test canary helped catch that we weren't testing calcdistinct in these permutations added some sanity checks of localparams with 'false' values inspired by bug i found in testCalcDistinctStats see question & nocommit (below) added comment explaining point of isShard queries as best i understanding, see question & nocommit (below) fixed asserts to play nice with calcdistinct excentricities iterateParaCombination kind of hard to understand what this is doing and how it works because of recursive nature definitely need to replace magic number "8" since that is brittle against future stats and already doesn't jive with num of legal Stat params (missing calcdistinct) in general, i want to refactor this to replace it with commons-math's Combinations class - i'll look into that tomorow testCalcDistinctStats part of the importance here is in the equivilence relationships - so i refactored each of the equivilent asesrt conditions to be a single assert inside a loop over the params. this helps protect us against someone later thinking it's okay to change one assert w/o changing all of the equivilences also simplified asserts to be less brittle: assert if expected stat keys are in response or not – not number of stat keys returned (which might change in future if more default stats added – in the case of these asserts, that isn't really relevant, what we care about here is the behavior of interaction of the various ways to request calcdistinct) added some more permutations to each of the equivilences sets this lead to discovring a semantics question so far undiscussed as far as what if only one stat is specified in a local param, but it's value is 'false' (see below) fixed some mistakes in formatting params (eg: "f.a_i.stats.calcdistinct=false)"="true") these changes let me eliminate the unused expectedStats and expectedType maps from this method (there were virtually unused cut/paste cruft prior to these changes anyway) TestDistributedSearch added some asserts that the (distrib) handling of calcdistinct works a variety of ways DistributedFacetPivotSmallTest small update to assert that when stats hang off pivots the local params correctly control which stats are returned. New Questions... see nocommit in StatsComponentTest.testIndividualStatLocalParams - why the double loop here? // whitebox test: explicitly ask for isShard=true with an individual stat // // :nocommit: why loop over every stat to get it's deps and then loop over them? ... // ... isn't it enough to loop over the set of all known stats? for (Stat statAsk:expectedStats.keySet()) { EnumSet <Stat> statAskEvaluate = statAsk.getDistribDeps(); for (Stat stat : statAskEvaluate){ how should these two queries behave... a) stats = true & stats.field = {!key=k min='false'}a_i b) stats = true & stats.field = {!key=k min='false'}a_i & stats.calcdistinct = true ...my gut says that for (a) the stats result set should be completley empty; and for (b) only countDistinct and distinctValues should be returned; – because in both cases because the use of localparams regardless of values should indicate that the implicit set of default stats should be ignored, and only explicitly requested stats should be returned – but the only explicity mentioned stat via localparams is deliberately disabled with 'false'. at the moment however, both of these cases returns all of the implicit default stats (see nocommits in tests) – we'll need to fix this
          Hide
          Hoss Man added a comment -

          in general, i want to refactor this to replace it with commons-math's Combinations class - i'll look into that tomorow

          This patch replaces the confusing (to me) recursive method for testing all possible combinations of params with simple iteration using commons-math's Combinations iterator.

          also new in this patch: tweaks to DistributedFacetPivotSmallAdvancedTest to assert that when stats hang off pivots the local params correctly control which stats are returned ... even when refinement happens (this test already forces that condition).

          Show
          Hoss Man added a comment - in general, i want to refactor this to replace it with commons-math's Combinations class - i'll look into that tomorow This patch replaces the confusing (to me) recursive method for testing all possible combinations of params with simple iteration using commons-math's Combinations iterator. also new in this patch: tweaks to DistributedFacetPivotSmallAdvancedTest to assert that when stats hang off pivots the local params correctly control which stats are returned ... even when refinement happens (this test already forces that condition).
          Hide
          Hoss Man added a comment -

          SOLR-6349

          I did a bit of crude benchmarking this morning with the following two uses cases in mind:

          • user currently asks for stats on fields, cares about all 8 of the stats
          • user currently asks for stats on fields, only cares about 4of8 of them

          the attached script shows my methodology – it generates a CSV file with 10 million docs + 2 bash files that use curl to hit Solr with 300 : query urls using randomly selected stats.field. the sequence of stat field requests are identicle between the 2 bash files, but in one URLs include localparams to only compute min/max/mean/stddev for the field.

          Here's the results...

          NOW     BASELINE: 126.008 seconds (ie: all stats ... queries-old.sh)
          
          PATCH  ALL STATS: 133.571 seconds (6% slower ... queries-old.sh)
          PATCH FOUR STATS: 130.515 seconds (3% slower ... queries-new.sh)
          

          So not only has asking for all stats on a field gotten slower with this patch, but even asking for only 4 of the 8 possible numeric stats on a field is still slower then the existing code when all of them are returned.

          A key thing to note here is that this is the total wall clock time from the perspective of the client, including reading the response from Solr. Not only are we (in theory) computing only only 1/2 as much math per request in the "FOUR STATS" situation, the XML response size of each query is only ~3/4ths the size of the original queryies. This should mean a lot less time both in processing the results and in writing/reading the data over the wire ... and yet instead of seeing some perf improvements, we see performance suffer.

          I suspect a key factor here goes back to one of the concerns i mentioned earlier...

          if (statsField.calculateStat(X)) { 
            X = calculateX() 
          }
          

          ...pattern you mentioned in so much code - that's one of the reasons i abandomed my last patch (and before i abandoned it, i was focusingon trying to ensure that it was at least always a comarison with a final boolean in the hops that the JVM could optimize the if away)

          ...the cumulative overhead of those method calls for every possible stat is probably counter acting any gains made by reducing the stats that are computed.


          My next step is to focus on fixing the current patch code so the few remaining nocommit assertions in the test start passing (see earlier comments re "min='false'") – but once the behavior is locked down and solid i think we really need to re-assess and re-factor the code to see some perf gains before there's any point in moving towards adding this feature.

          (NOTE: if anyone spots any flaws in my little mini-benchmark, please speak up – i would be very happy to be wrong)

          Show
          Hoss Man added a comment - SOLR-6349 I did a bit of crude benchmarking this morning with the following two uses cases in mind: user currently asks for stats on fields, cares about all 8 of the stats user currently asks for stats on fields, only cares about 4of8 of them the attached script shows my methodology – it generates a CSV file with 10 million docs + 2 bash files that use curl to hit Solr with 300 : query urls using randomly selected stats.field. the sequence of stat field requests are identicle between the 2 bash files, but in one URLs include localparams to only compute min/max/mean/stddev for the field. Here's the results... NOW BASELINE: 126.008 seconds (ie: all stats ... queries-old.sh) PATCH ALL STATS: 133.571 seconds (6% slower ... queries-old.sh) PATCH FOUR STATS: 130.515 seconds (3% slower ... queries-new.sh) So not only has asking for all stats on a field gotten slower with this patch, but even asking for only 4 of the 8 possible numeric stats on a field is still slower then the existing code when all of them are returned. A key thing to note here is that this is the total wall clock time from the perspective of the client, including reading the response from Solr. Not only are we (in theory) computing only only 1/2 as much math per request in the "FOUR STATS" situation, the XML response size of each query is only ~3/4ths the size of the original queryies. This should mean a lot less time both in processing the results and in writing/reading the data over the wire ... and yet instead of seeing some perf improvements, we see performance suffer. I suspect a key factor here goes back to one of the concerns i mentioned earlier... if (statsField.calculateStat(X)) { X = calculateX() } ...pattern you mentioned in so much code - that's one of the reasons i abandomed my last patch (and before i abandoned it, i was focusingon trying to ensure that it was at least always a comarison with a final boolean in the hops that the JVM could optimize the if away) ...the cumulative overhead of those method calls for every possible stat is probably counter acting any gains made by reducing the stats that are computed. My next step is to focus on fixing the current patch code so the few remaining nocommit assertions in the test start passing (see earlier comments re "min='false'") – but once the behavior is locked down and solid i think we really need to re-assess and re-factor the code to see some perf gains before there's any point in moving towards adding this feature. (NOTE: if anyone spots any flaws in my little mini-benchmark, please speak up – i would be very happy to be wrong)
          Hide
          Hoss Man added a comment -

          Two steps forward, one step back: I fixed the bugs i mentioned before dealing with localparam stats that are all 'false', and cleaned up a bit of the stats returned/calculated logic – this lead me to understanding that the "whitebox" isShard test i was confused by before was definitely broken (see below) ... i tried to use SOLR-7147 to do a true whitebox test and inspect real shard requests/responses, but got blocked by some bugs in the test framework

          Full notes about changes in this patch...

          • StatsField
            • add Set<Stat> DEFAULT_STATS
            • populateStatsRequested(...)
              • rename to populateStatsSets() and eliminate args – method is already mucking with other state directly
              • use DEFAULT_STATS if and only if there are no stat localparams (even if false)
            • calculateStats()
              • fix to not assume anything about the empty set - empty set means calculate nothing
          • StatsComponentTest
            • testIndividualStatLocalParams
              • my review & bug fixing in StatsField confirmed that the whitebox testing here wasn bogus - the types of "isShard=true" requests simulated here don't actaully look like what a real shard request looks like. This test was asking for each of the individual stats that appear as in getDistribDeps() of any stat in an isShard request and confirmed that only that single stat was returned – but the actual code doesn't ask for individual distribDeps that way. it asks for the oroginally requested stat and trusts that the isShard indication will cause only the deps to be returned
              • fixed the whitebox queries to simulate the real requests for every stats, and assert all deps come back - even if the original request excluded those deps
          • TestDistributedSearch
            • beefed up testing of distrib requests where the deps of a stat requested by the client are also explicitly excluded by the client (ie: "mean=true sum=false count=false"
            • started down the path of leveraging SOLR-7147 to inspect the shard req/rsp to ensure we only get what we need - but got blocked by SOLR-7171 (see nocommits)
          Show
          Hoss Man added a comment - Two steps forward, one step back: I fixed the bugs i mentioned before dealing with localparam stats that are all 'false', and cleaned up a bit of the stats returned/calculated logic – this lead me to understanding that the "whitebox" isShard test i was confused by before was definitely broken (see below) ... i tried to use SOLR-7147 to do a true whitebox test and inspect real shard requests/responses, but got blocked by some bugs in the test framework Full notes about changes in this patch... StatsField add Set<Stat> DEFAULT_STATS populateStatsRequested(...) rename to populateStatsSets() and eliminate args – method is already mucking with other state directly use DEFAULT_STATS if and only if there are no stat localparams (even if false) calculateStats() fix to not assume anything about the empty set - empty set means calculate nothing StatsComponentTest testIndividualStatLocalParams my review & bug fixing in StatsField confirmed that the whitebox testing here wasn bogus - the types of "isShard=true" requests simulated here don't actaully look like what a real shard request looks like. This test was asking for each of the individual stats that appear as in getDistribDeps() of any stat in an isShard request and confirmed that only that single stat was returned – but the actual code doesn't ask for individual distribDeps that way. it asks for the oroginally requested stat and trusts that the isShard indication will cause only the deps to be returned fixed the whitebox queries to simulate the real requests for every stats, and assert all deps come back - even if the original request excluded those deps TestDistributedSearch beefed up testing of distrib requests where the deps of a stat requested by the client are also explicitly excluded by the client (ie: "mean=true sum=false count=false" started down the path of leveraging SOLR-7147 to inspect the shard req/rsp to ensure we only get what we need - but got blocked by SOLR-7171 (see nocommits)
          Hide
          Hoss Man added a comment -

          Well, after wrapping up SOLR-7171, i came back to this patch and started making the following improvements...

          changes in this patch
          • TestDistributedSearch
            • beefed up inspection of shard responses now that SOLR-7171 is fixed
          • StatsValuesFactory
            • introduce new final booleans to track the stats we want to encourage JIT optimization
            • fix some inconsistencies in when/if various stats are returned to clients depending on values seen (notably with Dates before the epoch)
          • StatsComponentTest
            • testFieldStatisticsResultsDateFieldAlwaysMissing
              • the set of stats that we've returned for Dates in the past has been inconsistent depending on if there were "non-missing" docs ... updated this test to expect everything

          ...after making these changes, i went to re-run that mini-benchmark i wrote before, and in looking at it realized i made a serious conceptual mistake...

          • user currently asks for stats on fields, only cares about 4of8 of them
            ...
            ...the sequence of stat field requests are identicle between the 2 bash files, but in one URLs include localparams to only compute min/max/mean/stddev for the field.

          ...the problem being that because these are distributed requests, the new style requests still require that sum, count & sumOfSquares be computed on every shard (in addition to the min & max) ... the final responses to the query client are smaller, but we're still computing virtually the same amount of math calculations and each shard is returning virtually the same amount of data. (the only calculation "skipped" is the "missing" stat which is actually irrelevant in this test because every doc has a value in every field)

          So i did a quick tweak to make-data-and-queries.pl so that the only stats requested are "min" & "mean" – so the shards only have to compute min, sum, count. With that change, and the new patch, the numbers look much better...

                  pre-patch        with patch       with patch 
          Run #   old style        old style        new style
                  (all stats)      (all stats)      (2 stats, 3 deps)
            1      135.7 sec        134.4 sec        109.9 sec
            2      130.1 sec        132.4 sec        109.0 sec
            3      132.3 sec        132.8 sec        108.2 sec
          total    398.1 sec        399.6 sec        327.1 sec
                                  00.3 % slower    17.8 % faster
          

          ...so these numbers are a lot more promising.

          this also makes me want to run some more perf tests, on more permutations of stats - in particular i want to check the non-cloud mode, make sure we haven't slowed that down.

          Show
          Hoss Man added a comment - Well, after wrapping up SOLR-7171 , i came back to this patch and started making the following improvements... changes in this patch TestDistributedSearch beefed up inspection of shard responses now that SOLR-7171 is fixed StatsValuesFactory introduce new final booleans to track the stats we want to encourage JIT optimization fix some inconsistencies in when/if various stats are returned to clients depending on values seen (notably with Dates before the epoch) StatsComponentTest testFieldStatisticsResultsDateFieldAlwaysMissing the set of stats that we've returned for Dates in the past has been inconsistent depending on if there were "non-missing" docs ... updated this test to expect everything ...after making these changes, i went to re-run that mini-benchmark i wrote before, and in looking at it realized i made a serious conceptual mistake... user currently asks for stats on fields, only cares about 4of8 of them ... ...the sequence of stat field requests are identicle between the 2 bash files, but in one URLs include localparams to only compute min/max/mean/stddev for the field. ...the problem being that because these are distributed requests, the new style requests still require that sum, count & sumOfSquares be computed on every shard (in addition to the min & max) ... the final responses to the query client are smaller, but we're still computing virtually the same amount of math calculations and each shard is returning virtually the same amount of data. (the only calculation "skipped" is the "missing" stat which is actually irrelevant in this test because every doc has a value in every field) So i did a quick tweak to make-data-and-queries.pl so that the only stats requested are "min" & "mean" – so the shards only have to compute min, sum, count. With that change, and the new patch, the numbers look much better... pre-patch with patch with patch Run # old style old style new style (all stats) (all stats) (2 stats, 3 deps) 1 135.7 sec 134.4 sec 109.9 sec 2 130.1 sec 132.4 sec 109.0 sec 3 132.3 sec 132.8 sec 108.2 sec total 398.1 sec 399.6 sec 327.1 sec 00.3 % slower 17.8 % faster ...so these numbers are a lot more promising. this also makes me want to run some more perf tests, on more permutations of stats - in particular i want to check the non-cloud mode, make sure we haven't slowed that down.
          Hide
          Hoss Man added a comment -

          Here's an updated version of make-data-and-queries.pl that:

          • introduces some missing values into each field
          • only hits port 8983 so singled node & cloud can be tested
          • tests more diff types of individual stats and pairs of stats

          My results are below, these numbers make me feel pretty good about the state of the code (although i'm perplexed as to why "min" is consistently slower then other stats that are more numerically complicated like mean & sum).

          ## single node (seconds)
                  trunk    patch...
          Run      all      all      mean   mean_stdv  min      sum    sum_miss
            1     168.22   165.99   112.89   114.39   137.12   107.83   108.01
            2     167.63   166.31   113.28   114.29   136.58   107.72   108.63
            3     168.83   167.45   112.73   114.50   132.84   109.06   108.91
          total   504.68   499.75   338.90   343.18   406.54   324.61   325.55
          
          
          ## two node cloud (seconds)
                  trunk    patch...
          Run      all      all      mean   mean_stdv  min      sum    sum_miss
            1     115.66   115.32    70.27    73.57    90.72    68.06    68.97
            2     111.68   111.16    70.27    71.20    89.62    68.82    67.53
            3     112.16   112.85    70.38    70.47    91.48    68.39    70.00
          total   339.50   339.33   210.92   215.24   271.82   205.27   206.50
          

          ...my next steps:

          • another pass of code review (i kind of glossed over it when doing the refactoring
          • more tests of other field types
            • current patch has detailed tests of individual stats on nmeric fields, but i remember thinking the string,enum, and date code looked brittle and i supsect they have bugs and NPEs when computing individual stats
          • javadocs
          • poke around "min" (and "max") and see if i can figure out why they are slow (probably tangential to this issue, may punt)
          Show
          Hoss Man added a comment - Here's an updated version of make-data-and-queries.pl that: introduces some missing values into each field only hits port 8983 so singled node & cloud can be tested tests more diff types of individual stats and pairs of stats My results are below, these numbers make me feel pretty good about the state of the code (although i'm perplexed as to why "min" is consistently slower then other stats that are more numerically complicated like mean & sum). ## single node (seconds) trunk patch... Run all all mean mean_stdv min sum sum_miss 1 168.22 165.99 112.89 114.39 137.12 107.83 108.01 2 167.63 166.31 113.28 114.29 136.58 107.72 108.63 3 168.83 167.45 112.73 114.50 132.84 109.06 108.91 total 504.68 499.75 338.90 343.18 406.54 324.61 325.55 ## two node cloud (seconds) trunk patch... Run all all mean mean_stdv min sum sum_miss 1 115.66 115.32 70.27 73.57 90.72 68.06 68.97 2 111.68 111.16 70.27 71.20 89.62 68.82 67.53 3 112.16 112.85 70.38 70.47 91.48 68.39 70.00 total 339.50 339.33 210.92 215.24 271.82 205.27 206.50 ...my next steps: another pass of code review (i kind of glossed over it when doing the refactoring more tests of other field types current patch has detailed tests of individual stats on nmeric fields, but i remember thinking the string,enum, and date code looked brittle and i supsect they have bugs and NPEs when computing individual stats javadocs poke around "min" (and "max") and see if i can figure out why they are slow (probably tangential to this issue, may punt)
          Hide
          Hoss Man added a comment -

          but i remember thinking the string,enum, and date code looked brittle ...

          ...looks i already fixed most of what i remember being worried about in my last patch? reviewing this did however job my memory about SOLR-6682 - so i've folded the patch from that issue into here

          Updates since last patch...

          • Incorporated SOLR-6682 in here to help ensure good test coverage of enum stats
          • TestDistributedSearch
            • beef up testing of stat subset permutations on other field types
            • add some trivial sanity check asserts to the test queries added by SOLR-6682
          • StatsComponentTest
            • removed nocommit reminders to sanity check stat values when no values available (haven't seen any complaints)
            • cleaned up some bogus assert msgs from the SOLR-6682 patch
          • StatsValuesFactory
            • EnumStatsValues.updateMinMax
              • my previous code had a bug here using computeMin to update the max stat
          • StatsField
            • clean up some lingering nocommits - mostly related to javadocs
          • add SHA1, LICENSE, and NOTICE for new commons-math3 dependency

          ...i think this is good to go .. going to do another total review of things and aim to commit tomorow.

          Show
          Hoss Man added a comment - but i remember thinking the string,enum, and date code looked brittle ... ...looks i already fixed most of what i remember being worried about in my last patch? reviewing this did however job my memory about SOLR-6682 - so i've folded the patch from that issue into here Updates since last patch... Incorporated SOLR-6682 in here to help ensure good test coverage of enum stats TestDistributedSearch beef up testing of stat subset permutations on other field types add some trivial sanity check asserts to the test queries added by SOLR-6682 StatsComponentTest removed nocommit reminders to sanity check stat values when no values available (haven't seen any complaints) cleaned up some bogus assert msgs from the SOLR-6682 patch StatsValuesFactory EnumStatsValues.updateMinMax my previous code had a bug here using computeMin to update the max stat StatsField clean up some lingering nocommits - mostly related to javadocs add SHA1, LICENSE, and NOTICE for new commons-math3 dependency ...i think this is good to go .. going to do another total review of things and aim to commit tomorow.
          Hide
          ASF subversion and git services added a comment -

          Commit 1665579 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1665579 ]

          SOLR-6349 + SOLR-6682: Added support for stats.field localparams to enable/disable individual stats; Fix response when using EnumField with StatsComponent

          Show
          ASF subversion and git services added a comment - Commit 1665579 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1665579 ] SOLR-6349 + SOLR-6682 : Added support for stats.field localparams to enable/disable individual stats; Fix response when using EnumField with StatsComponent
          Hide
          ASF subversion and git services added a comment -

          Commit 1665635 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1665635 ]

          SOLR-6349 + SOLR-6682: test workaround since (deprecated) stats.facet doesn't garuntee order of list

          Show
          ASF subversion and git services added a comment - Commit 1665635 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1665635 ] SOLR-6349 + SOLR-6682 : test workaround since (deprecated) stats.facet doesn't garuntee order of list
          Hide
          ASF subversion and git services added a comment -

          Commit 1665639 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1665639 ]

          SOLR-6349 + SOLR-6682: Added support for stats.field localparams to enable/disable individual stats; Fix response when using EnumField with StatsComponent (merge r1665579, r1665635)

          Show
          ASF subversion and git services added a comment - Commit 1665639 from hossman@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1665639 ] SOLR-6349 + SOLR-6682 : Added support for stats.field localparams to enable/disable individual stats; Fix response when using EnumField with StatsComponent (merge r1665579, r1665635)
          Hide
          Hoss Man added a comment -

          Thanks Tomas & Xu

          Show
          Hoss Man added a comment - Thanks Tomas & Xu
          Hide
          ASF subversion and git services added a comment -

          Commit 1665730 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1665730 ]

          SOLR-6349: real fix for out of order stats facet's

          Show
          ASF subversion and git services added a comment - Commit 1665730 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1665730 ] SOLR-6349 : real fix for out of order stats facet's
          Hide
          ASF subversion and git services added a comment -

          Commit 1665733 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1665733 ]

          SOLR-6349: real fix for out of order stats facet's (merge r1665730)

          Show
          ASF subversion and git services added a comment - Commit 1665733 from hossman@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1665733 ] SOLR-6349 : real fix for out of order stats facet's (merge r1665730)
          Hide
          Timothy Potter added a comment -

          Bulk close after 5.1 release

          Show
          Timothy Potter added a comment - Bulk close after 5.1 release

            People

            • Assignee:
              Hoss Man
              Reporter:
              Hoss Man
            • Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development