SOLR-5428

New statistics results for StatsComponent - distinctValues and countDistinct

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      I thought it would be very useful to display the distinct values (and the count) of a field among other statistics. Attached a patch implementing this in StatsComponent.

      Added results:
      "distinctValues" - list of all distinct values
      "countDistinct" - count of distinct values

      1. SOLR-5428.patch
        28 kB
        Elran Dvir
      2. SOLR-5428.patch
        11 kB
        Elran Dvir
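To make the two results named in the description concrete, here is a minimal Python sketch of what they would contain for a list of field values. It is an illustration only, not the Java code in the attached patch; the function name `distinct_stats`, the skipping of missing values, and the sorted output are assumptions made for the example.

```python
def distinct_stats(values):
    """Return the two proposed statistics for an iterable of field values."""
    # Assumption: documents with no value for the field (None) are skipped.
    distinct = {v for v in values if v is not None}
    return {
        "distinctValues": sorted(distinct),  # sorted only to get a stable display order
        "countDistinct": len(distinct),
    }
```

Calling `distinct_stats(["red", "blue", "red", None, "green"])` yields three distinct values, since the repeated "red" and the missing value do not add to the set.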

          Activity

          Yago Riveiro added a comment -

          Does this patch work in distributed queries?

          Elran Dvir added a comment -

          Yes, as far as I know.
          If there aren't any bugs in distributed queries in StatsComponent, this feature works like the other statistics.
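For readers wondering what a distributed merge of these two results has to do, the sketch below shows the general idea in Python. Per-shard distinct sets are unioned, and the count is taken from the union rather than summed, because the same value can occur on several shards. This is an illustration under that assumption, not the StatsComponent code; `merge_shard_distinct` is a hypothetical name.

```python
def merge_shard_distinct(shard_results):
    """Combine per-shard lists of distinct values into one overall result."""
    merged = set()
    for values in shard_results:
        merged.update(values)  # values seen on more than one shard collapse here
    return {"distinctValues": sorted(merged), "countDistinct": len(merged)}
```

Summing the per-shard counts instead would overcount any value that appears on more than one shard.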

          Yago Riveiro added a comment -

          This tiny patch is very, very useful.

          One question: in the case of the StatsComponent, is all the work done on the heap, or does it leverage docValues?

          Elran Dvir added a comment -

          I am not sure my understanding of Solr is good enough to answer this question.
          You can look at the code to determine the answer.
          It would be very much appreciated if you could update me on what you find out.

          Thanks.

          Shalin Shekhar Mangar added a comment -

          Thanks for the patch Elran. Collecting the 'distinctValues' is a very expensive operation. There should be a way to stop the collection of these two statistics.

          Have you seen the LukeRequestHandler? Using the fl and maxTerms params I think you can get the same information.
          http://wiki.apache.org/solr/LukeRequestHandler

          Elran Dvir added a comment -

          Thanks, Shalin.
          My use case requires 'distinctValues' alongside the other results, so I am afraid using LukeRequestHandler is not suitable.
          In what way is it expensive? Is there a way to improve it?
          What do you mean when you say "There should be a way to stop the collection"?

          Thanks.

          Yago Riveiro added a comment -

          Collecting the distinctValues can be expensive, but in my case it is a requirement that Solr can't give me in an easy way. I need to do a facet query with limit -1 to get all unique terms that match the query.

          If the StatsComponent can do the same thing, expensive or not, I vote to have the feature. How to use it, and the pros and cons of using it, must be a decision made by the user.
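The facet workaround mentioned above can be sketched as a request URL. The example assembles standard Solr faceting parameters (`facet`, `facet.field`, `facet.limit`, `facet.mincount`) in Python; the helper name, base URL, and collection name are placeholders invented for the illustration.

```python
from urllib.parse import urlencode


def facet_all_terms_url(base_url, query, field):
    """Build a select URL that asks for every term of `field` matching `query`."""
    params = [
        ("q", query),
        ("rows", 0),             # only the facet counts are wanted, not documents
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", -1),     # -1 removes the cap on returned terms
        ("facet.mincount", 1),   # drop terms that match no document in the result
    ]
    return base_url + "/select?" + urlencode(params)
```

The distinct values are then the keys of the returned facet counts, and the distinct count has to be derived on the client by counting them.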

          Elran Dvir added a comment -

          Another thing I thought about:
          my queries have q and fq and are distributed. Does LukeRequestHandler support that?

          Shalin Shekhar Mangar added a comment -

          You're both right. We can't replace this functionality with LukeRequestHandler. At the same time, forcing everyone to keep a set of distinct values in memory, when someone just needs min, max or count is bad.
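The memory concern can be illustrated with a small accumulator. In this Python sketch, min, max, and count need constant space, while distinct tracking needs a set that grows with the number of distinct values, so the set is allocated only on request. The class name `FieldStats` and the `track_distinct` flag are hypothetical and do not come from the patch.

```python
class FieldStats:
    """Running min/max/count, with distinct tracking only when requested."""

    def __init__(self, track_distinct=False):
        self.min = None
        self.max = None
        self.count = 0
        # Allocated only on request, so callers who need just min/max/count
        # never hold a set of values in memory.
        self.distinct = set() if track_distinct else None

    def accumulate(self, value):
        self.count += 1
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)
        if self.distinct is not None:
            self.distinct.add(value)
```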

          Yago Riveiro added a comment -

          OK, I forgot that the StatsComponent returns all metrics in one call.

          Maybe the StatsComponent needs some tweaking to return only the metrics that we need and not all of them. If the analytics component could work with distributed searches, this patch would not be necessary.

          Shalin Shekhar Mangar added a comment -

          "Maybe the StatsComponent needs some tweaking to return only the metrics that we need and not all of them."

          Yes, if we are going to add such memory-intensive stats then we need to do that. We can't let everyone pay the penalty of this feature.

          "If the analytics component could work with distributed searches, this patch would not be necessary."

          I think that is bound to happen sooner or later. Anyways, I'm happy to review and commit this if someone is willing to put up the patches.

          Elran Dvir added a comment -

          I attached a new patch. The stats component now has a new parameter, "stats.calcdistinct".
          By default it is false. It can also be set per field (f.field.stats.calcdistinct).
          Let me know what you think.
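A request that turns the new parameter on for a single field might be built as in the Python sketch below. The parameter names `stats`, `stats.field`, and the per-field `f.<field>.stats.calcdistinct` come from this issue and the existing StatsComponent; the helper name, base URL, and collection name are placeholders.

```python
from urllib.parse import urlencode


def stats_distinct_url(base_url, query, field):
    """Build a select URL requesting stats plus distinct values for one field."""
    params = [
        ("q", query),
        ("rows", 0),                                 # statistics only, no documents
        ("stats", "true"),
        ("stats.field", field),
        (f"f.{field}.stats.calcdistinct", "true"),   # opt in for this field only
    ]
    return base_url + "/select?" + urlencode(params)
```

Because the flag is off by default, requests that omit it keep the earlier behavior and cost.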

          Otis Gospodnetic added a comment -

          Isn't SOLR-5302 a replacement for StatsComponent? If so, shouldn't this patch be implemented on top of it instead of on top of StatsComponent?

          Yago Riveiro added a comment -

          I think that the analytics component doesn't support distributed queries.

          Otis Gospodnetic added a comment -

          Yago - I'm guessing only because of percentiles and such. See my comment about that in SOLR-5302 - I think it's doable.

          Yago Riveiro added a comment -

          For me, the value of this patch is the ability to get distinctValues and countDistinct in a distributed environment. If it's possible to implement this patch on top of the AnalyticsComponent, I think that should be done, for the simple reason that the StatsComponent will eventually be deprecated.

          The issue is that SOLR-5302 will not be released soon, maybe in Solr 5.0, whereas this patch is straightforward enough that it could be released in Solr 4.7 with some tweaks.

          Shalin Shekhar Mangar added a comment -

          "Isn't SOLR-5302 a replacement for StatsComponent? If so, shouldn't this patch be implemented on top of it instead of on top of StatsComponent?"

          We have a working patch here. I'd rather commit it and worry about AnalyticsComponent later.

          David Smiley added a comment -

          "We have a working patch here. I'd rather commit it and worry about AnalyticsComponent later."

          +1

          ASF subversion and git services added a comment -

          Commit 1544043 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1544043 ]

          SOLR-5428: New 'stats.calcdistinct' parameter in StatsComponent returns set of distinct values and their count. This can also be specified per field e.g. 'f.field.stats.calcdistinct'

          ASF subversion and git services added a comment -

          Commit 1544044 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1544044 ]

          SOLR-5428: New 'stats.calcdistinct' parameter in StatsComponent returns set of distinct values and their count. This can also be specified per field e.g. 'f.field.stats.calcdistinct'
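How the global and per-field forms of the parameter interact can be sketched as follows. The Python example assumes the usual Solr convention that a per-field `f.<field>.<param>` value overrides the global one, and the default of false stated earlier in this issue; the function name and the plain dict of string parameters are hypothetical stand-ins for Solr's request parameters.

```python
def calc_distinct_enabled(params, field):
    """Resolve the flag for one field: per-field value first, then the global one."""
    per_field = params.get(f"f.{field}.stats.calcdistinct")
    if per_field is not None:
        return per_field.lower() == "true"
    # Assumption: with neither form present, the flag defaults to false.
    return params.get("stats.calcdistinct", "false").lower() == "true"
```

With `{"stats.calcdistinct": "true", "f.price.stats.calcdistinct": "false"}`, the sketch enables distinct collection for every stats field except `price`.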

          Shalin Shekhar Mangar added a comment -

          Thanks Elran!

          Elran Dvir added a comment -

          Thank you, Shalin!

          Steven Bower added a comment -

          Does this work on multi-value fields?

          Elran Dvir added a comment -

          I think it should work on multi-value fields. It works like any other stats function.

            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Elran Dvir
            • Votes: 1
            • Watchers: 8