SOLR-2242

Get distinct count of names for a facet field

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: Response Writers
    • Labels:
      None

      Description

      When faceting with facet.field=<name of field>, you get a list of the field's distinct values with their counts. This is normal behavior. This patch additionally tells you how many distinct values you have (the number of rows). Use it with facet.limit=-1 and facet.mincount=1.

      The feature is called "namedistinct". Here is an example:

      Parameters:
      facet.numTerms or f.<field>.facet.numTerms = true (default is false) - turns on distinct counting of terms

      facet.field - the field whose terms are counted
      It adds a new section to the facet part of the response...

      http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price

      http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=false&facet.limit=-1&facet.field=price


      This currently only works on facet.field.

      
      <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">...</lst>
      <lst name="facet_numTerms">
      <lst name="localhost:8983/solr/">
      <int name="price">14</int>
      </lst>
      <lst name="localhost:8080/solr/">
      <int name="price">14</int>
      </lst>
      </lst>
      <lst name="facet_dates"/>
      <lst name="facet_ranges"/>
      </lst>
      
      OR, with no sharding:
      
      <lst name="facet_numTerms">
      <int name="price">14</int>
      </lst>
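To make the shape of the response concrete, here is a minimal sketch of reading the distinct count out of the unsharded response above, using only Python's standard library (the XML fragment is copied from the example; nothing beyond it is assumed):

```python
import xml.etree.ElementTree as ET

# The unsharded facet_counts fragment from the example above.
response = """
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields"/>
  <lst name="facet_numTerms">
    <int name="price">14</int>
  </lst>
</lst>
"""

root = ET.fromstring(response)
# Pull each <int> under the facet_numTerms section into a dict.
num_terms = {
    e.get("name"): int(e.text)
    for section in root.findall("lst")
    if section.get("name") == "facet_numTerms"
    for e in section.findall("int")
}
print(num_terms)  # {'price': 14}
```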
      
      

      Several people use this to get the group.field count (the # of groups).

      1. SOLR.2242.solr3.1.patch
        4 kB
        Dmitry Drozdov
      2. SOLR-2242.patch
        4 kB
        James Dyer
      3. SOLR-2242.solr3.1.patch
        4 kB
        Lance Norskog
      4. SOLR-2242.shard.withtests.patch
        13 kB
        Bill Bell
      5. SOLR-2242.patch
        15 kB
        Simon Willnauer
      6. SOLR-2242.solr3.1-fix.patch
        4 kB
        Nguyen Kien Trung
      7. SOLR-2242.patch
        14 kB
        Erick Erickson
      8. SOLR-2242-solr40-3.patch
        6 kB
        Bill Bell
      9. SOLR-2242-3x.patch
        6 kB
        Erick Erickson
      10. SOLR-2242-3x_5_tests.patch
        18 kB
        Bill Bell

        Issue Links

          Activity

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Brett Hoerner added a comment -

          There is no public patch that I know of that does the HyperLogLog stuff. Terrance A. Snyder mentioned it in his comment above, but that's it.

          I haven't started any work here yet but I hoped to in the future.

          Shalin Shekhar Mangar added a comment -

          It sounds like stats.calcDistinct=true does the "correct, but slow" thing?

          Yes, that is why I did not close the ticket.

          this ticket took a turn towards approximate counts using probabilistic data structures (specifically HyperLogLog). That's to support fast approximate unique counts in systems like SolrCloud where each shard could have hundreds of millions of unique values.

          Do you know what is the state of this patch? Are people using the hyperloglog implementation in production? Apart from a committer's attention, what does this issue need?

          Brett Hoerner added a comment -

          Shalin Shekhar Mangar, this ticket took a turn towards approximate counts using probabilistic data structures (specifically HyperLogLog). That's to support fast approximate unique counts in systems like SolrCloud where each shard could have hundreds of millions of unique values. It sounds like stats.calcDistinct=true does the "correct, but slow" thing?

          Shalin Shekhar Mangar added a comment -

          I think this is possible now with SOLR-5428 - StatsComponent can count distinct values of a field with stats.calcDistinct=true parameter.

          Vassil Velichkov added a comment -

          I really hope that this issue will be resolved in Solr 4.7... fingers crossed.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Bill Bell added a comment -

          The one use case (2 parts) that I want to make sure we are satisfying is:

          1. Ability to get the total number of distinct terms in the facet.field.
          For example, if facet.field=gender, I would expect the distinct count to be 1 or 2 (Male/Female), depending on filters.
          2. For sharding, Terrance's might be the right approach, but is it accurate or an approximation? For small sets sharding will work fine (< 100 results). For example, if you were asking for distinct counts from 2 shards, and the shards were set up with 20 states in one shard and 30 in the other, I would expect distinct states = 50. Will your solution do that?

          Thanks - so happy this is moving forward. Not sure I understand the syntax from Terrance yet...
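Bill's two-shard expectation can be sanity-checked with a toy sketch (hypothetical state lists, not real Solr output): summing per-shard distinct counts is only correct when the shards hold disjoint values, and an exact answer in general requires merging the terms themselves — the expensive step the later comments debate.

```python
# Hypothetical shard contents: 20 states on shard 1, 30 different states on shard 2.
shard1 = {f"state_{i}" for i in range(20)}
shard2 = {f"state_{i}" for i in range(20, 50)}

# Disjoint shards: the naive per-shard sum and the true union agree.
assert len(shard1) + len(shard2) == 50
assert len(shard1 | shard2) == 50

# Overlapping shards: the naive sum overcounts, which is why exact
# distributed distinct counts require shipping and merging terms.
shard3 = {f"state_{i}" for i in range(10, 40)}   # overlaps shard1 by 10 states
naive = len(shard1) + len(shard3)                # 50
exact = len(shard1 | shard3)                     # 40
```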

          Otis Gospodnetic added a comment -

          Terrance A. Snyder - you're quick.
          Answers:

          • stream-lib and license - perfectly fine. At Sematext we use their stuff as well.
          • 82% test case coverage - good!
          • documentation - eventually it should be put on the Solr Wiki, but don't let that stop you!
          • smell - precisely!
          J Mohamed Zahoor added a comment -

          As I mentioned in my earlier comment, we have experimented with the stream-lib implementation of HLL for a similar purpose... and it is good.
          It makes good sense to have probabilistic data structures for large numbers of docs.
          BTW, we are using Solr as an analytics engine with great success.

          Terrance A. Snyder added a comment - - edited

          Otis Gospodnetic I got the email. I'll give some background, since we've enhanced and combined things, but I should be able to put together a patch in the following week. There is an old version on GitHub that I need to update to trunk; I'll spend time doing this. Most of this work was enhancing two existing JIRA items, which are wonderful.

          Core Work:
          https://issues.apache.org/jira/browse/SOLR-2894
          https://issues.apache.org/jira/browse/SOLR-3583

          Newer features:

          + Some of the issues discussed here around distributed counting have already been solved in larger installations (counting billions of items). I work in the advertising space, and when counting/slicing/dicing 90+ billion documents, sending highly unique facet values (such as session ID or cookie ID) between shards is hugely wasteful and doesn't scale.

          + The ad industry is great at counting stuff "at scale" - sessions, web events, etc. We take the stance that counting can be "roughly" right: when we get to billions, a 0-1.5% error rate is OK if the response time goes from minutes to milliseconds. As such, an optional "estimated count" parameter is added, which leverages a HyperLogLog implementation to produce a ~98.5% correct response. By default this is turned on for us, on a large installation (multiple billions of POS transactions).

          HyperLogLog

          http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
          http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
          http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/

          Syntax

          ../select/?q=*:*&facet=true&facet.pivot=impression_date,state_name,consumer_id&facet.distinct.estimate=true

          Again - estimate is used as there might be 300 million uniques here (consumer_id) and sharding to 10 servers results in a huge waste of time - when we're that high of a number 98% right is good enough. As we narrow down we can throttle to 100% correct by doing "facet.distinct.estimate=false".

          We still use "pivot" to drive this as the patch for distributed works - we simply send around HyperLogLog instances in serialized form and they can be unioned and intersected appropriately.

          Questions as I'd like to actually do this right

          + Rather than re-invent the wheel I use stream-lib (https://github.com/clearspring/stream-lib). It is apache licensed and includes HyperLogLog, HyperLogLogPlus, BloomFilters, TopK, QDigest, etc. Is this an issue?

          + Test cases - I've got 82% code coverage - is this good enough?

          + Documentation - I've got markdown documents that cover the commands and syntax - is this the right format?

          + SOLR-2894, SOLR-3583 - It makes logical sense that these start to be joined together. When using all these I sometimes start smelling solr as an analytic engine (and it's a very nice one when combining probabilistic data structures).

          If someone can answer the above questions while I sync to /trunk please let me know.

          Old Version for posterity until I get around to updating to latest trunk and including the HyperLogLog implementation - doesn't include HyperLogLog sketching - minor updates.
          https://github.com/terrancesnyder/solr-analytics/blob/master/solr/core/src/java/org/apache/solr/handler/component/PivotFacetHelper.java
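For readers unfamiliar with the technique, here is a minimal, self-contained HyperLogLog sketch (an illustration of the idea only, not the stream-lib code referenced above): each shard fills a small register array, arrays are merged with an element-wise max, and the merged sketch estimates the union cardinality within a percent or two.

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog: 2**p registers, ~1.04/sqrt(2**p) relative error."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest determine rho.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rho = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def merge(self, other):
        # Union of two sketches is an element-wise max -- this is what makes
        # the structure attractive for distributed (sharded) counting.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

# Two "shards" with overlapping ids; the true union cardinality is 100,000.
a, b = HLL(), HLL()
for i in range(60_000):
    a.add(i)
for i in range(40_000, 100_000):
    b.add(i)
a.merge(b)
print(a.count())  # close to 100,000
```

stream-lib's HyperLogLogPlus adds sparse encoding and bias correction on top of this basic scheme, which is part of what makes it practical at the scale described above.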

          Otis Gospodnetic added a comment -

          This issue looks very interesting and it looks like it's >2 years old with Bill Bell having moved on, most likely.
          Based on my reading of the last 2 years worth of comments above, Terrance A. Snyder's comment (see https://issues.apache.org/jira/browse/SOLR-2242?focusedCommentId=13275101&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13275101) seemed the most thorough and his idea the most advanced. He posted a patch to S3.... which is no longer there.

          I'll email Terrance now in hopes of getting his patch attached here, but it would be great if somebody with more knowledge of faceting/pivot area of Solr could push this. I saw Yonik Seeley did look at this issue a while back...

          Shawn Heisey added a comment -

          Which patch represents the best work? I got SOLR-2242-solr40-3.patch to apply to trunk with a little love, but tests having to do with facets are failing. It is also quite a bit smaller than the newest patch for 3x.

          Robert Muir added a comment -

          Apparently what makes it tricky to implement is the distributed environment?

          Because you have to merge all the values to get the unique count.

          Shawn Heisey added a comment -

          That will indeed help as I find time to look things over. Thanks!

          Jonathan Rochkind added a comment -

          Shawn Heisey: Forgive me if I'm misunderstanding what you don't understand, but, here's what this feature does, at the high level:

          You can ask Solr for facet response already. You get, for instance, the first 10 (or first `facet.limit`) facet values, sorted by your chosen sort criteria. You can, already, then choose to page through all the facet values, using facet.offset combined with facet.limit.

          You can page through them, but you don't know how long you'll be paging for – at some point your request with a given facet.offset will just stop returning results because you've exhausted all the facet values available. But you have no way to know when that will be until you get there. There is no way to get the total number of facet results available.

          This feature is meant to add that, a way to get in the response the count of the total number of unique facet values, the ones you'd be paging through with facet.offset.

          Apparently what makes it tricky to implement is the distributed environment?

          Some of the language used in this ticket to refer to the feature is indeed confusing IMO. I hope this helps.
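The paging behavior Jonathan describes can be sketched as follows (the facet values are simulated with a plain list; the point is that without a total count the client only discovers the end by hitting an empty page):

```python
# Simulate a field with 37 distinct facet values; the client does not know this.
facet_values = [f"term_{i}" for i in range(37)]

def fetch_facet_page(offset, limit):
    """Stand-in for a Solr request using facet.offset and facet.limit."""
    return facet_values[offset:offset + limit]

limit, offset, pages = 10, 0, 0
while True:
    page = fetch_facet_page(offset, limit)
    if not page:          # only now does the client learn it is done
        break
    pages += 1
    offset += limit

print(pages)  # 4 pages (10 + 10 + 10 + 7 values)
```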

          J Mohamed Zahoor added a comment -

          One way to achieve this in a distributed environment is to use an approximation technique like HyperLogLog.

          Shawn Heisey added a comment -

          Bill Bell Yago Riveiro:

          I am having a hard time understanding what this feature actually DOES, in concrete terms. That's my failing, the info is probably in the description and comments, it's just not sinking in.

          I am willing to pursue this to the best of my ability, but I will admit in advance that my ability may not be quite enough. I'm new to the committer role, which means that I don't work very fast and I'm still learning the ropes. I think I can commit it and backport to 4.x if the following criteria are met:

          • We can get the feature to apply to trunk and consistently pass tests (not counting what's failing due to other problems).
          • There are new tests for all new functionality.
          • We put it up for review by other committers, particularly Robert Muir and Yonik Seeley, and there are no negative votes.

          J Mohamed Zahoor added a comment -

          +1 for this feature with sharding support... it's a killer, really...

          Yago Riveiro added a comment -

          It is unfortunate that this feature is stalled. With sharding this feature is a killer; I've been waiting for it ever since I learned it was in progress.

          Bill Bell added a comment -

          Yeah. This issue has stalled. To get it ready for release we just need to apply the patch and run all unit tests.

          Issues tend to stall when we don't have a committer leading the work to get it done. If someone steps up, I will commit to doing the work. The last time I made a push for this, there were several approaches:

          1. Change the facet formats (Yonik)
          2. Change the parameter names and hide the fact that we are looping through all (limit=-1).
          3. Try to get the sharding working. Although I would contend that we can release without sharding and add it later. Sharding - we can send the unique terms and combine to get exact numbers, or we can separate and send (as it is now). The former is much harder to do and could cause perf issues.

          Thoughts? Maybe at the Lucene conference this can be discussed?

          J Mohamed Zahoor added a comment -

          Does the patch provide distinct counts in the case of multiple shards?

          Amber Duque added a comment -

          I have a question on the SOLR-2242-solr40-3.patch.
          I have applied this patch on top of the Solr 4.0 release (http://svn.apache.org/repos/asf/lucene/dev/tags/ - lucene_solr_4_0_0).
          The patch builds fine, but several solr unit tests fail:

          Tests with failures:

          • org.apache.solr.request.TestFaceting.testFacets
          • org.apache.solr.request.TestFaceting.testRegularBig
          • org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch
          • org.apache.solr.TestDistributedSearch.testDistribSearch
          • org.apache.solr.TestDistributedGrouping.testDistribSearch
          • org.apache.solr.request.SimpleFacetsTest (suite)
          • org.apache.solr.TestGroupingSearch.testRandomGrouping
          • org.apache.solr.TestGroupingSearch.testGroupingGroupedBasedFaceting
          • org.apache.solr.cloud.BasicDistributedZk2Test.testDistribSearch

          Do the unit tests pass successfully for anyone (for this patch applied on top of the solr 4.0 release)?

          Thanks!

          Bill Bell added a comment - - edited

          uygar,

          You are not using it properly. SOLR-2242-3x_5_tests.patch does indeed work.

          http://x.x.x.x:8985/solr/ar1/select?shards=192.168.200.202:8985/solr/ar3/,192.168.200.202:8985/solr/ar4&q=hotels&group=true&group.field=site&facet=true&f.site.facet.numFacetTerms=1&facet.mincount=1&facet.limit=-1

          You forgot facet.field=site, and the parameter is f.site.facet.numTerms=true.

          To try it with the sample data: copy example to example2, and change jetty.xml in example2 to use port 8080. Then run:

          http://localhost:8983/solr/select?shards=localhost:8983/solr/,localhost:8080/solr/&q=*:*&rows=0&facet=true&facet.field=price&facet.numTerms=true&facet.mincount=1&facet.limit=-1

          Robert Muir added a comment -

          moving all 4.0 issues not touched in a month to 4.1

          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Jason Rutherglen added a comment -

          Terrance, can you post a patch to the Jira? It makes sense to start this Jira off non-distributed, and add a distributed version in another Jira issue...

          Terrance A. Snyder added a comment -

          Hello all, new to this group and contributing. Perhaps this is a bad idea, but could we not just extend facet.pivot to drive distinct value counts of fields? Considering facet.pivot already enumerates distinct values (it dumps everything), we can add an option to facet.pivot to return the distinct count rather than returning every field and value. Not only does this solve the problem for distinct facet counts, it opens the door to using the facet.pivot option to drive "distinct counts" multiple levels deep (see use-cases). It also helps people who use facet.pivot to get distinct counts today, and those concerned about the total network bandwidth/performance of using facet.pivot.

          Attached is a quick patch to facet.pivot to include a new param:

          facet.pivot.distinct=[true|false]

          Turn on/off returning distinct counts when using the facet.pivot parameter. By specifying true, the facet.pivot command returns the default format plus an additional "distinct" field. The last field specified in "facet.pivot" is never returned itself when using facet.pivot.distinct; only the total # of distinct values for that field is returned in the parent node.

          Use-Case:

          I have a catalog of data which contains logs from a server. I want to organize my view into the logs such that I can pivot the logs by date, time, and then transaction number so that I can show a chart of the # of distinct transactions that occur by day and by hour (market analysis).

          I AM NOT interested in the actual "literal" values of the transactions, as this is likely to be a very large set of data and provides no business value. Instead, I am only interested in the distinct count of items.

          It is implied that when I specify my pivot that the last item in my pivot will always be returned as the aggregate distinct count and will not return the actual values.

          Other use-cases:

          As it stands today, the pivot feature already enumerates distinct values; the only drawback is that the consuming app must "count" the distinct terms itself.

          If the application doesn't care about the actual values, only the number of distinct terms, then it wastes CPU and network transmitting very large lists of data and iterating through them just to get the total count.

          By allowing a user to specify facet.pivot=field1,field2 along with facet.pivot.distinct=true the user will get all distinct values for all fields. The last field will always return ONLY the distinct count and will not return physical values (thereby saving network / cpu cycles).
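          To make the saving concrete, here is a minimal client-side sketch of the idea (the nested-dict shape below is an assumption loosely modeled on a JSON pivot response, not the patch's actual code): collapsing the deepest pivot level to its count, as facet.pivot.distinct proposes, means the leaf values never need to cross the wire.

```python
# Sketch of what facet.pivot.distinct=true would do to a pivot response:
# replace the innermost 'pivot' list with a 'distinct' count, so the leaf
# values (e.g. individual transaction numbers) are never transmitted.

def collapse_leaf(pivots):
    """Recursively replace the deepest 'pivot' list with a 'distinct' count."""
    for node in pivots:
        children = node.get("pivot")
        if children is None:
            continue
        # If no child has a further pivot, this is the leaf level:
        # keep only the number of distinct values found there.
        if all("pivot" not in c for c in children):
            node["distinct"] = len(children)
            del node["pivot"]
        else:
            collapse_leaf(children)
    return pivots

response = [
    {"field": "order_date_txt", "value": "2009-11-01",
     "pivot": [{"field": "order_tran_nbr", "value": t} for t in ("t1", "t2", "t3")]},
]
collapse_leaf(response)
print(response[0])  # {'field': 'order_date_txt', 'value': '2009-11-01', 'distinct': 3}
```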

          Notes:

          • Debated between calling the "distinct" count "distinct", "num_children", etc. YMMV; I called it distinct, but others can call it what they want.
          • Not sure of the status of distributed facet.pivot? As it stands, the current pivot feature does not work distributed.
            In order to make this work 'all the time', large amounts of data would need to be shared across all shards to determine the "distinct" values.
          • YMMV, but some shards are logically partitioned to ensure no overlap; take for example date or transaction #. If I partitioned my shards by date and asked for a distinct count, I KNOW implicitly that the distinct count is additive (due to the partitioning, two shards can never share a transaction id), so a distributed query could "assume" that each shard's distinct count is additive and save bandwidth.
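          The additivity point in the last note can be checked with a toy example (the shard contents here are made up for illustration): with disjoint shards the per-shard distinct counts sum to the global distinct count, while with overlap the sum overcounts.

```python
# Toy demonstration of when per-shard distinct counts are additive.

def distinct(terms):
    """Number of distinct terms in a shard's term list."""
    return len(set(terms))

# Disjoint shards (e.g. partitioned by transaction id): the sum is exact.
shard_a = ["t1", "t2", "t3"]
shard_b = ["t4", "t5"]
assert distinct(shard_a) + distinct(shard_b) == distinct(shard_a + shard_b)  # both 5

# Overlapping shards: the sum counts the shared term twice.
shard_c = ["t1", "t2"]
shard_d = ["t2", "t3"]
print(distinct(shard_c) + distinct(shard_d))   # 4
print(distinct(shard_c + shard_d))             # 3, the true global count
```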

          Example 1:

          Pivot: 2 Fields Deep with Distinct [date,transaction #]
          Shows: All Distinct Dates with total # of distinct transactions in that date range
          http://localhost:8080/solr/orders/select?q=*:*&facet=true&facet.pivot=order_date_txt,order_tran_nbr&facet.pivot.distinct=true&rows=0

           
          ...
          ...
          <lst name="facet_pivot">
            <arr name="order_date_txt,order_tran_nbr">
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-01</str>
                <int name="distinct">12566</int>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-02</str>
                <int name="distinct">14804</int>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-03</str>
                <int name="distinct">15940</int>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-04</str>
                <int name="distinct">15540</int>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-05</str>
                <int name="distinct">15656</int>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-06</str>
                <int name="distinct">15378</int>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-07</str>
                <int name="distinct">13551</int>
              </lst>
            </arr>
          </lst>
          ...
          ...
          

          Example 2:

          Pivot: 3 Fields Deep with Distinct [date,city,transactions]
          Shows: All Distinct Dates, All Distinct Cities, and the total distinct orders in that time within any city named "ANAH*"
          http://localhost:8080/solr/orders/select?q=*:*&facet=true&facet.pivot=order_date_txt,store_city_nm,order_tran_nbr&facet.pivot.distinct=true&rows=0&fq=store_city_nm:ANAH*

          ...
          ...
          <?xml version="1.0"?>
          <lst name="facet_pivot">
            <arr name="order_date_txt,store_city_nm,order_tran_nbr">
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-01</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">189</int>
                  </lst>
                </arr>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-02</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">212</int>
                  </lst>
                </arr>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-03</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">203</int>
                  </lst>
                </arr>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-04</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">180</int>
                  </lst>
                </arr>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-05</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">252</int>
                  </lst>
                </arr>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-06</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">199</int>
                  </lst>
                </arr>
              </lst>
              <lst>
                <str name="field">order_date_txt</str>
                <str name="value">2009-11-07</str>
                <int name="distinct">1</int>
                <arr name="pivot">
                  <lst>
                    <str name="field">store_city_nm</str>
                    <str name="value">ANAHEIM</str>
                    <int name="distinct">110</int>
                  </lst>
                </arr>
              </lst>
            </arr>
          </lst>
          ...
          ...
          

          Right now I can't make an attachment, but I posted it to my S3 account.

          https://s3.amazonaws.com/behemoth.io/distinct.pivot.patch

          uygar bayar added a comment -

          Hi,
          I tried it on 3.6.0 with SOLR-2242-3x_5_tests.patch but it didn't work. Results are grouped but all facets are empty.

          <lst name="facet_counts">
          <lst name="facet_queries"/>
          <lst name="facet_fields"/>
          <lst name="facet_numTerms"/>
          <lst name="facet_dates"/>
          <lst name="facet_ranges"/>

          http://x.x.x.x:8985/solr/ar1/select?shards=192.168.200.202:8985/solr/ar3/,192.168.200.202:8985/solr/ar4&q=hotels&group=true&group.field=site&facet=true&f.site.facet.numFacetTerms=1&facet.mincount=1&facet.limit=-1

          Bill Bell added a comment -

          Ready for 3x merge. Test with:

          ant test -Dtestcase=NumFacetTermsFacetsTest

          Bill Bell added a comment -

          3X version with test cases

          Bill Bell added a comment -

          All tests pass on branch_3x now.

          Bill Bell added a comment -

          Fixed one of the tests that was failing.
          SOLR-2242-3x_4.patch

          Bill Bell added a comment -

          Latest 3x patch. SOLR-2242-3x_3.patch

          Bill Bell added a comment -

          Found a bug and attaching new patch.

          Bill Bell added a comment -

          Latest 3x patch is uploaded: SOLR-2242-3x_2.patch

          Bill Bell added a comment -

          I changed the sharding response to check the size and only return the shard name if there is a response.

          <lst name="facet_numTerms">
          <lst name="localhost:8983/solr"/>
          </lst>
          
          Changed to 
          
          <lst name="facet_numTerms"/>
          

          Also, the code for field_facets was wrong. It needs to return the name of the field even if the size is 0 or null.

          See latest patch for 3x.

          Bill Bell added a comment -

          Yonik, agreed. However, what is the alternative? We are talking about distinct terms, and unless I limit the number of terms there could be a performance issue using this with sharding, since I would need to send the terms, combine them, and look for the uniques. I am willing to do that work (not that much coding; I am more worried about CPU and network performance). The patch I submitted changes the format by ADDING a new section. It shouldn't break other facets (adding sections to the JSON/XML output is usually not a hard break). The latest version does not change the facet_field section, so it is compatible.

          I am working on getting the tests to pass. Most failures look like trivial fixes rather than anything serious, since we changed the format...

          However, several people would like to use this. If I fix the test cases that are breaking can we consider a commit?
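          For reference, exact cross-shard aggregation would mean each shard ships its full term set and the coordinator takes the union, which is exactly the CPU/network cost described above. A hypothetical sketch (shard names and data are made up):

```python
# Hypothetical coordinator-side merge: each shard returns its distinct terms
# for a field, and the coordinator unions them for an exact global count.
# Bandwidth grows with the total number of distinct terms, which is the
# cost concern raised in the comment above.

shard_responses = {
    "localhost:8983/solr": {"price": {"0.0", "11.5", "19.95"}},
    "localhost:8081/solr": {"price": {"0.0", "74.99"}},
}

def merged_num_terms(responses, field):
    """Union the per-shard term sets; '0.0' is counted only once."""
    merged = set()
    for terms_by_field in responses.values():
        merged |= terms_by_field.get(field, set())
    return len(merged)

print(merged_num_terms(shard_responses, "price"))  # 4
```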

          Yonik Seeley added a comment -

          There are other JIRA issues open for adding more facet-related data as well, and adding a new section for each doesn't seem desirable.
          I think I'm still in favor of biting the bullet and changing the facet response format for 4.0, while having some sort of flag to enable the older format for back compat.

          Erick Erickson added a comment -

          Bill:

          Tests do not pass on either 3.x or trunk with this patch.
          Some 3.x failures:

          ant test -Dtestcase=TestDistributedSearch
          ant test -Dtestcase=testGroupingGroupedBasedFaceting
          ant test -Dtestcase=TestDistributedGrouping

          Some 4.x failures:
          ant test -Dtestcase=BasicDistributedZkTest
          ant test -Dtestcase=TestGroupingSearch

          I'm not sure whether these are test problems or more serious...

          Erick Erickson added a comment -

          This patch applies against the 3.x code line. Bill, you might want to check it; I had to do some merging by hand.

          Bill Bell added a comment -

          Patch for the Solr 3.5 branch. There is something wrong with branch_3x, but this one applies and is against 3.5:

          http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_5
          Last Changed Rev: 1207561

          Bill Bell added a comment -

          Fixed the order of facet_numTerms and fixed the getShard call to be consistent with Solr 3.5.

          I think this is ready...

          Bill Bell added a comment -

          https://issues.apache.org/jira/secure/attachment/12519406/SOLR-2242-solr40-2.patch is the latest patch.
          Bill Bell added a comment -

          I added sharding as discussed by Antoine.

          <lst name="facet_numTerms">
          <lst name="http://localhost:8983/solr">
          <int name="price">14</int>
          <int name="cat">15</int>
          </lst>
          <lst name="http://localhost:8081/solr">
          <int name="price">23</int>
          <int name="cat">3</int>
          </lst>
          </lst>
          

          Example call

          http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:8081/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price&facet.field=cat

          Bill Bell added a comment -

          Added Sharding

          Antoine Le Floc'h added a comment -

          Bill,

          Just a thought: how are you going to plug in SOLR-3134 then?
          Since we are not able to aggregate distinct counts over shards, shouldn't you do something like:

          <lst name="facet_numTerms">
            <lst name="localhost:7777/solr">
              <int name="cat">15</int>
              <int name="price">14</int>
            </lst>
            <lst name="localhost:8888/solr">
              <int name="cat">3</int>
              <int name="price">23</int>
            </lst>
          </lst>
          
          Erick Erickson added a comment -

          I won't get to this for 3.6

          Bill Bell added a comment -

          See https://issues.apache.org/jira/secure/attachment/12519024/SOLR-2242-solr40.patch for the patch.
          Bill Bell added a comment -

          How does it work?

          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=cat&facet.field=price&f.price.facet.numTerms=true&facet.limit=-1&f.cat.facet.numTerms=true&f.price.facet.limit=1
          

          Parameters:

          facet.numTerms or f.<field>.facet.numTerms = true (default is false) - turn on distinct counting of terms
          facet.field - the field to count the terms

          It creates a new section in the facet section... For example:

          <lst name="facet_counts">
            <lst name="facet_queries"/>
            <lst name="facet_fields">
              <lst name="cat">
                <int name="camera">1</int>
                <int name="connector">2</int>
                <int name="copier">1</int>
                <int name="currency">4</int>
                <int name="electronics">14</int>
                <int name="graphics card">2</int>
                <int name="hard drive">2</int>
                <int name="memory">3</int>
                <int name="monitor">2</int>
                <int name="multifunction printer">1</int>
                <int name="music">1</int>
                <int name="printer">1</int>
                <int name="scanner">1</int>
                <int name="search">2</int>
                <int name="software">2</int>
              </lst>
              <lst name="price">
                <int name="0.0">3</int>
              </lst>
            </lst>
            <lst name="facet_numTerms">
              <int name="cat">15</int>
              <int name="price">14</int>
            </lst>
            <lst name="facet_dates"/>
            <lst name="facet_ranges"/>
          </lst>
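          As a sanity check, facet_numTerms for a field should equal the number of rows in that field's facet_fields block (with facet.limit=-1 and facet.mincount=1). A minimal client-side sketch, parsing a truncated version of the response above (the counts are adjusted to match the truncation):

```python
import xml.etree.ElementTree as ET

# Truncated facet response in the same shape as the example above.
xml = """<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="cat"><int name="camera">1</int><int name="connector">2</int></lst>
    <lst name="price"><int name="0.0">3</int></lst>
  </lst>
  <lst name="facet_numTerms">
    <int name="cat">2</int>
    <int name="price">1</int>
  </lst>
</lst>"""

root = ET.fromstring(xml)
fields = root.find("lst[@name='facet_fields']")
num_terms = root.find("lst[@name='facet_numTerms']")

# For each faceted field, the reported numTerms should match the number
# of distinct rows actually listed under facet_fields.
for field in fields:
    name = field.get("name")
    reported = int(num_terms.find(f"int[@name='{name}']").text)
    assert reported == len(field)
    print(name, reported)
```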
          
          Bill Bell added a comment -

          SOLR 4.0 TRUNK version.

          Bill Bell added a comment -

          Cody,

          I love your suggestion. I am actually ready to work on it.

          <lst name="facet_numTerms">
             <int name="text">124</int>
          </lst>
          

          After we get it committed we should then fix the shard issues as per SOLR-3134.

          We can also create a new JIRA ticket for that.

          Everyone agreed?

          I will do it on SOLR 4.0 and back port to 3.5.

          Antoine Le Floc'h added a comment -

          About the distribution issue, it looks like https://issues.apache.org/jira/browse/SOLR-3134 has some similar thinking as my post from 03/Jan/12 : show the info per shard. Even though the counter info cannot be aggregated across shards, knowing what the counter is for each shard would allow each user to use the info as he wants. It would work in single shard too.

          Ethan Gruber added a comment - - edited

          +1 for me too. I have been using this feature for almost a year. I plan to upgrade to the newest patch/Solr trunk code, but the patch doesn't apply to the current trunk. Do I have to check out the revision that dates to 12/21/11 to get this to work?

          edit: nevermind, the answer is yes. I had to check out revision 1221500 from Dec. 20.

          Cody Young added a comment -

          Had another idea that maintains backwards compatibility. We could add a new facet section:

           
          <lst name="facet_fields">
            <lst name="text">
              <int name="electronics">14</int>
              <int name="inc">8</int>
              <int name="2.0">5</int>
              <int name="lcd">5</int>
              <int name="memory">5</int>
            </lst>
          </lst>
          <lst name="facet_numTerms">
             <int name="text">124</int>
          </lst>
          

          facet.query, facet.date, and facet.range all show up in their own sections; why not facet.numTerms?

          That actually brings up an interesting question: we'll want to control this on a per-field basis, so what about something like facet.numTerms=FieldName? That brings it more in line with facet.date and facet.range.

          Cody

          Antoine Le Floc'h added a comment -

          People who need to stay back-compatible won't be able to use &facet.numTerms=true. Isn't that fair?

          About the distribution issue, maybe the distinct counter could be displayed per shard, something like:

          <lst name="facet_fields">
            <lst name="shop_id">
              <lst name="numTerms"> 
                <int ip="192.168.0.100">58</int>
                <int ip="192.168.0.101">158</int>
              </lst>
              <lst name="counts">
                <int name="28013756">7032406</int>
                <int name="28009589">3616625</int>
                <int name="976">3497825</int>
                <int name="635">1398780</int>
                <int name="28021713">440118</int>
              </lst>
            </lst>
          </lst>
          

          This way, people who don't use shards are happy, and people who do can display what makes sense for them while waiting for something better in the future. This would allow us to move forward with this JIRA.

          Erick Erickson added a comment -

          Just to be clear. I'm not volunteering to actually implement this patch. I'll gladly guide it through the process if someone wants to work on it and address the concerns raised. And I'll keep prodding it along and try to keep it from dying on the vine, and certainly volunteer to test various incarnations. Or I'll try to kill it if it comes to that.

          There are two open issues really, of which the most pressing seems to be back-compat. Cody's initial suggestion doesn't work with all the various response formats. Working out a way to change the response format without breaking back-compat seems like a worthy goal in itself, but does that mean we need to create another JIRA for that and make this JIRA dependent on the new one? Note that this is the inverse of my original point <3>, I'm suggesting we fix the back-compat issue before we address this one. I have no real clue yet how to approach that mind you.

          Again, I want a clear goal in mind before we put work into any solution.

          Antoine Le Floc'h added a comment - - edited

          To help with the specification, my use case is this: I am using this patch, possibly want to add extra info to the facet results, and want to use sharding... Basically, this is what I have today with the patch:

          <lst name="shop_id">
            <int name="numTerms">10251</int>
            <lst name="counts">
              <int name="28013756">7032406</int>
              <int name="28009589">3616625</int>
              <int name="976">3497825</int>
              <int name="635">1398780</int>
              <int name="28021713">440118</int>
              <int name="29047336">368921</int>
              <int name="411">244689</int>
            </lst>
          </lst>
          

          and I want to subclass/modify SimpleFacets to add more data for each item (since I don't see another way to do it).

          Erick Erickson added a comment - - edited

          First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. All I guarantee is that the code compiles and the NumFacetTermsFacetsTest runs from inside IntelliJ.

          Yonik Seeley added a comment -

          I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone "there are between 10,000 and 90,000 distinct values".

          I think we could come up with a pretty good estimate (but we should tell them it's an estimate somehow). Anyway, that could optionally be handled in a different issue.

          2> back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places, are there reasons NOT to do it that way? Or does this mess up JSON, PHP, etc?

          Yes, it messes up JSON, binary format, etc. We'd need to figure out how to add attributes into our data model (that gets sent to response writers) in a generic way.

          3> Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here.

          Not sure how that's possible... it's either more magic field names in with the individual constraints, or the facet response format has got to change.

          Jonathan Rochkind added a comment -

          I would find this feature valuable even if it simply did not work at all on a distributed index. (Refusing to return a value rather than returning a known incorrect value would seem like the right way to go.) Because my index is not distributed, and I would find this feature valuable, heh.

          I don't know if Solr currently has any policies against committing features that can't work on distributed, but personally my 'vote' would be doing that here, with clear documentation that it doesn't work on distributed (and the hope that future enhancements may make it more feasible to do so, as Erick suggests may possibly happen).

          Erick Erickson added a comment - - edited

          OK, it seems like we have several themes here. I'd like to get a reasonable consensus before going forward... I'll put out a straw-man proposal here and we can go from there.

          But let's figure out where we're going before revamping stuff yet again.

          1> Distributed support. I sure don't see a good way to support this currently. Perhaps some of the future enhancements will make this easier (thinking distributed TF/IDF & such while being totally ignorant of that code), but returning the entire list of constraints (or names or terms or whatever we call it) is just a bad idea. The first time someone tries this on a field with 1,000,000 terms (yes, I've seen this) it'll just blow things up. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone "there are between 10,000 and 90,000 distinct values". And if it's a field with just a few pre-defined values, that information is already known anyway.... But if someone can show a use-case here I'm not completely against it. But I'd like to see the use case first, not "someone might find it useful" <G>.

          2> back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places, are there reasons NOT to do it that way? Or does this mess up JSON, PHP, etc?

          3> Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here.

          Again, I want a clearly defined end point for the concerns raised before we dive back in here....

          Cody Young added a comment -

          Simon, any plans for this patch?

          The general consensus seems to be that this is a good patch and desired functionality. The biggest issues seem to be the magic name and distributed support. I see a proposed solution by Yonik of changing the output format but that breaks distributed search. In addition, there is a worry about backwards compatibility and possibly supporting that through a parameter.

          What if we choose a format that doesn't break backwards compatibility and possibly commit without supporting distributed for the first pass (or supporting the simple case of just adding it all together). This would let us get some progress on this issue without having a magic name in the facet list.

          If we went with a format like below then it wouldn't break backwards compatibility and it shouldn't affect anyone unless they choose to use the feature. This is also consistent with the way numFound works for the main search results. (Admittedly, it's different than ngroups, although we still see numFound used to represent the number of documents in a group.)

           
          <lst name="facet_fields">
            <lst name="text" numFacetTerms="385">
              <int name="electronics">14</int>
              <int name="inc">8</int>
              <int name="2.0">5</int>
              <int name="lcd">5</int>
              <int name="memory">5</int>
            </lst>
          </lst>
          

          Other smaller issues that appear to be outstanding:
          1. Change the code to cache the numFacetTerms/numTerms value and remove the code that caches the huge term list.
          2. Determine the parameter name: facet.nconstraints=true|false was proposed, allowing facet.count to control the rest of the behavior.

          bronco added a comment -

          Will there also be a solution for 3.5 to get the correct numFound results?

          Bill Bell added a comment -

          Sharding will not work if you change the format of the facet results... We would need to fix sharding for this to go out...

          I am in a holding pattern until a committer helps.

          Nguyen Kien Trung added a comment - - edited

          I'm using Solr 3.2. Instead of patching, I extend SimpleFacets and FacetComponent, apply the changes mentioned in SOLR-2242.solr3.1.patch with a small fix (SOLR-2242.solr3.1-fix.patch).

          int offset = params.getFieldInt(facetValue, FacetParams.FACET_OFFSET, 0);
          ....
          // counts holds only the terms from the given offset onward, so add it back.
          resCount.add("numTerms", counts.size() + offset);
          

          since counts contains the list of terms starting from the given offset.

          It accepts the param facet.numTerms=true|false and produces the output:

          <lst name="facet_fields">
             <lst name="color">
                <int name="numTerms">124</int>
              <lst name="counts">
                    <int name="red">4</int>
                    <int name="blue">3</int>
                </lst>
             </lst>
          </lst>
          

          Not yet tested with sharding
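
          A minimal, standalone sketch of the arithmetic behind this fix (the class and method names here are illustrative, not Solr API): since the returned page of counts starts at facet.offset, the skipped terms must be added back. Note this only yields the true distinct count when facet.limit=-1, so that the page runs to the end of the term list.

          ```java
          import java.util.Arrays;
          import java.util.List;

          public class NumTermsWithOffset {
              // The counts page starts at `offset`, so the distinct-term total is
              // the page size plus the terms skipped before it. Assumes
              // facet.limit=-1, i.e. the page extends to the end of the term list.
              static int numTerms(List<String> pagedTerms, int offset) {
                  return pagedTerms.size() + offset;
              }

              public static void main(String[] args) {
                  // A full term list of 5 distinct terms, paged with facet.offset=2.
                  List<String> page = Arrays.asList("green", "red", "yellow");
                  System.out.println(numTerms(page, 2)); // 5
              }
          }
          ```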

          Trinh Trung Kien added a comment -

          Hi,

          I applied the patch to SOLR 4.0 trunk, revision 1140474. The patch seems to work OK, but I observe several issues:

          • I have one field indexed as an integer:
            <field name="cell_id" type="integer" indexed="true" stored="true"/>

          When I search for cell_id:[900 TO 1000], there are no results (I actually have lots of data with cell_id between 900 and 1000).
          Then I search for cell_id:[1000 TO *], which should return data with cell_id>=1000; however, it returns all the records, so the condition doesn't seem to be applied.

          Can you confirm that I'm using the correct version and revision?

          here is my svn info for the trunk:

          URL: http://svn.apache.org/repos/asf/lucene/dev/trunk
          Repository Root: http://svn.apache.org/repos/asf
          Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
          Revision: 1140474
          Node Kind: directory
          Schedule: normal
          Last Changed Author: chrism
          Last Changed Rev: 1140408
          Last Changed Date: 2011-06-27 21:52:53 -0500 (Mon, 27 Jun 2011)

          Guna C added a comment -

          Hi Bill
          I wanted to add that this is a great patch. It provides a way to analyze which search terms are effective without having to retrieve all the docs themselves. I was looking for a patch for 3.3.0. Does the latest one work?
          Thanks
          -guna

          Bill Bell added a comment -

          OK, I like the warning message idea. Also, it does depend on the sharding approach, since some people shard by date... In many of those cases the maxTerms would do what I need.

          List:

          1. Change the facet.field format.
          2. Get it working with sharding.
          3. Change code to cache the numFacetTerms/numTerms and remove the code that caches the huge term list.

          I can do all of this except would like some help with #3.

          Bill

          Chris Male added a comment -

          I really want to avoid having to load the list just to calculate the counts; it seems unnecessary and a waste of memory. I think we should start simple and implement what you originally suggested.

          Ryan McKinley added a comment -

          Ya, always sending the whole list seems like asking for problems. You can control how many terms it passes around with facet.limit, and we could potentially add a warning message to the response if that is less than the total number of terms.

          Maybe we could also have facet.distrib.limit or something, that would bump up the number that it internally asks for, but still respect facet.limit for the final result?

          Chris Male added a comment -

          I don't think it's realistic to send back the whole list; it could be huge! Besides, in the situation where we are only doing counts, we aren't going to store the list anywhere. The distributed environment is never going to be perfect here; Ryan's and my suggestion is to send the minimum and maximum number of constraints there could be.

          Bill Bell added a comment -

          To make this work right with distribution, it seems that it might be more complicated... Wouldn't you have to send the full list of facet terms, consolidate them, and then loop to get the distinct number? That is why I originally sent the WHOLE list of facets, and just added the magic number to the end.

          One machine:

          male: 10000
          numFacetTerms: 1

          Another machine:

          female: 7000
          male: 500
          numFacetTerms: 2

          The numFacetTerms we want is 2, since if you combine them and loop you get 2:

          male: 10500
          female: 7000
          numFacetTerms: 2

          If we instead add the per-shard numFacetTerms values, you get 1+2 = 3.

          The other 2 are easier:

          distribMaxTerms: 2
          distribSumTerms: 3

          This is not ideal but may be acceptable; the perfect solution is to send the whole list, dedupe it, and then count... Thoughts?
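
          The merge-and-dedupe step described above can be sketched as follows (a standalone illustration with hypothetical names, not code from the patch): merge each shard's term counts into one map, then count the distinct keys.

          ```java
          import java.util.Arrays;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          public class ShardDistinctMerge {
              // Merge per-shard facet counts by term, then count the distinct keys:
              // the "send the whole list, dedupe, then count" approach.
              static int mergedNumFacetTerms(List<Map<String, Integer>> shards) {
                  Map<String, Integer> merged = new HashMap<>();
                  for (Map<String, Integer> shard : shards) {
                      for (Map.Entry<String, Integer> e : shard.entrySet()) {
                          merged.merge(e.getKey(), e.getValue(), Integer::sum);
                      }
                  }
                  return merged.size();
              }

              public static void main(String[] args) {
                  Map<String, Integer> shard1 = new HashMap<>();
                  shard1.put("male", 10000);
                  Map<String, Integer> shard2 = new HashMap<>();
                  shard2.put("female", 7000);
                  shard2.put("male", 500);
                  // Summing per-shard numFacetTerms would give 1 + 2 = 3, but the
                  // true distinct count after merging is 2.
                  System.out.println(mergedNumFacetTerms(Arrays.asList(shard1, shard2))); // 2
              }
          }
          ```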

          Ryan McKinley added a comment -

          Perhaps we should return the maximum and sum of all shard counts? That way, assuming the client knew how many shards exist, they could handle most scenarios.

          Once we change the output format, we should be able to add a few things to the output. Perhaps something like

          <lst name="text">
              <int name="numTerms">385</int>
              <int name="distribMaxTerms">385</int>
              <int name="distribSumTerms">845</int>
              <lst name="counts">
                ...
          
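          The two numbers in this proposal bound the true distinct count across shards: it is at least the largest per-shard count (terms can overlap completely) and at most their sum (no overlap). A small sketch with hypothetical helper names:

          ```java
          import java.util.Arrays;
          import java.util.List;

          public class ShardTermBounds {
              // Lower bound: the true distinct count is at least the largest
              // per-shard term count (all shards could share the same terms).
              static int distribMaxTerms(List<Integer> shardCounts) {
                  return shardCounts.stream().max(Integer::compare).orElse(0);
              }

              // Upper bound: the true distinct count is at most the sum of the
              // per-shard term counts (no shard shares any term with another).
              static int distribSumTerms(List<Integer> shardCounts) {
                  return shardCounts.stream().mapToInt(Integer::intValue).sum();
              }

              public static void main(String[] args) {
                  // Matches the example response above: two shards reporting 385
                  // and 460 distinct terms.
                  List<Integer> counts = Arrays.asList(385, 460);
                  System.out.println(distribMaxTerms(counts)); // 385
                  System.out.println(distribSumTerms(counts)); // 845
              }
          }
          ```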
          Chris Male added a comment - - edited

          That seems reasonable – though I think we would also want to be able to have the sum when you know that all shards have unique values.

          Perhaps we should return the maximum and sum of all shard counts? That way, assuming the client knew how many shards exist, they could handle most scenarios.

          I don't think Bill is referring to the accuracy/meaning of distinct counts in distributed search. His problem is that if we change the output format, we also need to update the code that collects the various values and passes them along. This patch just adds a magic value (numFacetTerms) to the count list so that the value is handled by the existing distributed response parsing. This is a fine one-off solution, but I am -1 on adding any more magic field names to Solr. To add this feature, I think we need to bite the bullet and update the facet response format.

          Absolutely. I hadn't even considered the prospect of not changing the distributed response parsing.

          Ryan McKinley added a comment -

          The simplest option seems to be to return the max constraint count taken from all the shards

          That seems reasonable – though I think we would also want to be able to have the sum when you know that all shards have unique values.

          I don't think Bill is referring to the accuracy/meaning of distinct counts in distributed search. His problem is that if we change the output format, we also need to update the code that collects the various values and passes them along. This patch just adds a magic value (numFacetTerms) to the count list so that the value is handled by the existing distributed response parsing. This is a fine one-off solution, but I am -1 on adding any more magic field names to Solr. To add this feature, I think we need to bite the bullet and update the facet response format.

          Chris Male added a comment -

          Having walked through the SimpleFacets codebase, I see PerSegmentSingleValuedFaceting has already introduced a FacetCollector. I think we should take this and use it throughout all the different faceting 'Strategies'. That way we can push the counting of constraints into the Collector.

          I've also thought about the distribution issue. The simplest option seems to be to return the max constraint count taken from all the shards. With this, no matter whether shards have distinct or overlapping constraint sets, clients can always treat it as the minimum number of constraints that exist.
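The bound described here can be stated concretely: the per-shard maximum is a guaranteed lower bound on the merged distinct count, and the per-shard sum is an upper bound. A minimal sketch (class and method names are hypothetical, not from the patch):

```java
import java.util.Arrays;

public class ShardConstraintBounds {
    // Each shard reports its own distinct-constraint count. The true merged
    // count lies between the max (all shards share the same constraints)
    // and the sum (all shards have completely disjoint constraints).
    static int lowerBound(int[] perShardCounts) {
        return Arrays.stream(perShardCounts).max().orElse(0);
    }

    static int upperBound(int[] perShardCounts) {
        return Arrays.stream(perShardCounts).sum();
    }

    public static void main(String[] args) {
        int[] counts = {385, 460}; // hypothetical per-shard distinct counts
        System.out.println(lowerBound(counts) + " <= true distinct <= " + upperBound(counts));
    }
}
```

Returning the max alone is safe as a minimum; returning both bounds lets a client that knows its data is shard-unique use the sum instead.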

          Chris Male added a comment -

          I'm just jumping into this issue and considering the problem of loading all constraints just to get their size (or in fact, not wanting to do this). Is there scope in SimpleFacets to have some sort of 'Collector' idea added? That way it would be easy to choose whether we want to collect the constraints, their counts, and the total number of constraints, or whether we just want the total number.

          Does anybody have any thoughts on the distribution issue?

          Bill Bell added a comment -

          Simon - thoughts?

          Bill Bell added a comment -

          Yonik,

          Yes I know about groups.ngroups. But the use case still stands. We need a way to add up facet terms without actually counting them.

          I had restructured the facet_fields XML as you recommended (twice). And the issue is that it breaks ALL sharding. The reason it breaks distribution is that it is looking for <int> and not <lst>... Several people have wanted me to change the name to count, to term, to distinct... I really don't care what the name is, since it makes sense when you try it. I think changing the distribution is a MUCH larger project. If you want to jump in on the sharding/distribution to make it work with lists, then please help. The format change is a HUGE issue. The magic names could also be an issue, but ONLY if you use this new feature. It is not an issue for all APIs and usage - which is why I added it as a magic variable.

          Do we have any examples with Boolean? I have not seen any... Do we use True/False or on/off? Do you mean like facet=true ? The reason why I have a 1 and 2 is to get the count of terms, but only return a smaller set (internal limit=-1, but user types limit=5). That is the reason for that. I believe it is very useful.

          Having the numFacetTerms like every other term pretty much works with sharding/distribution. It just adds it together like any other facet count. One server returns 5, and the other returns numFacetTerms=10, and the combined result returns 15. It may break some new feature with distribution or something I am not aware of and not using...

          Concerning building in memory. Having it cached is what I was trying to achieve. If there is another way to cache the result then let me know other options. Not having it cached at all is a huge performance problem. If you are using mode 2, it does not matter that much since you need to return the list and in most cases you have it in memory... Mode 1 hides it a bit and builds the entire list in memory when we only need to cache the one value... Again - without breaking something else, not sure how to achieve that.

          As long as there are no more gotchas in distribution, most of the other things you are listing (XML, name change, boolean) are almost preferences, while the XML format change will be a huge issue - so should we be able to commit? Also, I would like to avoid caching the entire list in memory when using this - I need some assistance.

          1. Any other distribution/sharding issues with adding a magic variable in facet_field for a new feature?
          2. Where and how do we store a cached value without using the array that is present, so we don't cache the whole facet term list when we only need to cache the resulting number?

          Thanks.

          Yonik Seeley added a comment -

          This issue was a bit tricky to review, given that the output doesn't seem to quite match the examples.
          I also wasn't exactly sure what the latest patch was, so I just looked at the patch uploaded on 28/Jun/11.

          Here's my summary on what the patch currently does:

          If you add facet.facetTermCounts=2 to a faceting request, you get the following:

          <lst name="facet_fields">
            <lst name="text">
              <int name="electronics">14</int>
              <int name="inc">8</int>
              <int name="2.0">5</int>
              <int name="lcd">5</int>
              <int name="memory">5</int>
              <int name="numFacetTerms">385</int>
            </lst>
          </lst>
          

          If you add facet.facetTermCounts=1 to a faceting request, you get the following:

          <lst name="facet_fields">
            <lst name="text">
              <int name="numFacetTerms">385</int>
            </lst>
          </lst>
          

          w.r.t. the interface, I agree with a number of Lance's observations.

          • facet.numFacetTerms name: the second "Facet" is a bit redundant. And we probably should be talking in terms of "constraints" instead of "terms". Perhaps facet.numConstraints, or facet.nconstraints (to be consistent with group.ngroups).
          • facet.nconstraints should just be a boolean... no need for "1" or "2". If the user doesn't want to see any constraints, then they can set facet.limit=0. This is also consistent with grouping.
          • we're mixing units in the same list, and that's probably not a great idea? Constraints have units of documents (number of documents that matched that constraint) while "numFacetTerms" has units of number of constraints.
          • I think this also breaks distributed faceting due to mixing of units? The distributed faceting code thinks that numFacetTerms is a constraint.
          • We need to figure out what we are going to do in distributed mode... it doesn't seem easy to actually figure out the number of constraints without streaming them all back and merging (i.e. you can't just add up the numbers)
          • I also agree that we should not build the entire list in memory just to get the size of that list.

          It seems like rather than adding more magic names to the list (and risk a real collision with the actual name of a constraint), we should add more structure to the response, as previously discussed.
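To see why the magic-name approach is risky: if any document's field value tokenizes to the literal string "numFacetTerms", its facet constraint becomes indistinguishable from the injected magic entry. A tiny hypothetical sketch (the class and method names are invented for illustration):

```java
import java.util.Set;

public class MagicNameCollision {
    // A magic entry injected into the flat count list collides whenever a
    // real indexed term happens to equal the magic name.
    static boolean collides(Set<String> facetTerms, String magicName) {
        return facetTerms.contains(magicName);
    }

    public static void main(String[] args) {
        // A document containing the literal token "numFacetTerms" would
        // produce a constraint that shadows the injected total.
        Set<String> terms = Set.of("electronics", "numFacetTerms");
        System.out.println(collides(terms, "numFacetTerms"));
    }
}
```

A structured response (the total in its own element, the counts in a nested list) removes this class of collision entirely, since the total no longer shares a namespace with term values.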

          So if we added facet.nconstraints=true, we would get

          <lst name="facet_fields">
            <lst name="text">
              <int name="numFacetTerms">385</int>
              <lst name="counts">
                <int name="electronics">14</int>
                <int name="inc">8</int>
                <int name="2.0">5</int>
                <int name="lcd">5</int>
                <int name="memory">5</int>
             </lst>
            </lst>
          </lst>
          

          And when we use this new format, we should consider using a separate "missing" name for facet.missing=true instead of using the null name in with the counts.

          This format change is where we need to be careful about back compat - this interface is one of the widest used and with all the 3rd party clients and libraries out there, we should still support the old format via a facet.format parameter or something.

          Bill: You originally opened this issue for use with grouping to get the total number of groups. Are you aware of the group.ngroups parameter that was added that does this?

          Bill Bell added a comment -

          Just replace this test file to fix the insanity.

          Simon Willnauer added a comment -

          Are we ready to commit?

          Bill, isn't there still a test failure on this issue related to FC? Yonik mentioned BW compat issues here and promised to comment. I will ping him again.

          thanks for the patience

          simon

          Bill Bell added a comment -

          Are we ready to commit?

          Bill Bell added a comment -

          Thanks... If you look at my tests that I commented out, you will notice you get the Insane FieldCache usage(s) problem.

          It does it every time on my PC...

          This patch does not appear to have any issues until you pull in the group issue.

          Simon Willnauer added a comment -

          Bill, thanks for the unit test. I need to look into the FieldCache issue before we go further, though. That said, I don't see an NPE here.

          I fixed some whitespace issues in the patch and refactored your impl to use a switch statement instead of if/else, which I think is less verbose and has less duplication, but as you said that's mainly a style issue.

          I will look into the FC issue and move forward here ASAP. Thanks Bill

          Bill Bell added a comment -

          I left the group in there, we can uncomment when it starts working again (if it does).

          Bill Bell added a comment -
          junit-sequential:
              [junit] Testsuite: org.apache.solr.request.NumFacetTermsFacetsTest
              [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.48 sec
              [junit] 
          

          I fixed the NamedList() generic too.

          Bill Bell added a comment -

          I think it has to do with an NPE in grouping on 4.0; it fails in other code. Nothing to do with this patch.

          
            assertQ("check group and facet counts with numFacetTerms=1",
                      req("q", "id:[1 TO 6]"
                          ,"indent", "on"
                          ,"facet", "true"
                          ,"group", "true"
                          ,"group.field", "hgid_i1"
                          ,"f.hgid_i1.facet.limit", "-1"
                          ,"f.hgid_i1.facet.mincount", "1"
                          ,"f.hgid_i1.facet.numFacetTerms", "1"
                          ,"facet.field", "hgid_i1"
                          )
                      ,"*[count(//arr[@name='groups'])=1]"
                    ,"*[count(//lst[@name='facet_fields']/lst[@name='hgid_i1']/int)=1]" // only 1 entry (numFacetTerms)
                      ,"//lst[@name='hgid_i1']/int[@name='numFacetTerms'][.='4']"
                      );
          
          
          Bill Bell added a comment -

          The test case gives an error. I am not familiar with this error.

          Bill Bell added a comment -

          OK. Here are some test cases.

          I am getting a weird error on running it: ant -Dtestcase=NumFacetTermsFacetsTest test

          junit-sequential:
              [junit] Testsuite: org.apache.solr.request.NumFacetTermsFacetsTest
              [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.072 sec
              [junit] 
              [junit] ------------- Standard Error -----------------
              [junit] NOTE: reproduce with: ant test -Dtestcase=NumFacetTermsFacetsTest -Dtestmethod=testNumFacetTermsFacetCounts -Dtests.seed=3921835369594659663:-3219730304883530389
              [junit] *** BEGIN org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) ***
              [junit] SUBREADER: Found caches for descendants of DirectoryReader(segments_3 _0(4.0):C6)+hgid_i1
              [junit] 	'DirectoryReader(segments_3 _0(4.0):C6)'=>'hgid_i1',class org.apache.lucene.search.FieldCache$DocTermsIndex,org.apache.lucene.search.cache.DocTermsIndexCreator@603bb3eb=>org.apache.lucene.search.cache.DocTermsIndexCreator$DocTermsIndexImpl#1026179434 (size =~ 372 bytes)
              [junit] 	'org.apache.lucene.index.SegmentCoreReaders@7e8905bd'=>'hgid_i1',int,org.apache.lucene.search.cache.IntValuesCreator@30781822=>org.apache.lucene.search.cache.CachedArray$IntValues#291172425 (size =~ 92 bytes)
              [junit] 
              [junit] *** END org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) ***
              [junit] ------------- ---------------- ---------------
              [junit] Testcase: testNumFacetTermsFacetCounts(org.apache.solr.request.NumFacetTermsFacetsTest):	FAILED
              [junit] org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) found expected:<0> but was:<1>
              [junit] junit.framework.AssertionFailedError: org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) found expected:<0> but was:<1>
              [junit] 	at org.apache.lucene.util.LuceneTestCase.assertSaneFieldCaches(LuceneTestCase.java:725)
              [junit] 	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:620)
              [junit] 	at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:96)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1430)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1348)
              [junit] 
              [junit] 
              [junit] Test org.apache.solr.request.NumFacetTermsFacetsTest FAILED
          
          
          Bill Bell added a comment -

          Just so you know I have been using the original patch in production for over 5 months. I would say that the original one is tested.

          But now that we are changing it, I agree that we need more coverage.

          That will be my #1 priority.

          Bill Bell added a comment -

          re: whitespace

          What are the settings supposed to be for tabs? On my editor it looks perfect. 4 spaces? Tabs? 2 spaces per tab?

          I will add some tests.

          I think switching from if to switch and the move to termList != null is mostly just style and does not really improve anything. I actually think it confuses things and makes the overall patch larger, with more risk that we miss something or mess it up.

          I will also look at the Integer generic... Thanks.

          Bill

          Simon Willnauer added a comment -

          New patch ready for commit?

          Bill, I still see lots of whitespace/indentation problems in the latest patch. Anyway, I looked at it and I wonder if we could restructure this a little: first check if termList != null and handle all the cases there, and if termList == null get the TermCountsLimit. That would remove all the redundant getTermCountsLimit / getListedTermCounts calls. The termList == null case seems very easy and straightforward:

                     if (termList != null) {
                       NamedList<Integer> counts = getListedTermCounts(facetValue, termList);
                       switch (numFacetTerms) {
                       case COUNTS:
                         // capture the size before discarding the per-term entries,
                         // otherwise counts.size() below would always report 0
                         final int numTerms = counts.size();
                         counts = new NamedList<Integer>();
                         counts.add("numFacetTerms", numTerms);
                         break;
                       case COUNTS_AND_VALUES:
                         counts.add("numFacetTerms", counts.size());
                         break;
                       }
                       res.add(key, counts);
                     } else {
                       ...
          

          Yet it's hard to refactor this without a single test (note, there might be a bug). I would be really happy to see a test case for this that exercises all the variations.
          Regarding the constants, I think the default case should be a constant too. If you use NamedList, please make sure you put the right generic type on it if possible; otherwise my IDE goes wild and adds warnings all over the place. In your case NamedList<Integer> works fine.
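          To make the suspected bug concrete: in the snippet above, the COUNTS case replaces counts with an empty list before counts.size() is read, so the reported size would be 0. Below is a minimal, self-contained sketch of the corrected fall-through; it uses java.util.LinkedHashMap as a stand-in for Solr's NamedList, and the enum and method names are hypothetical, not the actual patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in illustration only: LinkedHashMap replaces Solr's NamedList,
// and Mode/apply are made-up names for this sketch.
class NumFacetTermsSketch {
    enum Mode { COUNTS, COUNTS_AND_VALUES }

    static Map<String, Integer> apply(Map<String, Integer> counts, Mode mode) {
        int numTerms = counts.size(); // capture the size BEFORE any replacement
        switch (mode) {
        case COUNTS:
            counts = new LinkedHashMap<>(); // drop the per-term values
            // intentional fall-through: both modes report the term count
        case COUNTS_AND_VALUES:
            counts.put("numFacetTerms", numTerms);
            break;
        }
        return counts;
    }
}
```

          Capturing the size into a local before swapping in the empty list is the whole fix; the fall-through itself is then safe.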

          simon

          Bill Bell added a comment -

          New patch ready for commit?

          Bill Bell added a comment -

          Simon,

          I made all those changes except for the termList one. I think it is useful to have the count based on terms.

          See attachment.

          Simon Willnauer added a comment -

          Hey bill,
          I looked at your patch and I have some comments:

          • you should fix the white-space within the try {} catch block in SimpleFacets
          • I think you should also make the constant name consistent with the facet parameter: s/FACET_NAMEDISTINCT/FACET_NUM_FACET_TERMS/
          • as Lance noted (in a not necessarily appropriate tone, but that is a different issue), switch to a constant / enum rather than a number, something like [ COUNTS, COUNTS_AND_VALUES ]
          • if the termList is not null, the results are all implicit, meaning it is always the number of terms you specify in the term list, right? I think we should not support this, i.e. only compute the count if no term list is specified
          • if you are asking for COUNTS_AND_VALUES (the 2 case), it seems we should check whether the limit is already -1 so we don't compute that twice
          • I think you should use a switch / case or an if / else construct instead of three plain if statements

          I only considered the last patch you uploaded; let me know if I should look at something else.

          Simon

          Simon Willnauer added a comment -

          Bill, this seems like an important issue: many votes etc. I am traveling right now, so give me some days to come back and I will work with you to get this done.
          Thanks for your patience

          simon

          Bill Bell added a comment -

          Lance,

          This patch just takes the # of lines coming out of the facet section for a field and tells you how many you have.

          It does not do anything to change the facet, or deal with white space, or anything complicated.

          This is a simple counter.

          Bill

          Bill Bell added a comment -

          Thanks Mike.

          I think it is committable since shards work now. We might need to fix some broken tests (and I am willing to do that).

          Then we can move to range and queries...

          Thanks.

          Bill Bell added a comment -

          Lance,

          There are literally 15 lines of code changes; I'm not sure how you cannot follow it. I could use no memory and just loop through the results, but that would not be cached, so the speed would still be slow since I need to pull in the array in order to count it.

          The parameter is not called namedistinct anymore... It is called facet.numFacetTerms and takes 2, 1, or 0.

          All other parameters are good. Also, you do not need anything else to get it to work, since I set the defaults to work for you now.

          I'll see if I can write some more tests. Here is the rub: I would be happy to write hundreds of test cases if I knew someone was actually going to help me get this done. I am used to having a committer work with me; Mike McCandless is awesome and we worked on several issues together. But I have seen tons of features die when no one is willing to help. So here I am: wanting, willing, and able to get this done, and with no one willing to assist from a committer perspective... The patch works fine in sharded and normal mode, so people can use it today. It is just not committed.

          I have 4 clients using it in production and one has 100M page views a year, and so far no problems.

          http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price

          Lance Norskog added a comment -

          There is a lot of complexity here, and having a bunch of orthogonal parameters is not quite enough. Looking at everything around facets, and group collapse, and the join trick, the Solr query syntax looks like the database world right before SQL.

          Lance Norskog added a comment -

          If I were a committer, which I'm not, I would demand:

          • params would be as simple as possible. 'namedistinct' would be a symbol, like 'facet.method=enum'. Facets have exploded in complexity, and I can't follow how everything interlocks. The API may have to change later.
          • no white-space glitches
          • consistency, consistency, consistency
          • there has to be a way to use less memory when we're only pulling a count
          • unit tests. It's somewhat unfair to expect you to write all the unit tests required to make sure this does not break anything else, given that so much of the facet feature set has no tests.

          Anyway, food calls. Hope this helps.

          Lance Norskog added a comment -

          Yeah, my itch started just now also

          "Constraint" means any facet value: terms, numerical ranges, query results.

          Range queries have the same situation: when I give range endpoints and a gap, I want to know how many intervals it made from the gap. That would be the analog of this count.

          I'm not saying this patch has to do range counts also, but pointing out the eventual scope of this feature. Therefore, 'numTerms' is not the word we're looking for. 'count' or 'total' seem right.

          Below, both features:{ and popularity:{ need counts.

          "facet_counts":{
              "facet_queries":{
                "*:*":27},
              "facet_fields":{
                "features":[
                  "facet_terms",[
                    "2",7]]},
              "facet_ranges":{
                "popularity":{
                  "counts":[
                    "0",3,
                    "2",0,
                    "4",1,
                    "6",9],
                  "gap":2,
                  "start":0,
                  "end":8}}}}
          

          p.s.
          I got the above from the example electronic shop database with this query:
          click to see

          Mark Miller added a comment -

          Hmm... yeah, a fair amount of work went on here and a fair amount of interest... Unfortunately, not my field (and I'm sick, on vacation, out of the country, and blah blah blah). But if no one takes this, I can get up to speed eventually; I doubt that will be soon, though. Sorry Bill, there are not a lot of committers fluent in this area who are not very busy with other things.

          Bill Bell added a comment -

          Can we PLEASE commit this? What else do we need to add?

          Bill Bell added a comment -

          It would be easier for sharding to not have multiple lists... I could use some help if we want to change it, since I have not played with FacetComponent.java.

          Otherwise, it would be a simpler fix to just add it and flatten the lists.

          <lst name="facet_fields">
            <lst name="price">
              <int name="numFacetTerms">14</int>
              <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
            </lst>
          </lst>
          

          Not ideal, but easier for v1? I could also just remove numFacetTerms=2 for now.

          It will only require an if statement to ignore the type check for "numFacetTerms".
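          That "if statement" idea can be sketched like this (hypothetical method and type names; the real code merges Solr NamedList structures in FacetComponent, not plain maps): when merging per-shard facet lists, skip the special "numFacetTerms" entry instead of treating it as a term value, which is what triggers the NumberFormatException on numeric fields reported below in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: merge per-shard facet counts, skipping the special
// "numFacetTerms" entry so it is never parsed as a field value.
class ShardMergeSketch {
    static final String NUM_FACET_TERMS = "numFacetTerms";

    static void addShardCounts(Map<String, Long> merged, Map<String, Long> shard) {
        for (Map.Entry<String, Long> e : shard.entrySet()) {
            if (NUM_FACET_TERMS.equals(e.getKey())) {
                continue; // metadata, not a term: ignore it during the merge
            }
            // sum this shard's count for the term into the merged total
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }
}
```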

          Here is a patch that works with sharding.

          http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price

          Enjoy.

          Bill

          Bill Bell added a comment -

          Since we changed the output of facet_fields, FacetComponent.java needs to change. This also impacts the DistribFieldFacet type. The current code is not going to work, since price no longer has just a flat list of numbers; it now has multiple lists (if we set the param). We might want to always return the "counts" list in all cases; then sharding can easily pick up on this. DistribFieldFacet needs to be refactored.

          <lst name="facet_fields">
            <lst name="price">
              <int name="numFacetTerms">14</int>
              <lst name="counts"><int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
              </lst>
            </lst>
          </lst>
          
          Bill Bell added a comment -

          From Rajani:

          The SOLR-2242 patch for getting the count of distinct facet terms doesn't work for distributedProcess.

          (https://issues.apache.org/jira/browse/SOLR-2242)

          The error log says

          HTTP ERROR 500
          Problem accessing /solr/select. Reason:

          For input string: "numFacetTerms"

          java.lang.NumberFormatException: For input string: "numFacetTerms"
            at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
            at java.lang.Long.parseLong(Long.java:403)
            at java.lang.Long.parseLong(Long.java:461)
            at org.apache.solr.schema.TrieField.readableToIndexed(TrieField.java:331)
            at org.apache.solr.schema.TrieField.toInternal(TrieField.java:344)
            at org.apache.solr.handler.component.FacetComponent$DistribFieldFacet.add(FacetComponent.java:619)
            at org.apache.solr.handler.component.FacetComponent.countFacets(FacetComponent.java:265)
            at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:235)
            at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
            at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
            at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
            at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
            at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
            at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
            at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
            at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
            at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
            at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
            at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
            at org.mortbay.jetty.Server.handle(Server.java:326)
            at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
            at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
            at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
            at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
            at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
            at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
            at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

          The query I passed :
          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=2&facet.field=648&facet.mincount=1&facet.limit=-1&f.2.facet.numFacetTerms=1&rows=0&shards=localhost:8983/solr,localhost:8985/solrtwo

          Can anyone suggest the changes I need to make to enable the same functionality for shards?

          When I do it across a single core, I get the correct results. I have applied the SOLR-2242 patch to Solr 1.4.1.

          Awaiting your reply.

          Regards,
          Rajani

          Bill Bell added a comment -

          OK. Can you point me in the right direction. Are you a committer? Can we get this committed?

          Lance Norskog added a comment -

          There is a lot of logic in getListedTermCounts() and getTermCountsLimit(). If we optimize, and just add a counter, we need to make sure the new methods are not forgotten about (test cases?). I have seen that happen numerous times.

          Ayup. In fact this breaks SimpleFacetsTest. Everything in facets needs tests.

          Bill Bell added a comment -

          It would be good to be able to cache the value, instead of building a list that is cached too.

          Bill Bell added a comment -

          Also I thought you wanted to change the name to numNames? I am okay with numTerms too.

          Bill Bell added a comment -

          I am not seeing the performance problem.

          If you are outputting facets anyway, the loop and list are going to be built, so in that case it is about as efficient as it can be.
          That is why I had the 0/1/2: I was reusing the code and just looking at the list size:

          countFacetTerms.size()
          counts.size()

          There is a lot of logic in getListedTermCounts() and getTermCountsLimit(). If we optimize, and just add a counter, we need to make sure
          the new methods are not forgotten about (test cases?). I have seen that happen numerous times.

          Lance Norskog added a comment -

          I changed it to 'facet.numTerms'.

          There is still a big performance problem: numTerms builds the entire list of facets and then reports the length of the list. This could be done more efficiently.

          Jonathan Rochkind added a comment -

          Wonderful, much better. Thanks Lance, this is a much clearer and more flexible API, consistent with other parts of Solr. (And thanks Bill, for a feature I could definitely use.)

          But I wonder... should it be facet.numTerms, to group it with the other faceting-related params? Or wait, is it already?

          Lance Norskog added a comment - - edited

          Putting up or shutting up.

          This splits apart whether to count terms vs. whether to count docs per term. They are independent concepts.

          Instead of 'numFacetTerms=0/1/2' it is 'numTerms=true/false'.
          If you set 'numTerms=true', it counts terms.
          If you set 'facet.limit=0', it does not do the facet search and does not count docs per term.
          If you set 'numTerms=false' and 'facet.limit=0', it does nothing.

          'numFacetTerms' is redundant; we know it's all about facets. Thus, 'numTerms'.
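          The four combinations described above can be sketched as a tiny truth table; the method name and return strings here are made up for illustration, not the actual Solr implementation.

```java
// Illustrative sketch of the proposed parameter semantics.
class FacetPlanSketch {
    static String plan(boolean numTerms, int facetLimit) {
        boolean countTerms = numTerms;              // numTerms=true: count distinct terms
        boolean countDocsPerTerm = facetLimit != 0; // facet.limit=0: skip the facet search
        if (countTerms && countDocsPerTerm) return "terms+docs";
        if (countTerms)                     return "terms only";
        if (countDocsPerTerm)               return "docs only";
        return "nothing";
    }
}
```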

          Bill Bell added a comment - - edited

          Lance Norskog,

          What do you want it to be called? I could use a committer to take this issue on. It has several votes and lots of downloads, and people are using it successfully already.

          Do you want me to switch numFacetTerms to numFacetNames? Anything else? I feel like we are going in circles on this issue.

          
          This will output the numFacetTerms AND hgid:
          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hgid&facet.mincount=1&f.hgid.facet.numFacetNames=2
          
          <lst name="facet_fields">
            <lst name="hgid">
             <int name="numFacetNames">7</int>  <!-- this is not 11 -->
             <lst name="counts">
             	<int name="HGPY0000045FD36D4000A">1</int>
             	<int name="HGPY00000FBC6690453A9">1</int>
             	<int name="HGPY00001E44ED6C4FB3B">1</int>
             	<int name="HGPY00001FA631034A1B8">1</int>
             	<int name="HGPY00003317ABAC43B48">1</int>
             	<int name="HGPY00003A17B2294CB5A">5</int>
             	<int name="HGPY00003ADD2B3D48C39">1</int>
             </lst>
            </lst>
          </lst>
          
          
          James Dyer added a comment -

          I noticed that with the original patch applied, SimpleFacetsTest would fail. The cause is a small backwards-compatibility bug: the patch wrapped the counts with a "counts" element in the response. That is valid when the "namedistinct" param is used, but when a user doesn't specify it, the old behavior should be unchanged. This updated patch corrects the issue and SimpleFacetsTest now passes.
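
          The shape of such a fix can be sketched like this (illustrative only; this is not the actual patch code, and the helper names are invented):

          ```java
          // Illustrative sketch: the "counts" wrapper appears only when the
          // distinct-count feature is requested, so the legacy flat response
          // shape is preserved for everyone else.
          import java.util.LinkedHashMap;
          import java.util.Map;

          public class FacetResponseShape {
              static Map<String, Object> build(Map<String, Integer> termCounts,
                                               boolean namedistinct) {
                  Map<String, Object> field = new LinkedHashMap<>();
                  if (namedistinct) {
                      field.put("numFacetTerms", termCounts.size());
                      field.put("counts", termCounts); // new, wrapped shape
                  } else {
                      field.putAll(termCounts);        // legacy, flat shape
                  }
                  return field;
              }
          }
          ```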

          Lance Norskog added a comment - - edited

          From the patch:

          public static final String FACET_NAMEDISTINCT = FACET + ".numFacetTerms";

          So, in this issue, a name is what everything else calls a term, and a value is what everyone else calls a "count of documents with this term in this field". Please change this in the patch.

          Bill Bell added a comment -

          OK how do we get this committed?

          Dmitry Drozdov added a comment -

          Thanks for the patch!
          It also works for version 3.1; only the line numbers differ. Attaching the adapted patch for 3.1 just in case.

          Bill Bell added a comment - - edited

          Can someone look this patch over?

          Also requested a +1 from Isha Garg <isha.garg@orkash.com>.

          Thanks.

          Bill Bell added a comment -

          OK, I did the required work. Can we get more feedback, or can this be committed? What else is needed?

          Bill Bell added a comment -

          I am changing it, since there is an existing example of mixed upper/lower case:

          facet.enum.cache.minDf

          Otis Gospodnetic added a comment -

          Would this be more consistent? facet.numfacetterms => facet.numFacetTerms

          Bill Bell added a comment - - edited

          OK this is complete.

          Sample query:

          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=cat&rows=0&facet.numFacetTerms=2&facet.limit=4

          Sample output:

          <?xml version="1.0" encoding="UTF-8" ?> 
          <response>
            <lst name="responseHeader">
              <int name="status">0</int> 
              <int name="QTime">0</int> 
              <lst name="params">
                <str name="facet.numfacetterms">2</str> 
                <str name="facet">true</str> 
                <str name="q">*:*</str> 
                <str name="facet.limit">4</str> 
                <str name="facet.field">cat</str> 
                <str name="rows">0</str> 
              </lst>
            </lst>
            <result name="response" numFound="17" start="0" /> 
            <lst name="facet_counts">
              <lst name="facet_queries" /> 
              <lst name="facet_fields">
                <lst name="cat">
                  <int name="numFacetTerms">14</int> 
                  <lst name="counts">
                    <int name="electronics">14</int> 
                    <int name="memory">3</int> 
                    <int name="connector">2</int> 
                    <int name="graphics card">2</int> 
                  </lst>
                </lst>
              </lst>
              <lst name="facet_dates" /> 
              <lst name="facet_ranges" /> 
            </lst>
            </response>
          

          In Json:

          "facet_fields":{"cat":["numFacetTerms",14,"counts",["electronics",14,"memory",3,"connector",2,"graphics card",2]]},"facet_dates":{},"facet_ranges":{}}}
          
          Bill Bell added a comment - - edited

          I am going to use your suggestion. You will not have to set the limit. Getting the numFacetTerms will be optional, and you will also be able to NOT get the hgids. I propose this (please comment):

          This will ONLY output the numFacetTerms (no hgid facet counts):
          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hgid&f.hgid.facet.numFacetTerms=1

          This assumes the count is computed as if facet.limit=-1.

          <lst name="facet_fields">
            <lst name="hgid">
             <int name="numFacetTerms">7</int>  <!-- this is not 11 -->
            </lst>
          </lst>
          

          This will output the numFacetTerms AND hgid:
          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hgid&facet.mincount=1&f.hgid.facet.numFacetTerms=2

          <lst name="facet_fields">
            <lst name="hgid">
             <int name="numFacetTerms">7</int>  <!-- this is not 11 -->
             <lst name="counts">
             	<int name="HGPY0000045FD36D4000A">1</int>
             	<int name="HGPY00000FBC6690453A9">1</int>
             	<int name="HGPY00001E44ED6C4FB3B">1</int>
             	<int name="HGPY00001FA631034A1B8">1</int>
             	<int name="HGPY00003317ABAC43B48">1</int>
             	<int name="HGPY00003A17B2294CB5A">5</int>
             	<int name="HGPY00003ADD2B3D48C39">1</int>
             </lst>
            </lst>
          </lst>
          
          Jonathan Rochkind added a comment -

          There is clearly a semantic problem here. I call that the number of 'facet values'; what you are calling a 'name' I am calling a 'facet value'. I honestly have no idea what you are calling a 'value'. I'm pretty sure we're talking about the same thing, but I have no idea what word will mean that to both of us and everyone else.

          I guess what you are calling 'number of values', if I understand properly, I'd call the 'sum of the facet counts'. Facet counts are already called facet counts; summing them up is the sum of them, not a 'number of values'. (I also can't imagine any use case where you'd want a sum of facet counts; for a single-valued field with no facet.missing, the sum of the facet counts will equal the document count, numFound. In other cases it may not, and I have no idea why you'd ever want it then.) But the name is less important than the functionality, I guess. (Except that the lack of consistent terminology in Solr is what leads us to this confusion.) Okay, wait: numFacetTerms, is that maybe clear? 'Terms', since Solr 'terms' are in fact what appear as the values/names in Solr faceting. From the wiki page for facet.field: "It will iterate over each Term in the field and generate a facet count using that Term as the constraint."

          But also perhaps I misunderstood: the functionality is of use/interest to me only if it does NOT require me to set facet.limit=-1 to get this count of distinct values/names/terms. If I'm setting facet.limit=-1 anyway, that number is already implicit in the response, so there is not much value in making it explicit. What I need is a way to get this number without setting facet.limit=-1, since in my use cases I can have a million or more, um, values/names/terms. (Which Solr 1.4.1 with facet.method=fc handles with aplomb!) If your patch only works with facet.limit=-1, it does not actually address my need.

          Bill Bell added a comment - - edited

          No, actually, namedistinct is not the number of values; it is the number of names.

          <lst name="facet_fields">
            <lst name="hgid">
             <int name="HGPY0000045FD36D4000A">1</int>
             <int name="HGPY00000FBC6690453A9">1</int>
             <int name="HGPY00001E44ED6C4FB3B">1</int>
             <int name="HGPY00001FA631034A1B8">1</int>
             <int name="HGPY00003317ABAC43B48">1</int>
             <int name="HGPY00003A17B2294CB5A">5</int>
             <int name="HGPY00003ADD2B3D48C39">1</int>
             </lst>
             </lst>
          

          Becomes:

          <lst name="facet_fields">
            <lst name="hgid">
             <int name="namedistinct">7</int>  <!-- this is not 11 -->
             <lst name="counts">
             	<int name="HGPY0000045FD36D4000A">1</int>
             	<int name="HGPY00000FBC6690453A9">1</int>
             	<int name="HGPY00001E44ED6C4FB3B">1</int>
             	<int name="HGPY00001FA631034A1B8">1</int>
             	<int name="HGPY00003317ABAC43B48">1</int>
             	<int name="HGPY00003A17B2294CB5A">5</int>
             	<int name="HGPY00003ADD2B3D48C39">1</int>
             </lst>
            </lst>
          </lst>
          
          Jonathan Rochkind added a comment -

          If the naming is the sticking point: the value here is the total count of facet values, the number of facet values you'd get if you did facet.limit=-1, but without the need to assemble every facet value in memory and send it across the wire. This is quite analogous to numFound in the main response: the total number of documents matching your query that you'd get if you set rows=-1, but without needing to actually assemble all those and send them across the wire. Is there some way to use this parallelism in the name of the total count of facet values? numFacetsFound?

          Bill Bell added a comment -

          Btw,

          I hope "constraints" means unique names. It is different from the number of constraints. There might be a need for the number of constraints, but that is not what this ticket is for.

          So, I think I am going to reject your proposed naming for mine:

          Proposed:
          "facet fields" : {"hgid" : {
            "missing" : 25,
            "namedistinct" : 25,
            "constraints": 1250,
            "counts" : ["constraint",10,...]
          }}
          

          Those are 2 different things.

          Hide
          Bill Bell added a comment -

          OK. So you like the word "constraints" instead of "namedistinct". I am okay with it.

          I am going to work on this tonight.

          Yonik Seeley added a comment -

          Not sure what "constraints" means?

          It's a facet value like "HGPY0000045FD36D4000A" in your example.

          Would we always include this or just add it as an option?

          It will require disabling certain optimizations, and should thus be optional (and off by default).

          FYI, the 'missing' I threw in is also a different way to represent the count calculated via facet.missing=true, instead of it being added in with the other counts as a null key (which JSON does not support).

          Bill Bell added a comment -

          Thanks.

          Not sure how to get the facet distinct count without looping, but I'll
          look into that. Not sure what "constraints" means?

          I agree that you should not have to specify limit, but mincount should
          apply, since many times I want 1 or higher.

          Would we always include this or just add it as an option?

          f.hgid.facet.namedistinct=1 ?

          Proposed:

          "facet fields" : {"hgid" : {
            "missing" : 25,
            "namedistinct" : 1250,
            "counts" : ["constraint",10,...]
          }}
          

          Then we add others as needed?

          Or do you mean?

          f.hgid.facet.constraints = namedistinct() with the option to specify more
          than one?

          f.hgid.facet.constraints = namedistinct(),missing()

          Proposed:

          "facet fields" : {"hgid" : {
            "constraints" : ["missing()",25,"namedistinct()",1250],
            "counts" : ["constraint",10,...]
          }}
          
          Yonik Seeley added a comment -

          It feels like we should have an option to return the number of constraints that match the criteria (mincount, etc.) without having to specify facet.limit=-1, and you should be able to get this info in addition to the normal facet counts. We can also improve efficiency by not building the complete list in memory just to return its count.
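
          The efficiency point can be sketched as a single pass over the per-term document counts, counting qualifying terms without materializing a term list (illustrative only; real per-segment faceting code is more involved):

          ```java
          // Sketch: count the terms that satisfy facet.mincount in one pass,
          // without building the full term list just to measure its length.
          public class DistinctCounter {
              static int countDistinct(int[] perTermDocCounts, int mincount) {
                  int n = 0;
                  for (int c : perTermDocCounts) {
                      if (c >= mincount) n++; // qualifying term: count it, don't store it
                  }
                  return n;
              }
          }
          ```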

          We've also talked before about having an extra metadata level for each facet.

          Current:

          "facet fields" : {"hgid" : ["constraint",10,...]}
          

          Proposed:

          "facet fields" : {"hgid" : {
            "missing" : 25,
            "constraints" : 1250,
            "counts" : ["constraint",10,...]
          }}
          
          Bill Bell added a comment -

          I am pretty new to patching stuff. Can I get some sort of committer to
          give me feedback?

          I would also LOVE to get this in the TRUNK.

          Peter Sturge added a comment -

          +1 Yep, me too. Useful feature, this.

          Jonathan Rochkind added a comment -

          I would love to see this feature in trunk, I could really use it.

          Bill Bell added a comment -

          https://issues.apache.org/jira/secure/attachment/12459815/SOLR-236-distinctFacet.patch

            People

            • Assignee: Unassigned
            • Reporter: Bill Bell
            • Votes: 38
            • Watchers: 39