Solr / SOLR-1814

select count(distinct fieldname) in SOLR

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.5
    • Fix Version/s: 4.7
    • Labels:
      None

      Description

      I have seen questions on the mailing list about counting the distinct values of a field. We at Tailsweep want that as well, for example in our blog search.

      Example:
      "You had 1345 hits on 244 blogs"

      The 244 part is not possible in SOLR today (correct me if I am wrong). So I've written a component which does this. Attaching it.

          Activity

          Marcus Herou added a comment -

          It depends on GNU Trove; tested against v2.0.2:
          http://sourceforge.net/projects/trove4j/files/trove/archived/trove-2.0.2/trove-2.0.2.tar.gz/download

          Trove has more memory-efficient data structures, so I used those instead. Perhaps this should be broken out.

          solrconfig.xml

          <arr name="last-components">
            <str>count</str>
          </arr>

          <searchComponent name="count" class="org.apache.solr.handler.component.CountComponent" />

          Erik Hatcher added a comment -

          I'm a bit confused here, but maybe I don't quite understand what you've implemented. Doesn't faceting give you the counts you're after here? I'm assuming "blogs" in your example is a value of a "type" field or something like that. Faceting on the type field would give you that count, or doing a facet.query=type:blogs would give you just that count (for any arbitrary query).

          Ted Dunning added a comment -

          Trove is GPL.

          The Mahout project has a partial set of replacements for Trove collections in case you want to go forward with this. Our plan is to consider breaking out the collections package from Mahout at some point in case you don't want to drag along the rest of Mahout.

          Marcus Herou added a comment - edited

          Instead of having the file attached, see: http://svn.tailsweep.com/opensource/solr-contrib/trunk/src/main/java/org/apache/solr/handler/component/

          Erik:
          The facet counts are something else; they group the counts by the supplied field, do they not? Perhaps facet.query (as you pointed out) can be used; I overlooked that. I never got an answer on the mailing list, so I implemented it instead.

          Well, "blogs" is not a value; it is a field of its own.
          We call it feedId, and it is a pointer to a row in the DB.
          ...
          <field name="feedId" type="integer" indexed="true" stored="true" required="true" omitNorms="true" />
          ...

          What I have accomplished is this:

          select count(distinct feedId) from FeedItem where ...somexpression...

          In this case one doc is a FeedItem, and each belongs to a Feed (many-to-one). If this can already be accomplished in SOLR, my bad. Please tell me how.
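
A minimal sketch of what `count(distinct feedId)` means over a result set, to make the "1345 hits on 244 blogs" example concrete. This is not the attached CountComponent; the class and method names are hypothetical, and the matching documents are reduced to a plain array of their `feedId` values.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the component attached to this issue.
class DistinctFeedCounter {

    // Returns how many different feedId values occur among the matching docs.
    static int countDistinct(int[] feedIdsOfHits) {
        Set<Integer> seen = new HashSet<>();
        for (int feedId : feedIdsOfHits) {
            seen.add(feedId); // duplicates are ignored by the set
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Five matching FeedItems that belong to three different feeds.
        int[] feedIdsOfHits = {7, 7, 12, 3, 12};
        System.out.printf("You had %d hits on %d blogs%n",
                feedIdsOfHits.length, countDistinct(feedIdsOfHits));
        // prints "You had 5 hits on 3 blogs"
    }
}
```

The hit count is the number of matching documents, while the distinct count is the size of the set of field values, which is why the two numbers in the example differ.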

          Ted:
          Trove has two licenses, GPL and ASL. I can use the ASL version if it helps. I only use Trove for its efficiency; plain hashmaps can of course be used if it is a showstopper.

          Marcus Herou added a comment -

          Ted: I am an idiot about ASL and GNU Trove (I mixed it up with something else).

          I can add code which uses Trove if it is available on the classpath, or plain HashMaps if not. I think there are some good collection utils in Commons. Will look it up. Trove, however, is super.
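
One way the "Trove if available, plain collections otherwise" idea could look is sketched below. It assumes a small counter interface and a Trove-backed implementation that is compiled separately and loaded by name; `DistinctIntCounter`, `HashSetIntCounter`, `IntCounterFactory`, and the implementation class name passed to `create` are all hypothetical, not part of the attached code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical abstraction over "a set of ints whose size we want".
interface DistinctIntCounter {
    void add(int value);
    int size();
}

// Fallback that needs nothing beyond the JDK.
class HashSetIntCounter implements DistinctIntCounter {
    private final Set<Integer> values = new HashSet<>();

    @Override
    public void add(int value) {
        values.add(value);
    }

    @Override
    public int size() {
        return values.size();
    }
}

class IntCounterFactory {

    // Tries to load the preferred implementation by class name (for example a
    // separately compiled, Trove-backed class). If that class or one of its
    // dependencies is missing from the classpath, falls back to HashSetIntCounter.
    static DistinctIntCounter create(String preferredImplClassName) {
        try {
            Class<?> cls = Class.forName(preferredImplClassName);
            return (DistinctIntCounter) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            return new HashSetIntCounter();
        }
    }
}
```

Loading the optional implementation by name keeps the component compilable and usable without the Trove jar, at the cost of one reflective lookup when the counter is created.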

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Bill Bell added a comment -

          If you remove trove, we can probably include this component.

          Please provide a patch that can be applied to SOLR.

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Antoine Le Floc'h added a comment -

          Actually, could anybody confirm that there is currently no other way to count the number of distinct values in a field?

          Also, the advantage of SOLR-1814 is that we don't need to compute the actual facets as in SOLR-2242, is that right?

          Ryan McKinley added a comment -

          What about the luke request handler?
          http://wiki.apache.org/solr/LukeRequestHandler

          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notification suppressed to prevent mass-spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Shalin Shekhar Mangar added a comment -

          I think this can be done by StatsComponent in Solr 4.7 and beyond.
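
A hedged sketch of what such a StatsComponent request might look like, built as a plain query string. It assumes the comment refers to the `stats.calcdistinct` request parameter; whether that parameter is available, and under which name, should be checked against the documentation for the Solr version in use. The class name and the `feedId` field in the usage are illustrative.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; it only builds the query string and sends nothing.
class StatsDistinctQuery {

    static String buildQueryString(String query, String field) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("rows", "0");                  // only the statistics are needed
        params.put("stats", "true");              // enable the StatsComponent
        params.put("stats.field", field);         // field to compute statistics for
        params.put("stats.calcdistinct", "true"); // assumption: distinct-count switch

        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("*:*", "feedId"));
        // prints "q=*%3A*&rows=0&stats=true&stats.field=feedId&stats.calcdistinct=true"
    }
}
```

If the parameter behaves as assumed, the stats section of the response for the field should include a distinct count alongside the usual min, max, and count values.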


            People

            • Assignee:
              Unassigned
              Reporter:
              Marcus Herou
            • Votes:
              2
              Watchers:
              5
