Key: SOLR-236
Type: New Feature
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Emmanuel Keller
Votes: 49
Watchers: 61
Solr

Field collapsing

Created: 11/May/07 10:13 PM   Updated: Sunday 09:56 PM
Component/s: search
Affects Version/s: 1.3
Fix Version/s: 1.5

Time Tracking:
Not Specified

File Attachments (each licensed for inclusion in ASF works):
File name | Date | Uploaded by | Size
collapsing-patch-to-1.3.0-dieter.patch | 2009-01-29 02:40 PM | dieter grad | 26 kB
collapsing-patch-to-1.3.0-ivan.patch | 2008-11-13 04:33 PM | Iván de Prado | 24 kB
collapsing-patch-to-1.3.0-ivan_2.patch | 2008-12-10 04:31 PM | Iván de Prado | 24 kB
collapsing-patch-to-1.3.0-ivan_3.patch | 2008-12-17 12:43 PM | Iván de Prado | 24 kB
field-collapse-3.patch | 2009-07-25 12:58 PM | Martijn van Groningen | 52 kB
field-collapse-4-with-solrj.patch | 2009-08-10 08:20 PM | Martijn van Groningen | 66 kB
field-collapse-5.patch | 2009-11-29 09:55 PM | Martijn van Groningen | 251 kB
field-collapse-5.patch | 2009-11-22 10:00 PM | Martijn van Groningen | 244 kB
field-collapse-5.patch | 2009-11-15 08:55 PM | Martijn van Groningen | 239 kB
field-collapse-5.patch | 2009-11-11 06:07 AM | Martijn van Groningen | 218 kB
field-collapse-5.patch | 2009-10-27 04:28 PM | Martijn van Groningen | 218 kB
field-collapse-5.patch | 2009-10-25 10:13 PM | Martijn van Groningen | 216 kB
field-collapse-5.patch | 2009-10-14 09:22 PM | Martijn van Groningen | 144 kB
field-collapse-5.patch | 2009-09-26 03:32 PM | Martijn van Groningen | 146 kB
field-collapse-5.patch | 2009-09-14 08:15 PM | Martijn van Groningen | 136 kB
field-collapse-5.patch | 2009-09-12 03:31 PM | Martijn van Groningen | 134 kB
field-collapse-5.patch | 2009-09-12 11:22 AM | Martijn van Groningen | 134 kB
field-collapse-5.patch | 2009-09-10 11:24 PM | Martijn van Groningen | 133 kB
field-collapse-5.patch | 2009-08-24 08:59 PM | Martijn van Groningen | 122 kB
field-collapse-solr-236-2.patch | 2009-05-30 11:26 AM | Martijn van Groningen | 52 kB
field-collapse-solr-236.patch | 2009-05-29 12:52 PM | Martijn van Groningen | 49 kB
field-collapsing-extended-592129.patch | 2007-11-06 10:05 PM | Karsten Sperling | 31 kB
field_collapsing_1.1.0.patch | 2007-05-19 02:24 PM | Emmanuel Keller | 12 kB
field_collapsing_1.3.patch | 2007-10-28 09:20 PM | Emmanuel Keller | 14 kB
field_collapsing_dsteigerwald.diff | 2008-02-14 11:38 PM | Oleg Gnatovskiy | 25 kB
field_collapsing_dsteigerwald.diff | 2008-01-10 01:17 AM | Charles Hornberger | 25 kB
field_collapsing_dsteigerwald.diff | 2008-01-04 07:40 PM | Doug Steigerwald | 25 kB
quasidistributed.additional.patch | 2009-11-10 04:12 PM | Michael Gundlach | 1 kB
SOLR-236-FieldCollapsing.patch | 2007-06-27 03:40 PM | Emmanuel Keller | 18 kB
SOLR-236-FieldCollapsing.patch | 2007-06-15 06:31 PM | Ryan McKinley | 18 kB
SOLR-236-FieldCollapsing.patch | 2007-06-04 02:47 AM | Ryan McKinley | 16 kB
solr-236.patch | 2008-06-07 12:36 PM | Bojan Smid | 24 kB
SOLR-236_collapsing.patch | 2009-05-06 10:48 PM | Thomas Traeger | 25 kB
SOLR-236_collapsing.patch | 2009-03-25 08:27 AM | Dmitry Lihachev | 26 kB


Description
This patch includes a new feature called "Field collapsing".

"Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
http://www.fastsearch.com/glossary.aspx?m=48&amid=299

The implementation adds 3 new query parameters (SolrParams):
"collapse.field" to choose the field used to group results
"collapse.type" normal (default value) or adjacent
"collapse.max" to select how many continuous results are allowed before collapsing
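
As an illustration only, the three parameters could be attached to a select request as in the Python sketch below. The base URL is an assumption (the stock example server on localhost:8983), and collapse_query_url is a hypothetical helper, not part of the patch.

```python
from urllib.parse import urlencode

# Assumed base URL: the stock Solr example server.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def collapse_query_url(q, field, collapse_type="normal", collapse_max=1):
    """Build a select URL carrying the three collapse.* parameters."""
    if collapse_type not in ("normal", "adjacent"):
        raise ValueError("collapse.type must be 'normal' or 'adjacent'")
    params = {
        "q": q,
        "collapse.field": field,
        "collapse.type": collapse_type,
        "collapse.max": collapse_max,
    }
    # urlencode percent-escapes the query value for us.
    return SOLR_SELECT + "?" + urlencode(params)


url = collapse_query_url("*:*", "manu")
```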

TODO (in progress):

  • More documentation (on source code)
  • Test cases

Two patches:

  • "field_collapsing.patch" for current development version
  • "field_collapsing_1.1.0.patch" for Solr-1.1.0

P.S.: Feedback and spelling corrections are welcome



Emmanuel Keller added a comment - 11/May/07 10:14 PM
Field Collapsing

Emmanuel Keller added a comment - 11/May/07 10:48 PM
Replacing HashDocSet with BitDocSet for hasMoreResult, for better performance

Ryan McKinley added a comment - 13/May/07 01:59 AM
This looks good. Someone with better lucene chops should look at the IndexSearcher getDocListAndSet part...

A few comments/questions about the interface:

If you apply all the example docs and hit:
http://localhost:8983/solr/select/?q=*:*&collapse=true

you get 500. We should use: params.required().get( "collapse.field" ) to have a nicer error:

With:
http://localhost:8983/solr/select/?q=*:*&collapse=true&collapse.field=manu&collapse.max=1

the collapse info at the bottom says:

<lst name="collapse_counts">
<int name="has_more_results">3</int>
<int name="has_more_results">5</int>
<int name="has_more_results">9</int>
</lst>

What does that mean? How would you use it? How does it relate to the <result> docs?


Emmanuel Keller added a comment - 13/May/07 11:03 AM
My turn to miss something
You are right, we have to use params.required().get("collapse.field").

About collapse info:
<int name="has_more_results">3</int> means that the third doc of the result has been collapsed and that some consecutive results having the same field value have been removed.


Yonik Seeley added a comment - 13/May/07 02:45 PM
Thanks for looking into this Emmanuel.
It appears as if this only collapses adjacent documents, correct?

We should really try to get everyone on the same page... hash out the exact semantics of "collapsing", and the most useful interface. An efficient implementation can follow.

A good starting point might be here:


Emmanuel Keller added a comment - 13/May/07 03:51 PM
Yonik,

You are right, only adjacent documents are collapsed.
I work on a large index (2,000,000 documents) that grows every day. The first goal was to group results while preserving score ranking and achieving good performance. This "light" implementation meets our needs.
I am currently working on a second implementation taking care of the semantics.

P.S.: Congratulations for this great application.


Emmanuel Keller added a comment - 13/May/07 09:09 PM
This release conforms more closely to the semantics of "field collapsing".

Parameters are:

collapse=true // enable collapsing
collapse.field=[field] // indexed field used for collapsing
collapse.max=[integer] // Start collapsing after n documents
collapse.type=[normal|adjacent] // Default value is "normal"

  • "adjacent" collapses only consecutive documents.
  • "normal" collapses all documents having an equal collapsing field.
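
The difference between the two types can be sketched in a few lines of Python. This is an illustration of the semantics described above, not the patch's code; it assumes collapse.max means "keep at most n documents per value (normal) or per run (adjacent)", and the function and sample data are hypothetical.

```python
def collapse(docs, field, collapse_type="normal", collapse_max=1):
    """Return the docs kept after collapsing, preserving result order.

    docs is a list of dicts that is already in result order.
    """
    kept = []
    seen = {}             # normal: value -> docs seen so far for that value
    run_value = object()  # adjacent: value of the current run (sentinel first)
    run_len = 0
    for doc in docs:
        value = doc[field]
        if collapse_type == "adjacent":
            # Count only documents that directly follow each other.
            if value == run_value:
                run_len += 1
            else:
                run_value, run_len = value, 1
            count = run_len
        else:
            # Count every document with this value, wherever it appears.
            count = seen.get(value, 0) + 1
            seen[value] = count
        if count <= collapse_max:
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "manu": "A"},
    {"id": 2, "manu": "A"},
    {"id": 3, "manu": "B"},
    {"id": 4, "manu": "A"},
]
normal_ids = [d["id"] for d in collapse(docs, "manu", "normal")]
adjacent_ids = [d["id"] for d in collapse(docs, "manu", "adjacent")]
```

With this sample, "normal" drops both later "A" documents, while "adjacent" keeps document 4 because it does not directly follow another "A".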

Emmanuel Keller added a comment - 14/May/07 09:27 AM
Corrects a bug on the previous version when using a value greater than 1 as collapse.max parameter.

Otis Gospodnetic added a comment - 17/May/07 04:40 PM
Question:
Do you need collapse=true when you can detect whether collapse.field has been specified or not?

Emmanuel Keller added a comment - 18/May/07 07:38 AM
You're right. As collapse.field is a required field, we don't need more information. My first idea was to copy the behavior of facet.

Emmanuel Keller added a comment - 19/May/07 02:19 PM
The last version of the patch.
  • Results are now cached using "CollapseCache" (a new SolrCache instance added to solrconfig.xml)
  • The parameter "collapse" has been removed.

This version has been fully tested.

Feedback is welcome.


Emmanuel Keller added a comment - 19/May/07 02:24 PM
I still maintain a version for release 1.1.0 (the version we use in our production environment).

Ryan McKinley added a comment - 04/Jun/07 02:47 AM
I updated the patch so that it applies cleanly to trunk. While I was at it, I:
  • fixed a few spelling errors
  • made the "collapse.type" parameter parsing throw an error if the passed value is unknown (rather than quietly using 'normal')
  • changed the patch name to include the number. – as we update the patch, use this same name again so it is easy to tell what is the most current.

I also made a wiki page so there are direct links to interesting queries:
http://wiki.apache.org/solr/FieldCollapsing

- - - - - - -

Again, I will leave any discussion about the Lucene implementation to others more qualified and will just focus on the response interface.

Currently if you send the query:
http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=normal

you get a response that looks like:
<lst name="collapse_counts">
<int name="hard">1</int>
<int name="electronics">2</int>
<int name="memory">2</int>
<int name="monitor">1</int>
<int name="software">1</int>
</lst>

It looks like that says: for the field 'cat', there is one more result with cat=hard, 2 more results with cat=electronics, ...

How is a client supposed to know how to deal with that? "hard" is the tokenized version of "hard drive" – unless it were a 'string' field, the client would need to know how to do that – or the response needs to change.

From a client, it would be more useful to have output that looked something like:
<lst name="collapse_counts">
<str name="field">cat</str>
<lst name="doc">
<int name="SP2514N">1</int>
<int name="6H500F0">1</int>
<int name="VS1GB400C3">2</int>
<int name="VS1GB400C3">1</int>
</lst>
<lst name="count">
<int name="hard">1</int>
<int name="electronics">1</int>
<int name="memory">2</int>
<int name="monitor">1</int>
</lst>
</lst>

"field" says what field was collapsed on,
"doc" is a map of doc id -> how many more collapsed on that field
"count" is a map of 'token'-> how many more collapsed on that field

This way, the client would know what collapse counts apply to which documents without knowing about the schema.
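
To show how a client might consume the proposed layout, here is a minimal Python sketch that turns a "collapse_counts" block into two maps. The sample XML is modeled on the proposal above, not on any released response format, and parse_collapse_counts is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the proposed layout; not output from a real server.
SAMPLE = """
<lst name="collapse_counts">
  <str name="field">cat</str>
  <lst name="doc">
    <int name="SP2514N">1</int>
    <int name="6H500F0">1</int>
  </lst>
  <lst name="count">
    <int name="hard">1</int>
    <int name="memory">2</int>
  </lst>
</lst>
"""


def parse_collapse_counts(xml_text):
    """Return (field, doc id -> count, token -> count)."""
    root = ET.fromstring(xml_text)
    field = root.find("str[@name='field']").text

    def to_map(name):
        section = root.find(f"lst[@name='{name}']")
        return {e.get("name"): int(e.text) for e in section.findall("int")}

    return field, to_map("doc"), to_map("count")


field, per_doc, per_value = parse_collapse_counts(SAMPLE)
```

The client only needs the document ids it already has in the result list to look up per_doc, which is the point of the proposal: no schema knowledge is required.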

thoughts?


Emmanuel Keller added a comment - 04/Jun/07 10:09 PM
Right, it's more useful.

This new version includes the result as you expect it.

You should add the following constraint on the wiki: The collapsing field must be un-tokenized.


Ryan McKinley added a comment - 04/Jun/07 11:20 PM
I just took a look at this using the example data:
http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=normal&rows=10

<lst name="collapse_counts">
<str name="field">cat</str>
<lst name="doc">
<int>1</int>
<int name="1">2</int>
<int name="2">2</int>
<int name="4">1</int>
<int name="7">1</int>
</lst>
<lst name="count">
<int>1</int>
<int name="card">2</int>
<int name="drive">2</int>
<int name="hard">1</int>
<int name="music">1</int>
</lst>
</lst>

- - -

what is the "<int>1</int>" at the front of each response?

Perhaps the 'doc' results should be renamed 'offset' or 'index', and then have another one named 'doc' that uses the uniqueKey as the index... this would be useful to build a Map.

- - -

Also, check:
http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=adjacent&rows=50

ArrayIndexOutOfBoundsException:

- - -

> You should add the following constraint on the wiki: The collapsing field must be un-tokenized.

Anyone can edit the wiki (you just have to make an account) – it would be great if you could help keep the page accurate / useful. JIRA discussion comment trails don't work so well at that...

Re: tokenized... what about it does not work? Are the limitations any different if it is multi-valued? Is it just that if any token matches within the field it will collapse and that may or may not be what you expect?

- - -

Did you get a chance to look at the questions from the previous discussion? I just noticed Yonik posted something new there:
http://www.nabble.com/result-grouping--tf2910425.html#a10959848


Emmanuel Keller added a comment - 05/Jun/07 10:33 AM
Sorry, my last post was buggy. Here is the correct one. There is no more exception now.
About tokens, if any token matches within the field it will collapse.
When I started implementing collapsing, my need was to group documents having exactly identical field values.

I believe that faceting has identical behavior. Look at "Graphic card" as an example:
http://localhost:8983/solr/select/?q=cat:graphic%20card&version=2.2&start=0&rows=10&indent=on&facet=true&facet.field=cat

I will try to maintain the wiki page.


Yonik Seeley added a comment - 05/Jun/07 12:59 PM
I guess adjacent collapsing can make sense when one is sorting by the field that is being collapsed.

For the normal collapsing though, this patch appears to implement it by changing the sort order to the collapsing field (normally not desired). For example, if sorting by relevance and collapsing on a field, one would normally want the groups sorted by relevance (with the group relevance defined as the max score of its members).

As far as how to do paging, it makes sense to rigidly define it in terms of number of documents, regardless of how many documents are in each group. Going back to google, it always displays the first 10 documents, but a variable number of groups. That does mean that a group could be split across pages. It would actually be much simpler (IMO) to always return a fixed number of groups rather than a fixed number of documents, but I don't think this would be less useful to people. Thoughts?
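
One way to read the "group relevance is the max score of its members" idea, together with paging by a fixed number of groups, is the Python sketch below. It is only an illustration of the semantics under discussion; the function names, the "score" key and the sample hits are assumptions.

```python
def group_by_relevance(hits, field):
    """Group hits by `field`; order groups by their best member's score."""
    groups = {}
    for hit in hits:
        groups.setdefault(hit[field], []).append(hit)
    for members in groups.values():
        members.sort(key=lambda h: h["score"], reverse=True)
    # After the per-group sort, the first member carries the group's max score.
    return sorted(groups.items(), key=lambda kv: kv[1][0]["score"], reverse=True)


def page_of_groups(ranked_groups, start, rows):
    """Page by a fixed number of groups rather than a fixed number of docs."""
    return ranked_groups[start:start + rows]


hits = [
    {"id": "a", "site": "x", "score": 0.9},
    {"id": "b", "site": "y", "score": 0.8},
    {"id": "c", "site": "x", "score": 0.3},
    {"id": "d", "site": "z", "score": 0.5},
]
ranked = group_by_relevance(hits, "site")
first_page = page_of_groups(ranked, 0, 2)
```

Paging by groups never splits a group across pages, at the cost of a variable number of documents per page.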


Yonik Seeley added a comment - 05/Jun/07 01:42 PM
Will Johnson brings up other use-cases:
[...]
> it's also heavily used in
> ecommerce settings. Check out BestBuy.com/circuitcity/etc and do a
> search for some really generic word like 'cable' and notice all the
> groups of items; BB shows 3 per group, CC shows 1 per group. In each
> case it's not clear that the number of docs is really limited at all, ie
> it's more important to get back all the categories with n docs per
> category and the counts per category than it is to get back a fixed
> number of results or even categories for that matter. Also notice that
> neither of these sites allow you to page through the categorized
> results.

Some of this seems very closely related to faceted search, and much of it could be implemented that way now on the client side, but it would take multiple queries to do so.

One could also think about supporting multi-valued fields in the same manner that faceting does.


Emmanuel Keller added a comment - 05/Jun/07 02:41 PM
Adjacent collapsing is useful because it preserves the pertinence of the sort.
The sorting is not modified. I copy the current sort to do a new search.

I am currently working on handling typed fields (e.g. int).


Yonik Seeley added a comment - 05/Jun/07 02:47 PM
> The sorting is not modified. I copy the current sort to do a new search.

Perhaps if you outlined the algorithm you use, it would clear up some things.

It looks like you make a copy of the Sort and insert a primary sort on the field to be collapsed, and then process the same way as you would for the "ADJACENT" option. If the original sort was by relevance, this doesn't give you the groups sorted by relevance, right?


Yonik Seeley added a comment - 05/Jun/07 03:00 PM
Oh I see... the modified sort is just to build the filter.

The building-the-filter part is a problem though... asking for all matching docs in sorted order isn't that scalable.
If we get the interface right though, more efficient implementations can follow.
For that reason, it might be good for implementation details like "collapseCache" to be private.
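
The approach as described in these comments (sort a copy of the results by the collapse field, collapse adjacent entries to build a filter, then apply that filter to the results in their original order) might look roughly like the Python sketch below. It is a reading of the discussion, not the patch's code; all names and the sample data are hypothetical.

```python
def build_collapse_filter(hits, field, collapse_max=1):
    """Return the ids to keep, found by walking hits sorted on `field`."""
    # Primary sort on the collapse field, then the original sort (score desc).
    by_field = sorted(hits, key=lambda h: (h[field], -h["score"]))
    keep, prev, run = set(), object(), 0
    for h in by_field:
        run = run + 1 if h[field] == prev else 1
        prev = h[field]
        if run <= collapse_max:
            keep.add(h["id"])
    return keep


def collapsed_results(hits, field, collapse_max=1):
    """Apply the filter to the hits in their original (relevance) order."""
    keep = build_collapse_filter(hits, field, collapse_max)
    ranked = sorted(hits, key=lambda h: -h["score"])
    return [h for h in ranked if h["id"] in keep]


hits = [
    {"id": "a", "site": "x", "score": 0.9},
    {"id": "b", "site": "y", "score": 0.8},
    {"id": "c", "site": "x", "score": 0.3},
    {"id": "d", "site": "z", "score": 0.5},
]
result_ids = [h["id"] for h in collapsed_results(hits, "site")]
```

The scalability concern is visible here: build_collapse_filter has to sort and walk every matching document, not just the page being returned.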


Emmanuel Keller added a comment - 05/Jun/07 03:02 PM
Correct, except that the collapse result is only used as a filter on the final result to hide collapsed documents.

P.S.: Sorry if my answers are a little short, I am not perfectly fluent in English.


Ryan McKinley added a comment - 09/Jun/07 10:57 PM
Any thoughts on what the faceting semantics for field collapsing should be?

That is, should faceting apply to the collapsed results or the pre-collapsed results?

I think the pre-collapsed results.


Yonik Seeley added a comment - 10/Jun/07 12:09 AM
Yes, it seems like faceting should be for pre-collapsed.

Emmanuel Keller added a comment - 10/Jun/07 11:31 AM
Do we have to make a choice? Both behaviors are interesting.
What about a new parameter like collapse.facet=[pre|post]?

Yonik Seeley added a comment - 10/Jun/07 04:57 PM
We facet on the complete set of documents matching a query, even when the user only requests the top 10 matches. It seems we should do the same here. The set of documents is the same, the only difference is what "top" documents are returned.
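
The difference between the two faceting options can be seen on a tiny example. The Python sketch below is illustrative only; the helper names and sample documents are assumptions, and collapse_first stands in for whatever collapsing the patch performs.

```python
from collections import Counter


def facet_counts(docs, facet_field):
    """Count documents per facet value."""
    return Counter(doc[facet_field] for doc in docs)


def collapse_first(docs, field):
    """Keep only the first document seen for each value of `field`."""
    seen, kept = set(), []
    for doc in docs:
        if doc[field] not in seen:
            seen.add(doc[field])
            kept.append(doc)
    return kept


matches = [
    {"id": 1, "site": "x", "cat": "news"},
    {"id": 2, "site": "x", "cat": "news"},
    {"id": 3, "site": "y", "cat": "blog"},
]
# Faceting on the complete match set (pre-collapse).
pre = facet_counts(matches, "cat")
# Faceting on what is left after collapsing on "site" (post-collapse).
post = facet_counts(collapse_first(matches, "site"), "cat")
```

Pre-collapse counts describe everything that matched the query; post-collapse counts describe only the documents the user can actually see.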

Emmanuel Keller added a comment - 11/Jun/07 08:39 AM
New release:
  • Field collapsing added to DisMaxRequestHandler
  • Types are correctly handled on collapsed field

Ryan McKinley added a comment - 15/Jun/07 06:31 PM
No real changes. Updated to apply with trunk.
Moved the valid values for CollapseType to a 'common' package
- - -

As a side note, when you make a patch, it's easiest to deal with if the path is relative to the solr root directory.

src/java/org/apache/solr/search/SolrIndexSearcher.java
is better than:
/Users/ekeller/Documents/workspace/solr/src/java/org/apache/solr/search/SolrIndexSearcher.java


Emmanuel Keller added a comment - 27/Jun/07 03:40 PM
This new patch resolves a performance issue.
I have added timing information for monitoring performance:

<str name="time">57/5</str>

The first value is the elapsed time (in milliseconds) needed to compute the collapse information (CollapseFilter.ajacentCollapse method).
The second value is the elapsed time needed to compute the result information (CollapseFilter.getMoreResults method).

We are using Solr (with the collapsing patch) on a large index in a production environment (120GB with more than 3,000,000 documents).

P.S.: This time, the patch is relative to the solr root directory.


Nuno Leitao added a comment - 28/Jul/07 12:57 AM
It would be nice for this patch to also report on what documents were actually collapsed - for example, if the result list contained:

doc1
doc2
doc3

and doc2 and doc3 were collapsed, this would be reflected in the XML result so that one could determine that (forgive my crap visual representation):

doc1
-> doc2
-> doc3

Regards.


Brian Mertens added a comment - 07/Sep/07 04:03 PM
Imagine a case where a Solr database contains news stories from many newspapers and some wire services.

A single wire story will typically be picked up and reprinted in many different papers, ranging from national papers like the NYTimes, to small town papers. My database will have all of them, and possibly also the original from the wire service. Each paper will choose their own headline, and will edit the story differently for length to fill a hole on the printed page, so they cannot be trivially detected as duplicates, but to my users, they basically are.

I need to detect and group together these "duplicates" when displaying search results.

So let's say every story has had an integer hash value calculated of the first X words of the lead paragraph, and that value is indexed and stored (e.g. "similarity_hash"), as a way to detect duplicate stories.
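
A hash of this kind could be computed at index time along the lines of the Python sketch below. The choice of CRC32, the lower-casing and the default of 10 words are all assumptions made for illustration; the comment above does not specify them.

```python
import re
import zlib


def similarity_hash(lead_paragraph, num_words=10):
    """Integer hash of the first num_words words, ignoring case and punctuation."""
    words = re.findall(r"\w+", lead_paragraph.lower())[:num_words]
    # CRC32 is an arbitrary choice here; any stable integer hash would do.
    return zlib.crc32(" ".join(words).encode("utf-8"))


wire = "A dog bit a man in the town square on Tuesday, police said."
reprint = "A dog bit a man in the town square on Tuesday. Witnesses disagreed."
hash_wire = similarity_hash(wire)
hash_reprint = similarity_hash(reprint)
```

Both sample stories share their first ten words, so they receive the same hash even though the text diverges afterwards.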

I would want to Field Collapse my results on that hash value, so that all occurrences of the same story are lumped together.

Also, my users would much prefer the most "authoritative" version of the story to be displayed as the primary result, with a count and link to the collapsed results. Authoritativeness could be coded as simply as 1) Wire Service, 2) National Paper, 3) Regional Paper, 4) Small Town Paper, which could be indexed and stored as an integer "authority". (For finer-grained authority we could store each newspaper's circulation numbers.)

Then I could display to users:
"Dog Bites Man"
New York Times, link to see 77 other duplicates

So, finally getting to the point, would it be possible to make this feature work such that it field collapses results on one field ("similarity_hash") and selects the one to return based on another field ("authority" or "circulation")? (While allowing the results to be sorted by a third field, e.g. date or relevance.)

Perhaps by a new parameter?
collapse.authority=[field] // indexed field used for selecting which result from collapsed group to return, default being... ?
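
The requested behavior can be sketched in Python as follows. Note that collapse.authority is only a parameter proposed in this comment, and the function, field names and sample data below are hypothetical; lower authority values are assumed to mean "more authoritative", following the 1 to 4 coding above.

```python
def collapse_with_authority(hits, collapse_field, authority_field, sort_key):
    """Collapse on one field, pick each group's representative by another,
    and order the representatives with a third criterion."""
    groups = {}
    for hit in hits:
        groups.setdefault(hit[collapse_field], []).append(hit)
    results = []
    for members in groups.values():
        # Lowest authority value wins (1 = wire service in the coding above).
        best = min(members, key=lambda h: h[authority_field])
        results.append({"doc": best, "duplicates": len(members) - 1})
    results.sort(key=lambda r: sort_key(r["doc"]))
    return results


hits = [
    {"id": "nyt", "similarity_hash": 42, "authority": 2, "date": 20070906},
    {"id": "wire", "similarity_hash": 42, "authority": 1, "date": 20070905},
    {"id": "town", "similarity_hash": 42, "authority": 4, "date": 20070907},
    {"id": "other", "similarity_hash": 7, "authority": 3, "date": 20070907},
]
# Newest first, using the representative's date.
results = collapse_with_authority(
    hits, "similarity_hash", "authority", sort_key=lambda d: -d["date"]
)
```

Each result carries a duplicates count, which is what a "link to see N other duplicates" display would need.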

If this sounds familiar, it is somewhat similar to what Google News is doing:
http://www.pcworld.com/article/id,136680/article.html

Final question: Do you think Field Collapse could work nicely with SOLR-303 Federated Search, or is that a bridge too far?


Dima Brodsky added a comment - 18/Oct/07 06:16 PM
Hi,

I am new to the list and to Solr, so I apologize in advance if I say something silly.

I have been playing with the field collapse patch, and I have a couple of questions and have noticed a couple of issues. What is the intended use / audience for the field collapsing patch? One of the issues I see is that the sort order is changed during normal field collapsing, and this causes problems if I want the results ordered by relevancy. Another issue is that the backfilling of the results, if there are not enough, is done from the deduped results rather than by getting more results from the index. Is this by design?

Thanks!!
ttyl
Dima


Tracy Flynn added a comment - 28/Oct/07 10:29 AM
Hi,

I am new to Solr, and this thread in particular, so please excuse any questions that seem obvious.

I am investigating converting an existing FAST installation to Solr. I've been able to see how to convert all my queries to Solr/Lucene with little or no trouble, with the exception of field collapsing. I've actually implemented a demo of our main search with a Ruby/Rails front end in a few hours. Nice work everyone!

I have found this thread, looked at the patch for field collapsing and have a couple of questions.

I've looked at the Subversion tree and

  • Don't find a 1.3 branch
  • Don't find the patch code in the trunk

Is there a 'private' sandbox Solr developers work in that's not visible to the public (i.e. me)?

If not, what revision of the trunk does the patch apply to?

Any help would be appreciated. If I can get a demo that includes field collapsing, my management may be persuaded to let me move our main search to Solr.

Regards,

Tracy


Ryan McKinley added a comment - 28/Oct/07 05:55 PM
Hi Tracy-

There has not been much movement on this while we get SOLR-281 sorted (I hope this happens soon) – once that is in, there will hopefully be an updated patch on the 1.3 branch that will be posted here.

"1.3" is not a branch yet – it is the trunk revision that most patches work with. Only when it becomes an official release, will it actually get called 1.3 in the repository.

If you need to show field collapsing soon, I think your best bet (I have not tried it) is to apply 'field_collapsing_1.1.0.patch' to the 1.1.0 branch (http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.1.0/). But if you can wait a few weeks, it will hopefully be available in trunk (or easily patchable from trunk).

ryan


Tracy Flynn added a comment - 28/Oct/07 07:17 PM
Ryan,

Thanks for the quick reply and clarification. I'll follow your suggestion as to where to apply and try the patch.

I'll be eagerly waiting for the updated trunk.

Regards,

Tracy


Emmanuel Keller added a comment - 28/Oct/07 08:54 PM - edited
Here is the patch for solr 1.3 rev 589395.

I made some performance improvements. No more cache. I use BitDocSet or HashDocSet depending on the solrconfig.hashdocsetmaxsize variable.

Regards,
Emmanuel Keller.


Yonik Seeley added a comment - 28/Oct/07 09:11 PM
It looks like the latest patch only includes changed files and not new ones (like CollapseFilter?)

Emmanuel Keller added a comment - 28/Oct/07 09:20 PM
Thank you, Yonik!
Here is the complete version.

P.S.: It's time to go to bed in Europe ...

Emmanuel.


Karsten Sperling added a comment - 02/Nov/07 02:12 AM
I've just looked at the implementation of this patch again – it ends up calling SolrIndexSearcher.getDocListC() with a DocSet derived from the CollapseFilter as the 'filter' parameter. The comment on that method says that only filter or filterList should be provided, but not both. However with the field collapsing patch both WILL be provided if filter queries are passed to the dismax request handler by the client. Can anybody shed any light on what the implications of this are?

Karsten Sperling added a comment - 06/Nov/07 10:06 PM
I've done some work on the field collapsing patch, made some additions and changes, and am posting this patch (against revision 592129) here for discussion.
  • Added a collapse.facet = before|after parameter to control if faceting happens before or after collapsing.
  • Changed collapse.max to collapse.threshold – this value controls after which number of collapsible hits collapsing actually kicks in (collapse.max is still supported as an alias).
  • Added a collapse.maxdocs parameter that limits the number of documents that CollapseFilter will process to create the filter DocSet. The intention of this is to be able to limit the time collapsing will take for very large result sets (obviously at the expense of accurate collapsing in those cases).
  • Inverted the logic of the filter DocSet created by CollapseFilter to contain the documents that are to be collapsed instead of the ones that are to be kept. Without this collapse.maxdocs doesn't work.
  • Added collapse.info.doc and collapse.info.count parameters to provide more control over what gets returned in the collapse_counts extra results.
  • Made a minimal change to SolrIndexSearcher.getDocListC() to support passing both the filter and filterList parameters. In most cases this was already handled anyway.
  • Did some general refactoring and added comments and a test case.
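
Why the inverted filter matters for collapse.maxdocs can be seen in a small Python sketch. This is an illustration of the idea only, not the patch's code; it assumes collapse.threshold means "documents allowed per value before collapsing starts", and the function names and sample data are hypothetical.

```python
def docs_to_collapse(hits, field, threshold=1, maxdocs=None):
    """Return the ids of docs to hide; only the first maxdocs hits are examined."""
    examined = hits if maxdocs is None else hits[:maxdocs]
    seen, removed = {}, set()
    for hit in examined:
        count = seen.get(hit[field], 0) + 1
        seen[hit[field]] = count
        if count > threshold:
            removed.add(hit["id"])
    return removed


def apply_filter(hits, removed):
    """Hide the collapsed docs; unexamined docs pass through untouched."""
    return [hit for hit in hits if hit["id"] not in removed]


hits = [
    {"id": 1, "site": "x"},
    {"id": 2, "site": "x"},
    {"id": 3, "site": "y"},
    {"id": 4, "site": "x"},
    {"id": 5, "site": "x"},
]
full = apply_filter(hits, docs_to_collapse(hits, "site"))
limited = apply_filter(hits, docs_to_collapse(hits, "site", maxdocs=3))
```

If the filter listed the documents to keep instead, everything beyond the first maxdocs hits would silently disappear; with the inverted logic those documents stay in the results, merely uncollapsed.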

If somebody with deeper Solr/Lucene knowledge could review these changes it would be much appreciated.

Karsten


Doug Steigerwald added a comment - 04/Jan/08 07:40 PM - edited
I've created a CollapseComponent for field collapsing. Everything seems to work fine with it. Only issue I'm having is I cannot use the query component because when it isn't commented out, the non-field collapsed results are displayed and I can't figure out how to remove them. Someone might be able to figure that part out.

http://localhost:8983/solr/search?q=id:[0%20TO%20*]&collapse=true&collapse.field=inStock&collapse.type=normal&collapse.threshold=0

Here's the config I'm using:

<searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />
<requestHandler name="/search" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<arr name="components">
<!-- <str>query</str> -->
<str>facet</str>
<!-- <str>mlt</str> -->
<!-- <str>highlight</str> -->
<!-- <str>debug</str> -->
<str>collapse</str>
</arr>
</requestHandler>


Charles Hornberger added a comment - 07/Jan/08 07:02 PM - edited

UPDATE: Doug Steigerwald's patch (field_collapsing_dsteigerwald.diff) applies cleanly to trunk

I'm having trouble applying field_collapsing_1.3.patch to the head of trunk.

charlie@macbuntu:~/solr/src/java$ patch -p0 < /home/charlie/downloads/field_collapsing_1.3.patch 
patching file org/apache/solr/search/CollapseFilter.java
patching file org/apache/solr/search/SolrIndexSearcher.java
Hunk #1 succeeded at 694 (offset -8 lines).
Hunk #2 succeeded at 1252 (offset -1 lines).
patching file org/apache/solr/common/params/CollapseParams.java
patching file org/apache/solr/handler/StandardRequestHandler.java
Hunk #1 FAILED at 33.
Hunk #2 FAILED at 90.
Hunk #3 FAILED at 117.
3 out of 3 hunks FAILED -- saving rejects to file org/apache/solr/handler/StandardRequestHandler.java.rej
patching file org/apache/solr/handler/DisMaxRequestHandler.java
Hunk #1 FAILED at 31.
Hunk #2 FAILED at 40.
Hunk #3 FAILED at 311.
Hunk #4 FAILED at 339.
4 out of 4 hunks FAILED -- saving rejects to file org/apache/solr/handler/DisMaxRequestHandler.java.rej

I'm guessing that maybe the field collapsing patch needs to be updated for the SearchHandler refactoring that was done as part of SOLR-281? If so, I'll take a whack at migrating the changes to SearchHandler.java, and see if I can produce a better patch.


Ryan McKinley added a comment - 07/Jan/08 10:07 PM
Charles - try applying Doug Steigerwald's latest patch: field_collapsing_dsteigerwald.diff

I have not tested it, but it does apply without errors


Charles Hornberger added a comment - 10/Jan/08 01:11 AM
Doug – I just started looking into field collapsing the other day, but from glancing at the code in QueryComponent.java and CollapseComponent.java, it seems like perhaps you're not supposed to be using both components – after all, their prepare() methods are identical, and their process() methods both execute the user's search and shove the resulting DocList into the "response" entry of the response object's internal storage Map. (The QueryComponent additionally stores the DocListAndSet in the ResponseBuilder object via builder.setResults() – I'm not sure why this is – and prefetches documents if the result set is small enough.) My guess is that if you want to enable collapsing, you should use the CollapseComponent; if you want to disable it, use the QueryComponent. Maybe someone who understands the design of the search handling components better than I do can confirm this or correct my misunderstanding(s) ...

Charles Hornberger added a comment - 10/Jan/08 01:17 AM
Attaching a new copy of Doug Steigerwald's patch that omits the System.out.println() call in CollapseComponent.java.

Doug Steigerwald added a comment - 10/Jan/08 01:24 PM
I copied what was in the QueryComponent.prepare() method because I had to disable the query component due to the extra results I was getting. Initially I had CollapseComponent.prepare() empty, but then the results from the query component were returned in addition to the collapse component results (two 'response' entries in the results).

Easy solution for me was to copy the prepare from QueryComponent and disable the query component in the request handler. There may be another way, but I was unable to figure it out.


Oleg Gnatovskiy added a comment - 30/Jan/08 01:16 AM
Hello, I am new to Solr, so forgive me if what I say doesn't make sense... None of the patches for 1.3 work any more, since the file org.apache.solr.handler.SearchHandler has been removed from the nightly builds. Will someone write a new patch that works with the current nightly builds? If not, could we get a copy of an old nightly build somewhere? Thanks a lot.

Charles Hornberger added a comment - 30/Jan/08 01:33 AM
It seems like SearchHandler was simply moved down into the org.apache.solr.handler.components package as part of r610426 - http://svn.apache.org/viewvc?view=rev&revision=610426

You should be able to modify the import statements in field_collapsing_dsteigerwald.diff to make it work, no?


Oleg Gnatovskiy added a comment - 30/Jan/08 02:28 AM
Oh, I didn't notice. I will give it a try tomorrow morning. Thank you.

Oleg Gnatovskiy added a comment - 30/Jan/08 10:47 PM
That works, thanks

Charles Hornberger added a comment - 01/Feb/08 09:59 PM
NegatedDocSet is throwing "Unsupported Operation" exceptions:

org.apache.solr.common.SolrException:Unsupported Operation
at org.apache.solr.search.NegatedDocSet.iterator(NegatedDocSet.java:77)
at org.apache.solr.search.DocSetBase.getBits(DocSet.java:183)
at org.apache.solr.search.NegatedDocSet.getBits(NegatedDocSet.java:27)
at org.apache.solr.search.DocSetBase.intersection(DocSet.java:199)
at org.apache.solr.search.BitDocSet.intersection(BitDocSet.java:30)
at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1109)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:811)
at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1258)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:103)
at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:155)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:275)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:595)

Not quite sure what search is triggering this path thru the code, but it is not happening on every request; just some ... am firing up the debugger now to see what I can learn, but thought I'd post this anyway to see if anyone has any tips.


Charles Hornberger added a comment - 01/Feb/08 10:55 PM - edited
Ah ... got the beginnings of a diagnosis. The problem appears when the DocSet qDocSet returned by DocSetHitCollector.getDocSet() (called inside getDocListAndSetNC(), at org.apache.solr.search.SolrIndexSearcher:1101 in trunk, or 1108 with the field_collapsing patch applied) is a BitDocSet, and not when it's a HashDocSet. As the stack trace above shows, calling intersection() on a BitDocSet object invokes the superclass' DocSetBase.intersection() method, which invokes a call chain that blows up when it hits the iterator() method of the NegatedDocSet passed in as the filter parameter to getDocListAndSetNC(); NegatedDocSet.iterator() blows up by design:
public DocIterator iterator() {
    throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Unsupported Operation");
}

I see that DocSetBase.intersection(DocSet other) has special-casing logic for dealing with other parameters that are instances of HashDocSet; does it also need special casing logic for dealing with other parameters that are NegatedDocSets? Or should NegatedDocSet really implement iterator()? Or something else entirely?


Charles Hornberger added a comment - 02/Feb/08 02:12 AM
Here's the simplest change I could think of to make DocSetBase subclasses that don't override intersection() (which just means BitDocSet at the moment) stop choking when their intersection() gets called with a NegatedDocSet as the other parameter; it's probably horribly stupid. Also, there should be a test.
Index: src/java/org/apache/solr/search/DocSet.java
===================================================================
--- src/java/org/apache/solr/search/DocSet.java (revision 617738)
+++ src/java/org/apache/solr/search/DocSet.java (working copy)
@@ -193,7 +193,18 @@
     if (other instanceof HashDocSet) {
       return other.intersection(this);
     }
-
+    // you can't call getBits() on a NegatedDocSet, because
+    // getBits() calls iterator(), and iterator() isn't
+    // supported by NegatedDocSet
+    if (other instanceof NegatedDocSet) {
+        BitDocSet newdocs = new BitDocSet();
+        for (DocIterator iter = iterator(); iter.hasNext();) {
+          int next = iter.nextDoc();
+          if (other.exists(next))
+           newdocs.add(next);
+        }
+        return newdocs;
+    }
     // Default... handle with bitsets.
     OpenBitSet newbits = (OpenBitSet)(this.getBits().clone());
     newbits.and(other.getBits());

Comments?


Yonik Seeley added a comment - 02/Feb/08 02:31 AM
I haven't been following this, so I don't know why there is a need for a NegatedDocSet (or if introducing it is the best solution), but it looks like you have two cases to handle: one negative set or two negative sets.
If you have a and -b, then return a.andNot(b)
if both a and b are negative (-a.intersection(-b)) then return NegatedDocSet(a.union(b)) // per De Morgan, -a&-b == -(a|b)

That's only for intersection() of course.
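
For anyone following along, here is a minimal sketch of those two rules, written against java.util.BitSet rather than Solr's DocSet classes. SignedDocSet and its bits() helper are made-up names for illustration only; the point is just that a negated operand never has to be iterated or materialized.

```java
import java.util.BitSet;

// Hypothetical stand-in for a possibly negated doc set; not one of Solr's classes.
final class SignedDocSet {
    final BitSet docs;     // the finite set of document ids that is stored
    final boolean negated; // true means "every document NOT in docs"

    SignedDocSet(BitSet docs, boolean negated) {
        this.docs = docs;
        this.negated = negated;
    }

    static SignedDocSet intersection(SignedDocSet a, SignedDocSet b) {
        if (a.negated && !b.negated) {
            return intersection(b, a);           // put the positive set first
        }
        BitSet result = (BitSet) a.docs.clone(); // never mutate the inputs
        if (a.negated) {                         // here b is negated as well
            result.or(b.docs);                   // -a & -b == -(a | b)
            return new SignedDocSet(result, true);
        }
        if (b.negated) {
            result.andNot(b.docs);               // a & -b == a.andNot(b)
        } else {
            result.and(b.docs);                  // plain a & b
        }
        return new SignedDocSet(result, false);
    }

    // Small helper to build a BitSet from a list of ids.
    static BitSet bits(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) set.set(id);
        return set;
    }

    public static void main(String[] args) {
        SignedDocSet a = new SignedDocSet(bits(1, 2, 3, 4), false);
        SignedDocSet notB = new SignedDocSet(bits(2, 4, 6), true);
        System.out.println(intersection(a, notB).docs); // prints {1, 3}
    }
}
```

Only the case where both operands are negative yields a negative result; every other case comes back as an ordinary finite set.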


Karsten Sperling added a comment - 03/Feb/08 09:31 AM
NegatedDocSet got introduced because the filter logic expects to use the intersection operation to apply a number of filters to a result. Introducing a negated docset was much easier than supporting both intersection as well as and-not type filters.

NegatedDocSet does not support iteration because the negation of a finite set is (at least theoretically) infinite. Even though it would in practice be possible to limit the negated set via the known maximum document id, this would probably not be very efficient. However, it is simply not necessary to ever iterate over the elements of a NegatedDocSet, because we know that the end-result of all DocSet operations is going to be a finite set of results, not an infinite one. A NegatedDocSet will only ever be used to "subtract" from a finite DocSet. As Yonik has pointed out, operations on a NegatedDocSet can be rewritten as (different) operations on the set being negated. The operation methods inside NegatedDocSet do this.

The reason the bug occurs is because of the naive way the binary set operation calls are dispatched: DocSet clients simply call e.g. set1.intersection(set2), arbitrarily leaving the choice of implementation to the logic defined by the class of set1. Currently, BitDocSet does not know about NegatedDocSet, and hence doesn't perform the necessary rewriting or delegation to NegatedDocSet.

However, instead of requiring each and every DocSet subclass to know about all other ones (and in the absence of language support for multiple dispatch), I think it would be better to centralize this knowledge in a single class DocSetOp with static methods that select the appropriate implementation for an operation based on the type of both parameters. The client code could be changed to call DocSetOp.intersection(a, b) instead of a.intersection(b), but this would involve changing the DocSet interface. A backwards compatible solution would be to simply have a final DocSetBase.intersection() delegate to DocSetOp.intersection().
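
A rough sketch of what such a central dispatcher could look like. Every class name below (SketchDocSetOp, SketchDocSet, FiniteDocSet, NegatedSketchDocSet) is hypothetical and only mirrors the roles described above; Solr's real DocSet interface is larger and this ignores BitDocSet/HashDocSet specifics.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical central dispatcher: picks the algorithm from the types of BOTH operands.
final class SketchDocSetOp {
    private SketchDocSetOp() {}

    static SketchDocSet intersection(SketchDocSet a, SketchDocSet b) {
        boolean aNegated = a instanceof NegatedSketchDocSet;
        boolean bNegated = b instanceof NegatedSketchDocSet;
        if (aNegated && bNegated) {
            // -a & -b == -(a | b): the result stays negated, nothing is iterated lazily
            FiniteDocSet union = new FiniteDocSet();
            union.docs.addAll(((NegatedSketchDocSet) a).source.docs);
            union.docs.addAll(((NegatedSketchDocSet) b).source.docs);
            return new NegatedSketchDocSet(union);
        }
        if (aNegated) {
            return intersection(b, a); // always iterate the finite operand
        }
        FiniteDocSet result = new FiniteDocSet();
        for (int doc : ((FiniteDocSet) a).docs) {
            if (b.exists(doc)) result.docs.add(doc); // exists() is safe on negated sets
        }
        return result;
    }

    public static void main(String[] args) {
        SketchDocSet hits = new FiniteDocSet(1, 2, 3, 4);
        SketchDocSet filter = new NegatedSketchDocSet(new FiniteDocSet(2, 4));
        FiniteDocSet kept = (FiniteDocSet) hits.intersection(filter);
        System.out.println(kept.docs); // prints [1, 3]
    }
}

abstract class SketchDocSet {
    abstract boolean exists(int doc);

    // Backwards-compatible entry point: callers keep writing a.intersection(b),
    // but subclasses cannot override it, so the decision is made in one place.
    final SketchDocSet intersection(SketchDocSet other) {
        return SketchDocSetOp.intersection(this, other);
    }
}

final class FiniteDocSet extends SketchDocSet {
    final Set<Integer> docs = new TreeSet<>();

    FiniteDocSet(int... ids) {
        for (int id : ids) docs.add(id);
    }

    @Override
    boolean exists(int doc) {
        return docs.contains(doc);
    }
}

final class NegatedSketchDocSet extends SketchDocSet {
    final FiniteDocSet source;

    NegatedSketchDocSet(FiniteDocSet source) {
        this.source = source;
    }

    @Override
    boolean exists(int doc) {
        return !source.exists(doc);
    }
}
```

With this shape, adding a new set type means touching the dispatcher (and its tests) once, instead of teaching every existing subclass about the newcomer.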


Charles Hornberger added a comment - 03/Feb/08 05:39 PM

As Yonik has pointed out, operations on a NegatedDocSet can be rewritten as (different) operations on the set being negated. The operation methods inside NegatedDocSet do this.

Right. I realized, sheepishly, after I posted the first suggested patch that it'd be much simpler to just mimic the first if-clause in DocSet.intersection():

if (other instanceof NegatedDocSet) {
    return other.intersection(this);
}

Charles Hornberger added a comment - 04/Feb/08 06:59 PM

However, instead of requiring each and every DocSet subclass to know about all other ones (and in the absence of language support for multiple dispatch), I think it would be better to centralize this knowledge in a single class DocSetOp with static methods that selects the appropriate implementation for an operation based on the type of both parameters.

+1 for this ... whether or not NegatedDocSet is part of the final implementation of this feature. FWIW, I just noticed that there's another bug lurking in BitDocSet.andNot(), which will fail if a NegatedDocSet is passed in. It seems to me that it might be easier – at least for me – to read/write/extend a test suite that exercised all the paths thru DocSetOp, than to write a set of tests that exercised all the paths thru DocSetBase and its subclasses.

Also, I think that maybe there's a clear distinction to be made between intrinsic operations on a set (add(), exists(), et al.) and ones that involve another set (intersection(), union(), andNot()). Not sure it's a useful one, but it makes sense to me. I don't know, though, whether it makes sense to go further than that and say – as the current implementation of NegatedDocSet implies – that there are some set operations (iterator() and size()) that are in fact optional.

Off the top of my head: Would it be simpler to just add a filterType flag to the getDocList*() family of methods in SolrSearchInterface to cause it to call a.andNot(b) rather than a.intersection(b) when applying b as a filter? (I'm really completely ignorant – or nearly completely – of how the search code works, so feel free not to dignify this with a response if it's a useless idea ... )


Oleg Gnatovskiy added a comment - 08/Feb/08 12:15 AM - edited
Hello everyone. I am planning to implement chain collapsing on a high traffic production environment, so I'd like to use a stable version of Solr. It doesn't seem like you have a chain collapse patch for Solr 1.2, so I tried the Solr 1.1 patch. It seems to work fine at collapsing, but how do I get a count for the documents other than the one being displayed?

As a result I see:

<lst name="collapse_counts">
<int name="Restaurant">2414</int>
<int name="Bar/Club">9</int>
<int name="Directory & Services">37</int>
</lst>

Does that mean that there are 2414 more Restaurants, 9 more Bars and 37 more Directory & Services? If so, then that's great.

However, when I collapse on some fields I get an empty collapse_counts list. It could be that those fields have a large number of different values that it collapses on. Is there a limit to the number of values that collapse_counts displays?

Thanks in advance for any help you can provide!


Oleg Gnatovskiy added a comment - 08/Feb/08 08:18 PM
Also, is field collapse going to be a part of the upcoming Solr 1.3 release, or will we need to run a patch on it?

Oleg Gnatovskiy added a comment - 08/Feb/08 10:05 PM
OK, I think I have the first issue figured out. If the current result set (let's say the first 10 rows) doesn't have the field that we are collapsing on, the counts don't show up. Is that correct?

Oleg Gnatovskiy added a comment - 14/Feb/08 11:38 PM
Latest patch file fixes an issue where facet searching would throw a NullPointerException when using the fieldCollapse requestHandler. Also, updated the import path for SearchHandler. Thank you Dave for these tips!

Oleg Gnatovskiy added a comment - 15/Feb/08 05:25 PM
That thanks should be to Charles, not Dave. Sorry about that!

Nikolai Kordulla added a comment - 26/Feb/08 09:17 PM
It would be a good thing to apply this CollapseComponent to the MLT results.

Oleg Gnatovskiy added a comment - 03/Mar/08 09:31 PM
Are there any plans to add collapse controls to SolrJ?

Oleg Gnatovskiy added a comment - 09/Apr/08 01:29 AM
None of the patches work on the current nightly build anymore. Could anyone help? Thanks

Bojan Smid added a comment - 25/May/08 07:42 AM
I will try to bring this patch up to date. Currently I see two main problems:

1) The patch applies to trunk, but it doesn't compile. The problem occurs mainly because of changes in Search Components (for instance, some method signatures which CollapseComponent implements were changed). I have this fixed locally (more or less), but I have to test it before posting new version of patch.

2) It seems that CollapseComponent can't be chained with QueryComponent, only used instead of it. CollapseComponent basically copies QueryComponent's querying logic and adds some of its own. I guess this isn't the right way to go. CollapseComponent should contain only collapsing logic and should be chainable with other components. Can anyone confirm if I'm right here? Of course, there might be some fundamental reason why CollapseComponent had to be implemented this way.

Does anyone else see any other issues with this component?


Oleg Gnatovskiy added a comment - 25/May/08 03:21 PM
Hey Bojan. I actually hacked CollapseComponent quite a bit in order to get it to work with Distributed Search, but I am not going to upload it, since it's horribly buggy. Do you think that's a feature that can be added?

Bojan Smid added a comment - 25/May/08 08:21 PM
Hi Oleg. I'll look into this also. In case you have any working code, you can mail it to me, and I'll see what can be reused.

Otis Gospodnetic added a comment - 27/May/08 09:10 PM
It's amazing this issue/patch has so many votes and watchers, yet it's stuck...
Ryan, Yonik, Emmanuel, Doug, Charles, Karsten

I think Bojan is onto something here. Isn't the ability to chain QueryComponent (QC) and CollapseComponent (CC) essential?

I'm looking at field_collapsing_dsteigerwald.diff and see that the CC.prepare method there is identical to the QC.prepare method, while process methods are different. Could we solve this particular copy/paste situation by making CC extend QC and simply override the process method?

As for chaining, could CC take the same approach as the MLT Component, which simply does its thing to find "more like this" docs and stuffs them into the "moreLikeThis" element in the response?

I could be misunderstanding something, so please correct me if I'm wrong. I'd love to get this one in 1.3 – it's been waiting in JIRA for too long.


Bojan Smid added a comment - 07/Jun/08 12:36 PM
I updated the patch so that it can be compiled on Solr trunk. Also, since CollapseComponent essentially copied QueryComponent's prepare method (and it seems that it is supposed to be used instead of it), I made it extend QueryComponent (with collapsing-specific process() method, and prepare() method inherited from super class).
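
For readers who haven't opened the patch, the shape of that change is roughly the following. This is a self-contained miniature with made-up classes (Request, QueryStep, CollapseStep); it is not the patch's code and does not use Solr's real SearchComponent or ResponseBuilder API. It only shows prepare() being inherited while process() is replaced.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical miniature of the component pattern; none of these are Solr classes.
class CollapseInheritanceSketch {
    /** Stand-in for the per-request state that components share. */
    static class Request {
        final List<String[]> matches = new ArrayList<>(); // each entry: {docId, fieldValue}
        final Map<String, Object> response = new LinkedHashMap<>();
        boolean prepared;
    }

    static class QueryStep {
        void prepare(Request rb) {
            rb.prepared = true; // shared setup, written once
        }

        void process(Request rb) {
            List<String> ids = new ArrayList<>();
            for (String[] match : rb.matches) ids.add(match[0]);
            rb.response.put("response", ids);
        }
    }

    /** Inherits prepare() unchanged and replaces only process(). */
    static class CollapseStep extends QueryStep {
        @Override
        void process(Request rb) {
            Set<String> seenValues = new HashSet<>();
            List<String> ids = new ArrayList<>();
            for (String[] match : rb.matches) {
                if (seenValues.add(match[1])) ids.add(match[0]); // keep first doc per value
            }
            rb.response.put("response", ids);
        }
    }

    static Request run(QueryStep step) {
        Request rb = new Request();
        rb.matches.add(new String[] {"doc1", "Site1"});
        rb.matches.add(new String[] {"doc2", "Site1"});
        rb.matches.add(new String[] {"doc3", "Site2"});
        step.prepare(rb);
        step.process(rb);
        return rb;
    }

    public static void main(String[] args) {
        System.out.println(run(new QueryStep()).response);    // {response=[doc1, doc2, doc3]}
        System.out.println(run(new CollapseStep()).response); // {response=[doc1, doc3]}
    }
}
```

Note that this still means one component is used instead of the other, not chained after it; both write the same "response" entry.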

Oleg Gnatovskiy added a comment - 09/Jun/08 04:59 PM
I'd like to request some distributed search functionality for this feature as well.

Otis Gospodnetic added a comment - 10/Jun/08 08:49 PM
There is so little interest in this patch/functionality now that I doubt it will get distributed search support in time for 1.3. I would like to commit Bojan's patch for 1.3, though.

Yonik Seeley added a comment - 17/Jun/08 02:23 PM
Since this is adding new interface/API, it would be very nice if one could easily review it. It's very important that the interface and the exact semantics are nailed down IMO (there seem to be a lot of options).
Is http://wiki.apache.org/solr/FieldCollapsing up-to-date?

There don't seem to be any tests either.


JList added a comment - 21/Jun/08 01:16 AM
Although field collapsing worked fine in my brief testing, when I put it to work with more documents, I got exceptions. It seems to have something to do with the queries (or documents, since different queries return different documents). With some queries, this exception does not happen.

If I remove the collapse.* parameters, the error does not happen. Any idea why this is happening? Thanks.

HTTP ERROR: 500
Unsupported Operation

org.apache.solr.common.SolrException: Unsupported Operation
at org.apache.solr.search.NegatedDocSet.iterator(NegatedDocSet.java:77)
at org.apache.solr.search.DocSetBase.getBits(DocSet.java:183)
at org.apache.solr.search.NegatedDocSet.getBits(NegatedDocSet.java:27)
at org.apache.solr.search.DocSetBase.intersection(DocSet.java:199)
at org.apache.solr.search.BitDocSet.intersection(BitDocSet.java:30)
at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1109)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:811)
at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1282)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:57)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


Bojan Smid added a comment - 21/Jun/08 07:28 AM
You can check discussion about this same problem in the posts above (starting with 1st Feb 2008). It seems like a rather complex issue which could require some serious refactoring of collapsing code.

JList added a comment - 21/Jun/08 04:35 PM
Sorry about the dup. I obviously didn't check the comments before I posted the bug. Anyway, it's still there, it's still happening

JList added a comment - 21/Jun/08 09:08 PM
Not sure if it's related to the query string or the documents that the query hits. If the latter, it would be trickier to reproduce.
Anyway I tried a few English words and the error didn't happen. So far I was only able to reproduce it with CJK (Simplified Chinese to be exact) queries.

This is an example query that triggers this problem (in UTF-8):
'\xe5\x9c\xb0\xe9\x9c\x87'

The query string:
http://localhost:8983/solr/select/?q=%E5%9C%B0%E9%9C%87&version=2.2&start=0&rows=10&indent=on&collapse.field=domain


Matthias Epheser added a comment - 11/Aug/08 02:59 PM
I just tried to apply the last patch and ran into 2 issues:

First:

The new getDocListAndSet(Query query, List<Query>..) method in SolrIndexSearcher calls the getDocListC(..) method using the old signature. I changed the call to the new signature and it worked very well:

DocListAndSet ret = new DocListAndSet();
QueryResult queryResult = new QueryResult();
queryResult.setDocListAndSet(ret);
queryResult.setPartialResults(false);
QueryCommand queryCommand = new QueryCommand();
queryCommand.setQuery(query);
queryCommand.setFilterList(filterList);
queryCommand.setFilter(docSet);
queryCommand.setSort(lsort);
queryCommand.setOffset(offset);
queryCommand.setLen(len);
queryCommand.setFlags(flags |= GET_DOCSET);
getDocListC(queryResult, queryCommand);

Second:

After adding more docs (~3000), I got an Exception in SolrIndexSearcher at line ~1300:
qr.setDocSet(filter == null ? qDocSet : qDocSet.intersection(filter));

As the NegatedDocSet doesn't implement the iterator() function, this call led to an Unsupported Operation exception. I just naively tried to implement this function using "return source.iterator()". Works fine for me.

As the first issue is very clear, I wanted to check my approach for the second one before I post a patch. Maybe there are some side effects that I missed.


Doug Steigerwald added a comment - 21/Aug/08 04:32 PM
I'm in the process of updating our Solr build and I'm running into issues with this patch now. I added the code in the first issue Matthias mentioned. Unfortunately whenever I try to do any field collapsing, I get a NPE:

java.lang.NullPointerException
at org.apache.solr.search.CollapseFilter.getCollapseInfo(CollapseFilter.java:263)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:65)
...

My request handler for testing is simple. It only has the collapse component in it. Posting the example docs and trying to execute the following query gives me the NPE.

http://localhost:8983/solr/search?q=*:*&collapse.field=cat&collapse.type=normal

Updated my trunk this morning (r687489).


Oleg Gnatovskiy added a comment - 21/Aug/08 04:44 PM - edited
I was able to hack the latest patch in, and to get it to work, but it required some pretty heavy naive changes...

If you are getting an NPE try this: in the SolrIndexSearcher class, in the getDocListC method change out = new DocListAndSet(); to

DocListAndSet out = null;
if(qr.getDocListAndSet() == null)
out = new DocListAndSet();
else
out = qr.getDocListAndSet();


Mark Miller added a comment - 06/Oct/08 01:25 AM - edited
Sorting twice (when not sorting on the collapse field) only makes sense if we are doing external sorts (hard drive), correct? It seems to me that this should be closer to the facet stuff (in using the field cache) and then use a hash table of accumulators, which should be linear time (is that generally true?), right? (edit: looks like that's too memory intensive)

As Otis mentions above, this issue appears very popular. We should finish it up.


Oleg Gnatovskiy added a comment - 09/Oct/08 06:36 PM
What's a hard drive sort?

Mark Miller added a comment - 09/Oct/08 06:45 PM - edited

What's a hard drive sort?

Sorry - was not very clear.

Just like sorting, finding dupes can be done in memory or using external storage (hard drive). I am only just looking into this stuff myself, but it seems in the best case you would want to do it in memory with a hash system, which can scale linearly. If you have too many items to look for dupes in, you have to use external storage - one good method is two sorts (we get one from the search), but there are other options too I think. In this case, the sorts are able to be done in memory though, but I think the hashtable method of identifying dupes is much less memory efficient (too many unique terms).
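
To make the hash approach concrete, a minimal sketch follows. HashGroupingSketch and groupSizes() are made-up names, and the input is assumed to be the collapse-field value of each hit in rank order; this is not how the patch finds duplicates. It is one pass over the hits (linear time), but the map holds one accumulator per unique field value, which is exactly the memory cost mentioned above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of hash-based duplicate detection; not the patch's code.
class HashGroupingSketch {
    /**
     * One pass over the hits: each distinct field value gets an accumulator.
     * Time is linear in the number of hits; memory grows with the number of
     * unique values, which is the weak spot when the field has many terms.
     */
    static Map<String, Integer> groupSizes(List<String> fieldValuesInRankOrder) {
        Map<String, Integer> sizes = new LinkedHashMap<>(); // keeps first-seen order
        for (String value : fieldValuesInRankOrder) {
            sizes.merge(value, 1, Integer::sum);
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("Restaurant", "Bar/Club", "Restaurant", "Restaurant");
        System.out.println(groupSizes(hits)); // prints {Restaurant=3, Bar/Club=1}
    }
}
```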


Vaijanath N. Rao added a comment - 31/Oct/08 03:22 AM
Hi All,

I am trying to apply this patch to the solr-1.4 code and am getting the following error.
It occurs at line number 58 of CollapseComponent.java, and the error is:
The method getDocListAndSet (Query, List<Query>, Sort, int , int , int) in the type SolrIndexSearcher is not applicable for the arguments (Query, List<Query>, DocSet, Sort, int , int , int)

Can anyone tell me the correction I need to make to get this code working?

--Thanks and Regards
Vaijanath


Vaijanath N. Rao added a comment - 31/Oct/08 08:01 AM
Hi All,

I got this patch working, but for the 1.3 code and not 1.4. I will try to get it working on 1.4 and will tell you the results. I pulled in some code from an older version, namely for
getDocListAndSet
getDocListNC
getDocListC.

I also added a constructor DocSetHitCollector(int maxDoc) with the following code:
public DocSetHitCollector(int maxDoc) {
    // delegate to the existing three-argument constructor
    this(HashDocSet.DEFAULT_INVERSE_LOAD_FACTOR, maxDoc, maxDoc);
}

I wanted to know if any of the additions harm any other component of solr.

Do I need to make any changes to solrconfig other than the following?

Adding <arr name="first-components"> <str>collapse</str> </arr> this to standard and dismax query handler
<searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

I will check this with highlighting and let you all know of any observation that I make.

--Thanks and Regards
Vaijanath


Iván de Prado added a comment - 13/Nov/08 04:33 PM
A patch for field collapsing over Solr 1.3.0. It changes the behavior to be more memory friendly when the parameter collapse.maxdocs is used.

Iván de Prado added a comment - 13/Nov/08 04:59 PM
I attached a patch named collapsing-patch-to-1.3.0-ivan.patch. The patch applies to Solr 1.3.0.

Karsten commented in the comment "Karsten Sperling - 06/Nov/07 02:06 PM":

Inverted the logic of the filter DocSet created by CollapseFilter to contain the documents that are to be collapsed instead of the ones that are to be kept. Without this collapse.maxdocs doesn't work.

I found that this approach consumes a lot of memory, even if your query is bounded to a small number of documents. I also found that there is no advantage to using collapse.maxdocs if it doesn't speed up queries and reduce the amount of memory needed.

So I decided to revert Karsten's change in order to make field collapsing faster and less resource-intensive when querying smaller datasets.

WARNING: This patch changes the semantics of collapse.maxdocs. Before this patch, collapse.maxdocs was used just to reduce the number of docs checked for grouping, while the remaining documents that were not grouped were still presented in the result.

With the current patch, only documents that were examined for grouping can appear in the result. These semantics have two benefits:

  • The amount of resources can be controlled per query
  • No ungrouped content is presented.

Doug Steigerwald added a comment - 09/Dec/08 08:30 PM
I'm having an issue with Ivan's latest patch. I'm testing on a data set of 8113 documents. All the documents have a string field called site. There are only two sites, Site1 and Site2.

Site1 has 3466 documents.
Site2 has 4647 documents.

With the following simple query, I only get 1 result:
http://localhost:8983/solr/core1/search?q=*:*&collapase=true&collapse.field=site

....
<lst name="collapse_counts">
<str name="field">site</str>
<lst name="doc">
<int name="site2-doc-2981790">4646</int>
</lst>
<lst name="count">
<int name="Site2">4646</int>
</lst>
<str name="debug">HashDocSet(2) Time(ms): 0/0/0/0</str>
</lst>
<result name="response" numFound="1" start="0">
....

The only result displayed is for Site2.

I have an older patch working with Solr 1.3.0, but I can't get it to mesh with localsolr properly. My localsolr gives 1656 results, and collapsed on the site it should give 2 results but gives 8 results, some of which are duplicate documents. Without localsolr, my field collapsing patch seems to work fine.


Ryan McKinley added a comment - 09/Dec/08 09:58 PM
What is the "localsolr" field you are talking about?

Is it the solr stuff from http://sourceforge.net/projects/locallucene ?


Doug Steigerwald added a comment - 10/Dec/08 12:39 AM
Yes, that localsolr. I've just been trying to get the two components working together but haven't had much luck.

Separately they work fine, but together not so much. I can't get the field collapsing to work correctly with an existing result set from the localsolr component in the response builder.


Iván de Prado added a comment - 10/Dec/08 04:31 PM - edited
I have attached a new patch that fixes the problems in my first submitted patch. Doug Steigerwald, could you check if this patch works well for you? Thanks.

Doug Steigerwald added a comment - 10/Dec/08 05:16 PM
Looks fine from my little bit of testing.

Stephen Weiss added a comment - 11/Dec/08 07:48 PM
I'm using Ivan's patch and running into some trouble with faceting...

Basically, I can tell that faceting is happening after the collapse - because the facet counts are definitely lower than they would be otherwise. For example, with one search, I'd have 196 results with no collapsing, I get 120 results with collapsing - but the facet count is 119??? In other searches the difference is more drastic - In another search, I get 61 results without collapsing, 61 with collapsing, but the facet count is 39.

Looking at it for a while now, I think I can guess what the problem might be...

The incorrect counts seem to only happen when the term in question does not occur evenly across all duplicates of a document. That is, multiple document records may exist for the same image (it's an image search engine), but each document will have different terms in different fields depending on the audience it's targeting. So, when you collapse, the counts are lower than they should be because when you actually execute a search with that facet's term included in the query, all the documents after collapsing will be ones that have that term.

Here's an illustration:

Collapse field is "link_id", facet field is "keyword":

Doc 1:
id: 123456,
link_id: 2,
keyword: Black, Printed, Dress

Doc 2:
id: 123457,
link_id: 2,
keyword: Black, Shoes, Patent

Doc 3:
id: 123458,
link_id: 2,
keyword: Red, Hat, Felt

Doc 4:
id: 123459,
link_id:1,
keyword: Felt, Hat, Black

So, when you collapse, only two of these documents are in the result set (123456, 123459), and only the keywords Black, Printed, Dress, Felt, and Hat are counted. The facet count for Black is 2, the facet count for Felt is 1. If you choose Black and add it to your query, you get 2 results (great). However, if you add Felt to your query, you get 2 results (because a different document for link_id 2 is chosen in that query than is in the more general query from which the facets are produced).

I think what needs to happen here is that all the terms for all the documents that are collapsed together need to be included (just once) with the document that gets counted for faceting. In this example, when the document for link_id 2 is counted, it would need to appear to the facet counter to have keywords Black, Printed, Dress, Shoes, Patent, Red, Hat, and Felt, as opposed to just Black, Printed, and Dress.
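
To spell out the difference with the four example documents above, here is a small sketch. GroupFacetSketch, Doc and both facet methods are made-up names for illustration; this is not how the patch or Solr's faceting is implemented, and it assumes the first document seen for each link_id is the one kept after collapsing (which matches the example: 123456 and 123459).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not the patch's or Solr's faceting code.
class GroupFacetSketch {
    static final class Doc {
        final int id;
        final int linkId;
        final List<String> keywords;

        Doc(int id, int linkId, String... keywords) {
            this.id = id;
            this.linkId = linkId;
            this.keywords = Arrays.asList(keywords);
        }
    }

    /** Current behaviour as described: count only the keywords of the surviving document. */
    static Map<String, Integer> facetOnRepresentatives(List<Doc> docs) {
        Set<Integer> seenGroups = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            if (!seenGroups.add(doc.linkId)) continue; // a later duplicate, collapsed away
            for (String keyword : doc.keywords) counts.merge(keyword, 1, Integer::sum);
        }
        return counts;
    }

    /** Proposed behaviour: each group contributes the union of its keywords, once per term. */
    static Map<String, Integer> facetOnGroupUnion(List<Doc> docs) {
        Map<Integer, Set<String>> termsByGroup = new LinkedHashMap<>();
        for (Doc doc : docs) {
            termsByGroup.computeIfAbsent(doc.linkId, g -> new TreeSet<>()).addAll(doc.keywords);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> terms : termsByGroup.values()) {
            for (String keyword : terms) counts.merge(keyword, 1, Integer::sum);
        }
        return counts;
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc(123456, 2, "Black", "Printed", "Dress"),
            new Doc(123457, 2, "Black", "Shoes", "Patent"),
            new Doc(123458, 2, "Red", "Hat", "Felt"),
            new Doc(123459, 1, "Felt", "Hat", "Black"));
    }

    public static void main(String[] args) {
        // prints {Black=2, Dress=1, Felt=1, Hat=1, Printed=1}
        System.out.println(facetOnRepresentatives(sampleDocs()));
        // prints {Black=2, Dress=1, Felt=2, Hat=2, Patent=1, Printed=1, Red=1, Shoes=1}
        System.out.println(facetOnGroupUnion(sampleDocs()));
    }
}
```

With the union approach, Felt is counted as 2, which matches the 2 collapsed results you get once Felt is added to the query.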


Iván de Prado added a comment - 12/Dec/08 10:16 AM
You can try collapse.facet=before, but then you'll notice that all documents are returned, not only the collapsed ones.

Stephen Weiss added a comment - 13/Dec/08 09:43 PM
Yes, this is basically what I'm doing for now... At least it's reasonable enough to explain to a client that the counts are for unfiltered results. However, ideally, it should be able to facet properly on filtered results as well...

Also, with simply collapse.facet=before, the results returned are the unfiltered results. You have to specify collapse.facet=after to get filtered results at all, and then run the query component right before the facet component to get the unfiltered facet counts... which doesn't seem to be ideal. This is with the release version of SOLR 1.3 and Iván's most recent patch. All in all it took a lot of experimenting, but at least now I have a method that works that we can go live with, and then we'll just update the software as the situation improves.

Thanks for all your efforts on the patch! I complain but really, the fact it works at all is a miracle for us.


Stephen Weiss added a comment - 15/Dec/08 07:36 PM
I get an error on certain searches with Ivan's latest patch.

Dec 15, 2008 2:32:00 PM org.apache.solr.core.SolrCore execute
INFO: [ss_image_core] webapp=/solr path=/select params={collapse=true&facet.limit=5&wt=json&rows=50&json.nl=map&start=0&sort=add_date+desc,+object_id+asc&facet=true&collapse.facet=after&f.season.facet.limit=-1&facet.mincount=1&fl=object_id&q=object_type:image+AND+classif_name:(19097)+AND+market:(49154)+AND+perms:(1835+OR+4785+OR+1725+OR+1690+OR+2816+OR+3149+OR+3082+OR+2815+OR+2814+OR+3083+OR+4783)&version=1.2&f.classif_name.facet.limit=-1&collapse.field=link_id&collapse.threshold=1&facet.field=classif_name&facet.field=market&facet.field=season&facet.field=city&facet.field=designer&facet.field=category&facet.field=keywords&facet.field=lifestyle} hits=263059 status=500 QTime=4508
Dec 15, 2008 2:32:00 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 41386
at org.apache.solr.util.OpenBitSet.fastSet(OpenBitSet.java:235)
at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:214)
at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:171)
at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:139)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:52)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

Unfortunate really, it happens every time this specific search is run, but many, many other searches of similar result set size and considerably more complexity or equivalent complexity will execute fine... I can't honestly tell you what's special about this one search that would make it fail.

For now the patch is offline until we can figure something out for it... I can provide access to the machine (I managed to reproduce it in a test environment) if it would help determine what the problem is / make the software better for everyone.


Karsten Sperling added a comment - 16/Dec/08 09:20 PM
I'm pretty sure the problem Stephen ran into is an off-by-one error in the bitset allocation inside the collapsing code; I ran into the same problem when I customized it for internal use about half a year ago – and unfortunately forgot all about the problem until reading Stephen's comment just now. Basically the bitset gets allocated 1 bit too small, so there's about a 1/32 chance that if the bit for the document with the highest ID gets set it will cause the AIOOB exception.
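
As an illustration only, a toy fixed-size bitset shows why a set allocated one bit too small fails just occasionally. This is a plain Python sketch, not Solr's OpenBitSet; the class, the function and the 64-bit word size are assumptions chosen for the example.

```python
class FixedBitSet:
    """Toy fixed-size bitset backed by a list of words (not Solr's OpenBitSet)."""

    def __init__(self, num_bits, word_bits=64):
        self.word_bits = word_bits
        # Number of words needed to hold num_bits bits, rounded up.
        self.words = [0] * ((num_bits + word_bits - 1) // word_bits)

    def fast_set(self, index):
        # No bounds check and no growth: raises IndexError if the word is missing.
        self.words[index // self.word_bits] |= 1 << (index % self.word_bits)


def highest_doc_fits(max_doc, num_bits):
    """Return True if setting the bit for doc id max_doc - 1 succeeds."""
    bits = FixedBitSet(num_bits)
    try:
        bits.fast_set(max_doc - 1)
        return True
    except IndexError:
        return False
```

Sizing the set with `max_doc` bits always works. Sizing it with `max_doc - 1` bits only fails when the highest document id falls exactly on a word boundary, because rounding up to whole words hides the missing bit in every other case. In this toy that is one index size in every `word_bits`.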

Iván de Prado added a comment - 17/Dec/08 12:43 PM
Karsten Sperling was right. Seems that there was a wrong bounds initialization for the OpenBitSet. I have solved it and attached a new patch.

Stephen Weiss, can you test if now the error has disappeared?

Thanks.


Stephen Weiss added a comment - 22/Dec/08 07:55 AM
Yes! It does work. Thank you both so much! It's been running for 5 days now without a hiccup. This is going into production use now (we'll be monitoring), they simply can't wait for the functionality. From here it looks like if you get faceting tidied up and some docs written, they should be including this soon!

Ryan McKinley added a comment - 22/Dec/08 06:09 PM
I see there is a patch against 1.3; is there any current patch against trunk? (We would need something against trunk in order to consider this for 1.4.)

Thomas Traeger added a comment - 27/Dec/08 01:46 AM
I tested 1.3 and Iván's latest patch.

When I add a filter query (fq param) to my query I get the exception "Either filter or filterList may be set in the QueryCommand, but not both.". I'm not that familiar with Java, but I at least disabled the exception in SolrIndexSearcher.java. I can use filter queries now and no problems have occurred so far. But surely this has to be handled in another way.

Btw, I think this had already been fixed by Karsten back in 2007 in some way (patch field-collapsing-extended-592129.patch). He commented it with:

"Made a minimal change to SolrIndexSearcher.getDocListC() to support passing both the filter and filterList parameters. In most cases this was already handled anyway."


dieter grad added a comment - 29/Jan/09 02:40 PM

I had to make a patch to fix two issues that we needed for our system. I am not used to this code, so maybe someone can pick this patch and make it something useful for everybody.

The fixes are:

1) When collapse.facet=before, only the collapsed documents are returned (and not the whole collection).

2) When collapsing is normal, the selected sort order is preserved by returning the first document of the collapsed group.

For example, if the values of the collapsing field are:

1) Y
2) X
3) X
4) Y
5) X
6) Z

the documents returned are 1, 2 and 6, in that order.

So, for example, if you sort by price ascending, you will get the result sorted by price, where each item is the cheapest item of its collapsed group.
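
The ordering rule described above can be sketched in a few lines of plain Python. This is only an illustration of the behaviour, not the patch's code, and the function name is hypothetical.

```python
def collapse_preserving_order(values):
    """Return 1-based positions of the first document in each collapsed group.

    `values` holds the collapse-field value of each document, already in the
    requested sort order.
    """
    seen = set()
    kept = []
    for position, value in enumerate(values, start=1):
        if value not in seen:
            seen.add(value)
            kept.append(position)
    return kept


print(collapse_preserving_order(["Y", "X", "X", "Y", "X", "Z"]))  # [1, 2, 6]
```

Because the input is already sorted (for example by price ascending), the first document of each group is also the best one under that sort.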


Shalin Shekhar Mangar added a comment - 17/Feb/09 07:29 AM
Marked for 1.5

Oleg Gnatovskiy added a comment - 18/Feb/09 10:06 PM
Are there any concrete plans on where this feature is going? Is it ever going to get support for distributed search?

Stephen Weiss added a comment - 06/Mar/09 02:12 PM - edited
Help!!

We've been using this patch in production for months now, and suddenly in the last 3 days it is crashing constantly.

Edit - It's Ivan's latest patch, #3, with Solr 1.3 dist

Mar 6, 2009 5:23:50 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:701)
at org.apache.solr.util.OpenBitSet.ensureCapacity(OpenBitSet.java:711)
at org.apache.solr.util.OpenBitSet.expandingWordNum(OpenBitSet.java:280)
at org.apache.solr.util.OpenBitSet.set(OpenBitSet.java:221)
at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:217)
at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:171)
at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:139)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:52)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

It seems to happen randomly - there's no special request happening, nothing new added to the index, nothing. We've made no configuration changes. The only thing that's happened is more documents have been added since then. The schema is the same, we have perhaps 200000 more documents in the index now than we did when we first went live with it.

It was a 32-bit machine allocated 2GB of RAM for Java before. We just upgraded it to 64-bit and increased the heap space to 3GB, and still it went down last night. I'm at my wits' end; I don't know what to do, but this functionality has been live so long now that it's going to be extremely painful to take it away. Someone, please tell me if there's anything I can do to save this thing.


Iván de Prado added a comment - 06/Mar/09 02:52 PM
That is one of the problems that this patch has: The consumption of resources (memory and CPU) increases with the amount of results in the query and with the amount of requests.

It is not trivial to change that. I imagine that deep changes in Solr or Lucene would be needed to have efficient collapsing.

The advice I can give you is:

  • Increase the amount of memory available to your Solr instance
  • Use the parameter "collapse.maxdocs". This parameter limits the number of documents that are seen when collapsing. By using it, you'll limit the amount of memory and resources used per query. But if your query matches more than maxdocs documents, then the collapsing won't be perfect: some documents won't be collapsed.

I hope this helps.


Stephen Weiss added a comment - 06/Mar/09 03:12 PM
Thank you so much for your prompt response Ivan, I really appreciate your help.

I have already maxed out the RAM on the machine - it seems very strange to me that adding a whole other GB of RAM did not fix the issue already. So I will have to try the next option, collapse.maxdocs.

How does this work though? Does this mean, let's say I set collapse.maxdocs to 10000, that means the first 10000 documents will be collapsed, and after that they won't be? Or is it more random?


Iván de Prado added a comment - 06/Mar/09 03:24 PM
It's not random. I don't remember exactly, but I think the documents are sorted by the collapsing field. After that, they are grouped sequentially until maxdocs is reached. The groups that result from there are the documents that are presented, so the number of groups is always at most maxdocs.

Summary: only maxdocs are scanned to generate the resulting groups.
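
A loose sketch of that description in plain Python follows. It is based only on the recollection above, not on the patch source, so the sort order and the choice of representative are assumptions, as is the function name.

```python
from itertools import groupby, islice


def collapse_with_maxdocs(docs, field, maxdocs):
    """Group only the first `maxdocs` documents after sorting by `field`.

    Returns one representative (the first document) per group. Documents
    beyond `maxdocs` are never scanned, so they cannot appear in the result.
    """
    ordered = sorted(docs, key=lambda d: d[field])
    scanned = islice(ordered, maxdocs)
    return [next(group) for _, group in groupby(scanned, key=lambda d: d[field])]
```

With ten documents spread over five groups and `maxdocs=5`, only the first three groups are produced. Everything past the cut-off is dropped rather than returned uncollapsed.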


Stephen Weiss added a comment - 06/Mar/09 03:43 PM
Unfortunately I don't think that will work for us. The collapse.maxdocs seems to collapse the oldest documents in the index - but we sort from newest to oldest, so effectively the newest documents in the index are just left out. Not only do they not collapse but they don't appear at all. If this is the only solution then we will have to stop using the patch... and unfortunately this means in general we will probably have to stop using Solr. The company has already made clear that this functionality is required, and especially since it has been working now for several months they will be very unlikely to accept that they can't have it anymore.

Anyway I don't want to give up yet...

I'm really not convinced this is really a problem of running out of the necessary memory to complete the operation - it only started doing this very recently. How does it run for 3 months with 2GB of RAM without any trouble, and now it fails even with 3GB of RAM? It's not like we just added those 200000 documents yesterday - they have accumulated over the past few months, in the past 3 days we've only perhaps added 20,000 documents. 20,000 more documents (with barely any new search terms at all) means it needs more than 1GB of memory more than what it was already using? If we grow by 25% every year that means by December we will need 50GB of RAM in the machine.


Mark Miller added a comment - 06/Mar/09 03:49 PM
How much RAM does the machine have total? 4 GB?

Do you ever commit rapidly?

You might try decreasing your cache sizes if you are using them.


Stephen Weiss added a comment - 06/Mar/09 04:37 PM
The machine has 4GB total. In response to this issue, and especially now that we have upgraded it to be 64 bit (again, for this issue), we have already ordered another 16 GB for the machine to try and stave off the problem. We should have it in next week.

I restrict commits severely - a commit is only allowed once an hour, in practice they happen even less frequently - perhaps 5 or 6 times a day, and very spread out. We are freakishly paranoid. But honestly that's all we need - new documents come in in chunks and generally they want them to go in all at once, and not piecemeal, so that the site updates cleanly (the commits are synchronized with other content updates - new images on the home page, etc).

Some more information... just trying to toss out anything that matters. We have a very small set of possible terms - only 60,000 or so which tokenize to perhaps 200,000 total distinct words. We do not use synonyms at index time (only at query time). We use faceting, collapsing, and sorting - that's about it, no more like this or spellchecker (although we'd like to, we haven't gotten there yet). Faceting we do use heavily though - there are 16 different fields on which we return facet counts. All these fields together represent no more than 15,000 unique terms. There are approx. 4M documents in the index total, and none of them are larger than 1K.

Memory usage on the machine seems to steadily increase - after restart and warming, 40% of the RAM on the machine is in use. Then, as searches come in, it steadily increases. Right now it is using 61%, in an hour it will probably be closer to 75% - the danger zone. This is also unusual because before, it used to stay pretty steady around 52-53%.

This is a multi-core system - we have 2 cores, the one I'm describing now is only one of them. The other core is very, very small - total 8000 documents, which are also no more than 1 K each. We do use faceting there but no collapsing (it is not necessary for that part). It is essentially irrelevant, with or without that core the machine consumes about the same amount of resources.

In response to this problem I have already dramatically reduced the following options:

< <mergeFactor>2</mergeFactor>
< <maxBufferedDocs>100</maxBufferedDocs>
---
> <mergeFactor>10</mergeFactor>
> <maxBufferedDocs>1000</maxBufferedDocs>
42c42
< <maxFieldLength>2500</maxFieldLength>
---
> <maxFieldLength>10000</maxFieldLength>
50,51c50,51
< <mergeFactor>2</mergeFactor>
< <maxBufferedDocs>100</maxBufferedDocs>
---
> <mergeFactor>10</mergeFactor>
> <maxBufferedDocs>1000</maxBufferedDocs>
53c53
< <maxFieldLength>2500</maxFieldLength>
---
> <maxFieldLength>10000</maxFieldLength>

( diff of solrconfig.xml - < indicates current values, > indicates values when the problem started happening).

This actually seemed to make the search much faster (strangely enough), but it doesn't seem to have helped memory consumption very much.

These are our cache parameters:

<filterCache
class="solr.LRUCache"
size="65536"
initialSize="4096"
autowarmCount="2048"/>

<queryResultCache
class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="256"/>

<documentCache
class="solr.LRUCache"
size="16384"
initialSize="16384"
autowarmCount="0"/>

<cache name="collapseCache"
class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>

I'm actually not sure if the collapseCache even does anything since it does not appear in the admin listing. I'm going to try reducing the filterCache to 32K entries and see if that makes a difference. I think that may be the right track since otherwise it seems like a big memory leak is happening.

Is there any way to specify the size of the cache in terms of the actual size it should take up in memory, as opposed to the number of entries? 64K sounded quite small to me but now I'm thinking that 64K could mean GB's of memory depending on what the entries are, I honestly don't understand what the correlation would be between an entry and the size that entry takes in RAM.
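
For a rough feel of how entries relate to RAM, here is a back-of-the-envelope sketch in plain Python. It assumes the worst case where every cached filter is held as a full bitset with one bit per document in the index; small result sets may well be stored more compactly, so treat this as an upper bound rather than a measurement. The function names are hypothetical.

```python
def bitset_bytes(max_doc):
    """Bytes needed for one bit per document, rounded up to whole bytes."""
    return (max_doc + 7) // 8


def worst_case_cache_bytes(entries, max_doc):
    """Upper bound if every cached filter were stored as a full bitset."""
    return entries * bitset_bytes(max_doc)


MAX_DOC = 4_000_000  # approximate index size mentioned above

for entries in (65536, 32768):
    gib = worst_case_cache_bytes(entries, MAX_DOC) / 2**30
    print(f"{entries} entries -> up to {gib:.1f} GiB")
```

Under that assumption a single entry over 4M documents is about 500 KB, so a completely full 64K-entry cache could in principle reach roughly 30 GiB. That is why the entry count alone says little about memory use.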


Mark Miller added a comment - 06/Mar/09 04:57 PM

we have already ordered another 16 GB for the machine to try and stave off the problem. We should have it in next week.

Great. You've got a lot going on here, and 4 GB is on the extremely low end of what I'd suggest.

I restrict commits severely -

Good news again.

In response to this problem I have already dramatically reduced the following options:

Dropping the merge factor is not likely to help much. It will increase the time it takes to add docs (merges occur much more often) for the benefit of maintaining an almost optimized index at all times (hence the faster search speed). Not a big RAM factor though.

Also, dropping the max buffered docs is also probably not a huge saver, and will only affect RAM usage during indexing. Going from 1000 to 100 will likely hurt indexing performance and not save that much RAM in the larger scheme of things.

And dropping the maxFieldLength will hide parts of the document that are over that length - perhaps you'll end up with a handful fewer index terms, but again, not likely a big savings here and may do more harm than good.

My suggestion of lowering your cache sizes was just a thought to eke out some more RAM for you. It's not really suggested though if you can get more RAM. For best performance, those caches should be set correctly. If you are using the fieldcache method for faceting, you want the size of the filter cache to be the same as the number of unique terms you are faceting on. The other caches are not so large that I would suggest trimming them.

The reality is, you've got 4 million docs, sorting (uses field caches), faceting (likely uses field caches), and this resource intensive field collapse patch. More RAM is probably your best bet. Every document you add potentially adds to the RAM usage of each of these things. That doesn't mean you don't have a different problem (it does seem weird it ballooned all of a sudden), but you're running some RAM hungry stuff here, and it wouldn't blow my mind that 3 gig is not enough to handle it. It could be that only recently the right searches started coming in at the right times to fire up all your needs at once. Much of this may be lazy loaded or loaded on the fly depending on if and how you have configured your warming searches.


Stephen Weiss added a comment - 06/Mar/09 05:25 PM
Thanks. In the wiki next to each one of these parameters it explicitly says that reducing this parameter will decrease memory usage, this is why we reduced these parameters (it did not mention the filterCache at all).

I really do hope the RAM will help. It certainly can't hurt.

My filterCache stats are great- you know it's set to 64K but right now, with almost all the RAM used up (we're at 71.9% now), but it's only using 36290 entries at the moment and it's holding pretty steady there (even as RAM usage increased by 10%). None of the other caches have gone up much either. We have no cache evictions, at all, but a 99% hit ratio.

I'm going to try lowering the filterCache to be just above the number it's at now, since that amount seems to be all it needs. It's possible that at crash time it suddenly uses a lot more of it for some reason - I have a feeling it might be related to a new permissions group that was added 3 days ago. That might trigger a lot more filters. It is barely used at all yet except by one client - I'm going to go check and see if there's any correspondence between when that client logs in and when the problem occurs - I bet there is.

Thanks for all your help guys.


Mark Miller added a comment - 06/Mar/09 05:39 PM

Thanks. In the wiki next to each one of these parameters it explicitly says that reducing this parameter will decrease memory usage, this is why we reduced these parameters (it did not mention the filterCache at all).

They will save RAM to a certain extent for certain situations. But not very helpful at the sizes you are working with (and not settings I would use to save RAM anyway, unless the amount I need to save was pretty small). Also, the savings are largely index side - not likely a huge part of your RAM concerns, which are search side.

My filterCache stats are great- you know it's set to 64K but right now, with almost all the RAM used up (we're at 71.9% now), but it's only using 36290 entries at the moment and it's holding pretty steady there(even as RAM usage increased by 10%). None of the other caches have gone up much either. We have no cache evictions, at all, but a 99% hit ratio.

The sizes may be higher than you need then. They should be adjusted to the best settings based on the wiki info. I was originally suggesting you might sacrifice speed with the caches for RAM - but it's always best to use the best settings and have the necessary RAM.


Dmitry Lihachev added a comment - 25/Mar/09 08:24 AM
When I add a Filter Query (fq param) to my query I get an exception "Either filter or filterList may be set in the QueryCommand, but not both."

Dmitry Lihachev added a comment - 25/Mar/09 08:26 AM
This patch (based on dieter's patch) allows using the fq parameter.

Dave Redford added a comment - 02/Apr/09 12:54 AM - edited
There is an issue with collapsed result ordering when querying with only the unique Id and score fields in the request.

[Update: this is only an issue when both standard results and collapse results are present - which I was using for testing]

eg:
q=ford&version=2.2&start=0&rows=10&indent=on&fl=Id,score&collapse.field=PrimaryId&collapse.max=1

gives wrong ordering (note: Id is our unique Id)

but adding another field - even a bogus one - works.
q=ford&version=2.2&start=0&rows=10&indent=on&fl=Id,score,bogus&collapse.field=PrimaryId&collapse.max=1

Also using an fq makes it work
eg:
fq=Type:articles&q=ford&version=2.2&start=0&rows=10&indent=on&fl=Id,score&collapse.field=PrimaryId&collapse.max=1

I'm using the latest Dmitry patch (25/mar/09) against 1.3.0.

Apart from that great so far...thanks to all


Jeff added a comment - 16/Apr/09 10:09 PM - edited
We have tried to integrate the most recent patch into our 1.4 install. The patching was smooth and overall it works well. However, it appears the issue with fq has returned. Whenever I try to filter the query it gives "Either filter or filterList may be set in the QueryCommand, but not both." Not sure what happened. What part of the patch makes it possible for fq to work? It may not be there now.

Additionally, collapse.facet=before does not seem to work. Any help in this area would be greatly appreciated.


Domingo Gómez García added a comment - 23/Apr/09 09:19 AM - edited
I checked out release-1.3.0 from svn and applied SOLR-236_collapsing.patch.
Is there any way to integrate it with SolrJ?

Oleg Gnatovskiy added a comment - 29/Apr/09 03:45 PM
How did you fix the memory issue?

Domingo Gómez García added a comment - 29/Apr/09 03:49 PM
-XX:PermSize=1524m -XX:MaxPermSize=1524m -Xmx128m
It's not a real fix, but works for now...

Thomas Traeger added a comment - 06/May/09 10:48 PM
This patch is based on the latest patch by Dmitry, it addresses the following issues:
  • the CollapseComponent now simply falls back to the process method of QueryComponent when no collapse.field is defined. This fixes issues with the fq param when collapsing was disabled and makes CollapseComponent a fully compatible replacement for QueryComponent.
  • collapse.facet=before is now fixed, the previous patch ignored any filter queries (fq) and therefore returned wrong facet counts
  • ResponseBuilder "builder" renamed to "rb" to match QueryComponent

This patch applies to trunk (rev. 772433) but works with Solr 1.3 too. For 1.3 you have to move CollapseParams.java from common/org/apache/solr/common/params to java/org/apache/solr/common/params/ as the location of this file has been changed in trunk.

This is my first contribution so any feedback is much appreciated. This is a great feature so let's get it into Solr as soon as possible.


Martijn van Groningen added a comment - 29/May/09 12:52 PM - edited
Hi,

I have modified the latest patch of Thomas and made two performance improvements:
1) Improved normal field collapsing. I tested it with an index of 1.1 million documents. When collapsing on all documents with no sorting specified (so sorting on score), the query time is around 130 ms, compared with around 1.5 s for the previous patch. When I then add sorting on a string field, the query time is around 220 ms, compared with around 5.2 s for the previous patch.

The reason why it is faster is because the latest patch queries for a doclist instead of a docset. In the normal collapse method it keeps track of the most relevant documents, so the end result is the same, also creating a docList of 1.1 million documents (and ordering it) is very expensive.

Note: I did not improve adjacent collapsing, because the adjacent method needs (as far as I understand it) a completely sorted list of documents (docList).

2) Slightly improved faceting in combination with field collapsing, by reusing the uncollapsed docset that is created during the collapsing process (the previous patch invoked a second search).
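
One way to picture the first improvement, as a plain Python sketch rather than the patch's actual code: track the best hit per group in a single pass and order only the representatives that are needed, instead of building and sorting a list of every matching document. The tuple layout and the function name are assumptions for illustration.

```python
import heapq


def collapse_top_n(hits, n):
    """Single pass over unordered (doc_id, group, score) hits.

    Keeps only the best-scoring hit per group, then orders just the `n`
    best representatives instead of sorting every matching document.
    """
    best = {}
    for doc_id, group, score in hits:
        current = best.get(group)
        if current is None or score > current[2]:
            best[group] = (doc_id, group, score)
    # heapq.nlargest orders only the per-group representatives.
    return heapq.nlargest(n, best.values(), key=lambda hit: hit[2])
```

The result is the same as sorting all hits by score and then keeping the first hit of each group, but the work grows with the number of groups and requested rows rather than with a full sort of all matches.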

I also have added documentation, added a few unit tests for the collapsing process itself and made the debug information more readable.
This patch works from revision 779335 (last Wednesday) and up. This patch depends on some changes in Solr and a change inside Lucene.

I'm very interested in other people's experiences with this patch and feedback on the patch itself.

Cheers,

Martijn


Thomas Traeger added a comment - 30/May/09 07:14 AM
I made some tests with your patch and trunk (rev. 779497). It looks good so far but I have some problems with occasional null pointer exceptions when using the sort parameter:

http://localhost:8983/solr/select?q=*:*&collapse.field=manu&sort=score%20desc,alphaNameSort%20asc

java.lang.NullPointerException
at org.apache.lucene.search.FieldComparator$RelevanceComparator.copy(FieldComparator.java:421)
at org.apache.solr.search.CollapseFilter$DocumentComparator.compare(CollapseFilter.java:649)
at org.apache.solr.search.CollapseFilter$DocumentPriorityQueue.lessThan(CollapseFilter.java:596)
at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:153)
at org.apache.solr.search.CollapseFilter.normalCollapse(CollapseFilter.java:321)
at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:211)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:67)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1328)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

These queries work as expected:
http://localhost:8983/solr/select?q=*:*&collapse.field=manu&sort=score%20desc
http://localhost:8983/solr/select?q=*:*&sort=score%20desc,alphaNameSort%20asc


Martijn van Groningen added a comment - 30/May/09 11:26 AM
Thanks for the feedback, I fixed the problem you described and I have added a new patch containing the fix.
The problem occurred when sorting was done on one or more normal fields and on score.

Thomas Traeger added a comment - 30/May/09 04:11 PM
The problem is solved, thanks. I will use your patch for my current project that is planned for golive in 5 weeks. If I find any more issues I will report them here.

Oleg Gnatovskiy added a comment - 30/May/09 04:25 PM
Hey guys, are there any plans to make field collapsing work on multi shard systems?

Martijn van Groningen added a comment - 30/May/09 05:39 PM
I'm looking forward to hearing about your experiences with this patch, particularly in production.

I think in order to make collapsing work on multi shard systems the process method of the CollapseComponent needs to be modified.
CollapseComponent already subclasses QueryComponent (which already supports querying on multi shard systems), so it should not be that difficult.


Ron Veenstra added a comment - 04/Jun/09 02:15 AM - edited
I require assistance. I've installed a fresh Solr (1.3.0), and all appears to operate well. I then patch using SOLR-236_collapsing.patch [by Thomas Traeger] (the last patch I saw that claimed to work with 1.3.0), without error. I then add the following to solrconfig.xml (per: http://wiki.apache.org/solr/FieldCollapsing) :

<searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

Upon restart, I get a long configuration error, which seems to hinge on:

HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solrconfig.xml ------------------------------------------------------------- org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:273)

[the full error can be included if desired.]

I've verified that the CollapseComponent file exists in the proper place.
I've moved CollapseParams as required, (move CollapseParams.java from common/org/apache/solr/common/params to java/org/apache/solr/common/params/ )
I've tried multiple iterations of the patch (on fresh installs), all with the same issue.

Are there additional steps, patches, or configurations that are required?
Is this a known issue?
Any help is very much appreciated.