First, thanks to everyone who has spent so much time working on this - lack of committer attention doesn't equate to lack of interest... this is a very much needed feature!
I'd agree with Erik that the most important thing is the interface to the client, and making it well thought out and semantically "tight". Martijn's recent improvements to the response structure are an example of progress in this area. It's also important to think about the interface in terms of how easy it will be to add further features, optimizations, and support for distributed search. If the code isn't sufficiently standalone, we also need to see how easily it fits into the rest of Solr (what APIs it adds or modifies, etc). Actually implementing performance improvements and distributed search can come later - as long as we've thought about them now so we haven't boxed ourselves in.
It seems like field collapsing should just be additional functionality of the query component rather than a separate component since it changes the results?
The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach takes)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like?
From the wiki:
collapse.maxdocs - what does this actually mean? I assume it collects arbitrary documents up to the max (normally by index order)? Does this really make sense? Does it affect faceting, etc? If it does make sense, it seems like it would also make sense for normal non-collapsed query results too, in which case it should be implemented at that level.
collapse.info.doc - what does that do? I understand counts per group, but what's count per doc?
collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find an example on the wiki or blogs. It says "Parameter indicating to return the collapsed documents in the response"... but I thought documents were included up until collapse.threshold.
collapse.debug - should perhaps just be rolled into debugQuery, or another general debug param (someone recently suggested using a comma-separated list: debug=timings,query, etc.).
Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow the highly ranked document responsible for putting its group at the top of the list to be dropped from that group due to a different sort criterion within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately).
Is there a way to specify the number of groups that I want back instead of the number of documents? Or am I supposed to just over-request (rows=num_groups_I_want*threshold) and ignore if I get too many documents back?
Random thought: We need a test to make sure this works with multi-select faceting (SimpleFacets asks for the docset of the base query...)
Distributed Search: should be able to use the same type of algorithm that faceting does to ensure accurate counts.
Performance: yes, it looks like the current code uses a lot of memory.
Here's an algorithm that I thought of on my last plane ride that can do much better (assuming max() is the aggregation function):
=================== two pass collapsing algorithm for collapse.aggregate=max ====================
First pass: pretend that collapseCount=1
- Use a TreeSet as a priority queue since one can remove and insert entries.
- A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
- Compare each new doc with the smallest element in the TreeSet. If smaller, discard it and go to the next doc.
- If the new doc is bigger, look up its group. Use the Map to find whether the group has already been added to the TreeSet, and add it if not.
- If the new doc's group is already in the TreeSet, compare with the document currently representing that group. If bigger, update the node:
remove and re-add it to the TreeSet to re-sort.
Efficiency: the TreeSet and HashMap are both only the size of the number of top docs we are looking at (10, for instance).
We will now have the top 10 documents collapsed by the right field with a collapseCount of 1. Put another way, we have the top 10 groups.
Second pass (if collapseCount>1):
- create a priority queue for each group (10) of size collapseCount
- re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
- for each document, find its appropriate priority queue and insert
- optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched.
So instead of creating collapse groups for every group in the set (as is done now?), we create them for only the top 10 groups.
Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed.
We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.