I would be glad to present a case for this if it helps; if it does not sound like a useful use case, perhaps I could look at some other tool.
Here's a quick overview of the use-case:
The requirement I have is in analytics. Search results need to be exact: we are 'counting' things precisely, not approximating. The number of facets is not large, but their combinations are large in number (hence the strong case for Solr).
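For context, this exact facet counting is what I do today on a single collection. A minimal sketch of such a request via Solr's JSON Facet API follows; the base URL, collection, and field names are placeholders of my own, not from any real deployment:

```python
import json

# Hypothetical collection/field names; the Solr base URL is a placeholder.
SOLR_URL = "http://localhost:8983/solr/events/select"

# Exact counts over combinations of facets via Solr's JSON Facet API:
# a terms facet on "country", with a nested terms facet on "device".
params = {
    "q": "*:*",
    "rows": 0,  # we only need the counts, not the documents
    "json.facet": json.dumps({
        "by_country": {
            "type": "terms",
            "field": "country",
            "limit": -1,  # all buckets, since the counts must be exact
            "facet": {
                "by_device": {"type": "terms", "field": "device", "limit": -1}
            },
        }
    }),
}

# A real request would be something like:
#   requests.get(SOLR_URL, params=params)
print(params["json.facet"])
```

This works beautifully as long as everything fits in one collection on reachable nodes; the trouble starts when the data it joins against lives elsewhere.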
The number of distinct data containers (collections) is small, but their sizes are large, and denormalizing or keeping the data on a single server are not feasible options. Joins are therefore becoming inevitable as the data grows and needs many servers to store it, both for size constraints and for computing efficiency.
Right now, the only option I have is to use a glue language: collect the 'from' terms from the many shards across servers, send queries with those terms to the 'to' collection's shards on several servers, apply rules to aggregate the results centrally, manage timeouts and the other artificial issues created by this data division, and finally send the aggregated data on for visualisation or other processing.
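To make that concrete, here is a minimal sketch of the glue logic, with shard responses faked as in-memory structures; in reality each fetch would be an HTTP request to a Solr shard, and all function and field names here are my own invention:

```python
from collections import Counter

# --- Faked shard responses; in reality these are HTTP calls to Solr shards.

def fetch_from_terms(shard):
    """Step 1: collect the join keys ('from' terms) that match the query
    on one shard of the 'from' collection."""
    return shard["matching_keys"]

def fetch_to_counts(shard, keys):
    """Step 2: on one shard of the 'to' collection, count the documents
    whose join field is in `keys`, grouped by key."""
    return Counter(k for k in shard["docs_by_key"] if k in keys)

def distributed_join_counts(from_shards, to_shards):
    # Scatter to the 'from' shards and union the keys...
    keys = set()
    for shard in from_shards:
        keys |= set(fetch_from_terms(shard))
    # ...then scatter the keys to the 'to' shards and merge counts centrally.
    total = Counter()
    for shard in to_shards:
        total += fetch_to_counts(shard, keys)
    return total, keys

# Tiny worked example with two shards on each side.
from_shards = [{"matching_keys": ["u1", "u2"]}, {"matching_keys": ["u2", "u3"]}]
to_shards = [
    {"docs_by_key": ["u1", "u1", "u4"]},  # u4 does not match, so it is dropped
    {"docs_by_key": ["u3", "u2"]},
]
counts, keys = distributed_join_counts(from_shards, to_shards)
```

Even this toy version shows the bookkeeping involved, and the real thing must also track which shards live on which server, retry on timeouts, and handle failover.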
As you can see, the charm and pull of Lucene's speed is being dampened by unnecessary data complexity and by the dependence on programming in a glue language: recording the number and types of shards on each server and routing queries to the right targets. Redundancy/failover is another pain to handle, on top of managing a growing number of servers.
Everything I have described is already possible and available in Solr, except that it does not work in a distributed manner, right?
Solr is a beautiful tool that could easily do everything I need if my data did not have to be distributed across machines, as in my case!
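For reference, the single-node case I mean is Solr's join query parser, which already works when the 'from' and 'to' indexes are co-located on one node; the core and field names below are placeholders:

```python
# Solr's join query parser, usable today when both indexes live on one node.
# "users", "orders", "user_id", "owner_id" are placeholder names.
from_index = "users"  # the 'from' side; its documents are never returned
q = "{!join from=user_id to=owner_id fromIndex=%s}country:IN" % from_index

# A real request would send this as the q parameter of a /select call on
# the 'to' core, e.g. requests.get(".../solr/orders/select", params={"q": q})
print(q)
```

What I am asking for is exactly these semantics in the case where both collections are sharded across machines.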
If I denormalize this kind of data, I might end up making it 3-4x its size, which I obviously don't want to do.
If Solr took away this pain, it would be the ideal scalable solution for all search and analytics applications that have multiple large data sets with limits to denormalization.
In my case, I know the data very well and have a good grip on the combinations of facets needed to configure a distributed system, if only it allowed joins with true sharding.
I really think that adding this would bring lots of distributed-computing use cases into the ambit of Solr. There's no telling how much effort it would save for people like me, instead of everybody devising their own distributed-computing management scheme when a common one could solve it for all.
Let me know if this sounds like a reasonable use case. Besides my own, I'm sure there are a lot of people who probably don't use Solr because of this missing feature.
PS: Sorry for getting carried away and for the long mail ;-(