I just found this issue today while checking whether what I am about to upload would be a duplicate, and it is a good thing I did, since there are apparently quite a few issues about this. But since this is the latest one, I will post it here.
This patch adds another plugin, indexer-solrshard, that shards the index data on the Nutch side. It is mostly geared toward Solr 3.x, as there are still quite a few of those around (including in our production environment), but it can also benefit Solr 4.x, which I will get to.
It adds two new properties to the Nutch config file (solr.shardkey and solr.server.urls). solr.shardkey is the name of the field used to generate the hash code (when used against Solr 3.x it should be the uniqueKey field in the schema file, otherwise deletes will not work properly), and solr.server.urls is a comma-separated list of Solr core URLs or instance URLs.
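For illustration, the two properties might be set in nutch-site.xml along these lines (the property names come from the patch; the field name and URLs here are just hypothetical examples):

```xml
<!-- Hypothetical nutch-site.xml fragment; values are examples only -->
<property>
  <name>solr.shardkey</name>
  <!-- Field whose value is hashed to pick a shard; against Solr 3.x this
       should be the schema's uniqueKey field so deletes route correctly -->
  <value>id</value>
</property>
<property>
  <name>solr.server.urls</name>
  <!-- Comma-separated list of Solr core or instance URLs -->
  <value>http://solr1:8983/solr/core0,http://solr2:8983/solr/core0</value>
</property>
```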
The plugin takes the hash value modulo the number of URLs to figure out which core the document should go to. It also uses the rest of the Solr properties (commit size, etc.); the code is really the same. The idea behind having a solr.server.urls instead of just reusing solr.server.url was so that both plugins could be used simultaneously, which can also help when migrating from 3.x to 4.x, though I guess the same argument can be made for the other properties as well.
The code uses the String.hashCode function, which is really good enough in terms of evenly distributing docs across multiple cores (in our case, with about 85 million docs over 8 cores, the difference between the number of docs in each core is less than 5%), but changing the hash function, or even making it customizable as was suggested in NUTCH-945, is trivial.
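The shard-selection step described above boils down to a few lines. Here is a minimal, self-contained sketch (the class and method names are mine, not the patch's); it masks the sign bit so a negative hashCode cannot yield a negative shard index:

```java
// Minimal sketch of hash-based shard selection, assuming the shard key's
// String value is hashed and mapped onto the configured list of server URLs.
public class ShardSelector {

    // Returns the index of the server URL that should receive the document.
    // Masking with 0x7fffffff keeps the result non-negative even for
    // Integer.MIN_VALUE, where Math.abs() would stay negative.
    public static int selectShard(String shardKeyValue, int numServers) {
        return (shardKeyValue.hashCode() & 0x7fffffff) % numServers;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://solr1:8983/solr", "http://solr2:8983/solr",
            "http://solr3:8983/solr", "http://solr4:8983/solr"
        };
        // The same key always maps to the same core, which is what makes
        // deletes and re-indexes of a URL land on the shard that holds it.
        String key = "http://example.com/some/page";
        System.out.println(urls[selectShard(key, urls.length)]);
    }
}
```

Because the mapping is deterministic, a delete by uniqueKey can be routed to exactly the core that indexed the document, which is why the shard key should be the uniqueKey field on Solr 3.x.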
Turning the hashing mechanism off is also trivial (again, I did not know about this issue when I was writing this, otherwise I would have done it already). We could add another property such as solr.usehash; setting it to false would make the plugin post the documents to all the servers, which could also be quite useful.
As for using it against Solr 4.x, it can function as a load balancer; believe me when I say that watching 40 reduce jobs trying to write to a single Solr instance is rather horrifying.
The patch is against trunk, but porting it to 2.x is trivial (I actually think it can probably be applied as-is, but I have not tested that yet).