I'm using Solr to build a search service for my company. From an operations (and perhaps performance) point of view, we need to use Java to replicate the index.
At a very high level, my design is similar to what Noble mentioned here:
1) First, we have an active master, some standby masters, and search slaves. The active master handles crawling data and updating the index; the standby masters are redundant to the active master. If the active master goes away, one of the standbys becomes active. Standby masters replicate the index from the active master to act as backups; search slaves only replicate the index from the active master.
2) On the active master, there is an index snapshot manager. Whenever there is an update, it takes a snapshot. On Windows it uses copy (I should try fsutil), and on Linux it uses hard links. The snapshot manager also cleans up old snapshots. From time to time, I still get index corruption when committing an update. When that happens, the snapshot manager allows us to roll back to the previous good snapshot.
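To illustrate the hard-link approach on Linux, here is a minimal sketch of what the snapshot step could look like. The class and directory names are my own for illustration; the snapshot-name format (a millisecond timestamp suffix) is an assumption, chosen because my snapshot names encode a timestamp:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative hard-link snapshotting (names and layout are assumptions). */
public class SnapshotManager {

    /** Snapshot an index directory by hard-linking each file into a new
     *  timestamped directory. Hard links are O(1) per file and copy no data;
     *  Lucene index files are write-once, so the linked snapshot stays
     *  consistent even while the live index keeps committing. */
    public static Path takeSnapshot(Path indexDir, Path snapshotRoot) throws IOException {
        // The timestamp in the name lets replicas compare freshness by name alone.
        Path snapDir = snapshotRoot.resolve("snapshot." + System.currentTimeMillis());
        Files.createDirectories(snapDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    Files.createLink(snapDir.resolve(f.getFileName()), f);
                }
            }
        }
        return snapDir;
    }
}
```

Note that `Files.createLink` requires source and target to be on the same filesystem, which is why the snapshot root should live next to the index directory.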
3) On the active master, there is a replication server component that listens on a specific port. (The reason I did not use the HTTP port is that I do not use Solr as-is: I embed Solr in our application server, so going through HTTP would not be very efficient for us.) Each standby and slave has a replication client component. The following is the protocol between the replication client and the server:
a) the client pings a directory server for the location of the active master
b) the client connects to the active master on the specific port
c) handshake: right now this just checks version and authentication; in the future it will negotiate security, compression, etc.
d) the client sends a SNAPSHOT_OPEN command followed by the index name (the master can manage multiple indexes). The server replies INDEX_NOT_FOUND if the index does not exist, or OK followed by the name of the latest snapshot;
e) if the index is found, the client compares that snapshot's timestamp with its local snapshot's. The timestamp is derived from the snapshot name, because part of the name is an encoded timestamp. If the local copy is newer, the client tells the server to close the snapshot; otherwise, it asks the server for a list of files in the snapshot. On success, the server sends an OK op followed by a file list including filename, timestamp, etc.
f) the client creates a tmp directory and hard-links everything from its local index directory into it. Then, for each file in the file list: if it does not exist locally, fetch the new file from the server; if it is newer than the local one, ask the server for an rsync-like update; if a local file does not appear in the file list, delete it. When the compound file format is used for the index, the file update transfers only the differing blocks.
g) if everything goes well, the client tells the server to close the snapshot, renames the tmp directory to its proper place, creates a Solr core on the new index, warms up any caches if necessary, routes new requests to the new core, closes the old core, and removes the old index directory.
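The client side of steps b) through e) can be sketched roughly as below. This is not my actual implementation; the wire format (UTF-framed text commands), the command strings, and the snapshot-name format are all assumptions made for illustration:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.Socket;

/** Illustrative client side of the replication handshake (steps b-e). */
public class ReplicationClient {

    /** b) connect, c) handshake, d) SNAPSHOT_OPEN; returns the latest
     *  snapshot name on success. Wire format is a placeholder. */
    public static String openSnapshot(String host, int port, String indexName) throws IOException {
        try (Socket sock = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(sock.getOutputStream());
             DataInputStream in = new DataInputStream(sock.getInputStream())) {
            out.writeUTF("HELLO 1.0");                  // c) version + auth handshake
            if (!"OK".equals(in.readUTF())) {
                throw new IOException("handshake rejected");
            }
            out.writeUTF("SNAPSHOT_OPEN " + indexName); // d) open latest snapshot
            String resp = in.readUTF();
            if ("INDEX_NOT_FOUND".equals(resp)) {
                throw new FileNotFoundException(indexName);
            }
            return resp.substring("OK ".length());      // "OK <snapshotName>"
        }
    }

    /** e) Freshness comparison via the timestamp embedded in the snapshot
     *  name, e.g. "snapshot.1699999999999" (the format is an assumption). */
    public static long timestampOf(String snapshotName) {
        return Long.parseLong(snapshotName.substring(snapshotName.lastIndexOf('.') + 1));
    }
}
```

The client would then compare `timestampOf(remote)` against `timestampOf(local)` and either close the snapshot or request the file list.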
Right now a client replicates the index from the active master every 3 minutes, for a slowly changing datasource. This works fine, because creating a new Solr core and warming up the caches takes less than 3 minutes. We plan to use it for a fast-changing datasource, where creating a new core and dumping all the caches is not feasible. Any suggestions?