Yonik Seeley added a comment - 23/May/08 07:17 PM
How about posting a snapshot of what you have, with a few paragraphs explaining how things work, etc. Early feedback is better, and it allows more people to add their expertise. I'm sure many are interested in the ease-of-use gains this patch can bring.
We shall post a patch in the next few days
The design is as follows:
ReplicationHandler implements the following methods. Every method is invoked over HTTP GET. These methods are usually triggered from the slave (over HTTP) or by a timer (for snappull). The admin interface can provide a means to invoke some methods, such as snappull and snapshoot.
**note: The download tries to use the same stream to download the complete file.
Please comment on the design.
I think the above sounds more or less right (read it quickly).
Thinking about the Admin display of replication information:
I imagine those wanting Enterprise Solr will desire this type of stuff, so even if we don't have any of this in the UI at this point, it might be good keeping this in mind and providing the necessary hooks, callbacks, etc.
Otis: All the points you have enumerated are valid. We actually think they should be there in the final solution.
All the admin-related changes are planned exactly as you have asked. But we can leave the hooks open and push through with the basic stuff. The design documentation just tries to cover everything which the scripts currently cover.
There are problems with index replacement on Windows.
Windows does not allow us to delete the index folder because it is in use. How do we solve this?
This looks like an extremely useful addition. More comments when the patch is available, but an initial observation:
<str name="pollInterVal">HH:MM:SS</str> For consistency, could this be specified cron-style instead? e.g. <str name="pollInterVal">*/30 * * * *</str>
I think hh:MM:ss is universally recognizable and very intuitive. We should also keep in mind that this solution will be used on multiple platforms, and an OS like Windows does not have cron, so its administrators may not be familiar with the cron format.
Shalin,
I'm assuming that pollInterVal is intended to specify the frequency of replication. hh:MM:ss is universally recognizable for specifying a single time, certainly. But how do you represent "every hour", or "four times a day", or "the first tuesday of each month" with that notation? Certainly Windows doesn't have cron but we're talking about a pure java implementation, so that's not a problem. Quartz might be a perfect solution for scheduling: http://www.opensymphony.com/quartz/
Does anybody really do things like "first tuesday of each month" for polling the Solr master? The slave's poll is usually set to run every few minutes. At least that's how we use it in our production environments. Quartz is nice but the thing is that we don't need all those features. A timer task is good enough for our needs.
Yes, hh:MM:ss represents time but it isn't difficult to view it as a countdown timer. It's definitely easier to understand than specifying number of seconds/minutes as an integer for a poll interval. What do you think?
For polling at a simple interval this syntax may be enough. Polling is not a very expensive operation. It just sends a request and gets the latest snapshot name, so we can schedule it to run even every minute.
If there is a need for such complex scheduling we can consider that syntax. A possible solution to the windows replication problem would be.
Re poll interval: I think the HH:MM:ss is enough. Does that allow polling, say, every 72 hours? Just use 72:00:00, right?
Re Winblows problem: I'd like the switch to the current/latest snapshot, but this prevents us from always knowing the location of the active directory. We'd have to rely on sorting the dir with snapshot names and assuming the currently active index is the one with the most recent snapshot, no? Symlinks would be great here, but again, Winblows doesn't have them (and I think using shortcuts for this wouldn't work).
Correct, 72:00:00 will work.
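A minimal sketch of how an HH:MM:SS value such as 72:00:00 could be turned into a timer period. The class and method names are hypothetical and are not taken from the patch; the only assumption is that the hours field is unbounded while minutes and seconds stay within 00-59.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of the patch under discussion.
final class PollIntervalParser {
    // Hours may exceed 23 (e.g. 72:00:00); minutes and seconds are limited to 00-59.
    private static final Pattern INTERVAL =
            Pattern.compile("(\\d+):([0-5]\\d):([0-5]\\d)");

    private PollIntervalParser() {}

    /** Converts an HH:MM:SS countdown value into milliseconds. */
    static long toMillis(String interval) {
        Matcher m = INTERVAL.matcher(interval.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected HH:MM:SS but got: " + interval);
        }
        long hours = Long.parseLong(m.group(1));
        long minutes = Long.parseLong(m.group(2));
        long seconds = Long.parseLong(m.group(3));
        return ((hours * 60 + minutes) * 60 + seconds) * 1000L;
    }
}
```

The result could be passed as the period of a java.util.Timer task, which matches the "timer task is good enough" position taken earlier in the thread.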
As Noble suggested, once the new searcher is in use and the older one is closed, hopefully Windows will kindly grant us permission to delete the files in the index directory. We can then create links to the files in the snapshot being used into the index directory. The latest snapshot directory will be the active one but we'll know what index is being used through the links in the index folder.
Windows has symlinks/hardlinks. "fsutil hardlink create" is the command. It works well as long as your Windows version is newer than Win2K.
Is there a reason why IndexDeletionPolicy is not being used? It allows keeping snapshot files available for replication without creating a specific snapshot directory. This would be cleaner than creating an external process.
The strategy of keeping the index directory name hard coded is a bit tricky. We need to do a lot of File System specific jugglery. The best strategy would be.
This way we never need to make hardlinks etc .
Right, that's what I suggested in the initial email thread. Just checked: Lucene's IndexCommit.getFileNames() returns all the files associated with a particular commit point.
Yonik: This would be very useful in optimizing the file transfers. We must incorporate this if possible.
BTW, what do you recommend for index deletion on Windows? Is the solution I proposed fine?
The first cut. Very crude, but it works. No OS-specific commands.
The design for snapshoot , snappull is same as described in the design overview snapinstall is done the following way
In the Slave
A library like "Quartz" (http://www.opensymphony.com/quartz/
I have not looked at the code but it is possible to do snapshots/snappull at a certain time (e.g. every day 1am)? Quartz would give you the possibility to do that as well. Quartz would even provide scenarios like every 1st monday of the month. Thomas: This is something that can be considered. But , I am still not very convinced that people use it that way. It is going to introduce some dependency on a new library. Let us see if there is enough demand from users
Note: All the operations can be triggered using HTTP GET, so a wget from cron can do the trick in the current form.
The basic functionality is much more important of course (and much harder to do).
OK. Then ignore my comment. This patch includes
Please don't ignore Thomas's repeat suggestion to use Quartz!
Having replication built-in but then having to use an external cron job to trigger the operations seems suboptimal to me. Being able to configure everything related to replication within the solr deployment seems far more elegant.
This feature is far from complete. Enhanced admin features are probably the next priority.
If scheduling is indeed important we will take it up. Meanwhile we need to ensure that the solution is usable and bug free.
agreed ... the challenge here is (efficient) pure java equivalents of snapshooter/snappuller/snapinstaller ... the scheduling mechanism is largely orthogonal, particularly since Paul is using a "ReplicationHandler" as the main API. it could easily be dealt with later (or in parallel if anyone wants to take on the task)
i don't think the ReplicationHandler should know anything about scheduling or recurrence. A generic Scheduling system could be hooked into SolrCore that can hit arbitrary RequestHandlers according to whatever configuration it has (similar to the QuerySendEventListener) which would handle this case, as well as other interesting use cases (ie: rebuild a spelling dictionary using an external datasource every hour, even if the index hasn't changed)
hoss: currently the timer task itself is a part of SnapPuller.java. I endorse your idea of having a scheduling feature built into SolrCore if it is useful to more than one component. As you mentioned, every operation is triggered by the ReplicationHandler's REST API. So if another service can give a callback at the right time, that is the best solution.
Yes, this indeed is the challenge. I would like people to look into the implementation and comment on how these operations can be made more efficient. I am already thinking of caching the file checksums because more than one slave requests the same files. The other important item that needs review is the changes made to SolrCore.getNewIndexDir().
Should we plan this feature for the Solr 1.3 release? If yes, which items are still pending?
I'd certainly like to see this in 1.3, it would make my life easier!
I'm trying out the code now and hope to feedback in depth soon. Meanwhile some initial comments: there's inconsistency between 4 space and 2 space tabs in the code, and a few System.out.println that you probably want to remove or replace with proper logging. The next step is to replicate files in conf folder.
The strategy is as follows:
can we make
I think so. You already started doing that with your comment from 04/Jun.
2 quick thoughts:
It's easy to implement with a wildcard. But very few files need to be replicated. Isn't it better to explicitly mention the names so that no file accidentally gets replicated?
Yes, timestamps, the same format used by the snapshots
This is a good idea. It must be a feature of replication. Old conf files as well as indexes should be purged periodically.
Attaching deletion_policy.patch
This exports a SolrDeletionPolicy via UpdateHandler.getDeletionPolicy() It can be used to get the latest SolrIndexCommit, which lists the files that are part of the commit, and can be used to reserve/lease the commit point for a certain amount of time. This could be used to enable replication directly out of the index directory and avoid copying on systems like Windows. Each SolrIndexCommit has an id, which can be used by a client as a correlation id. Since a single file can be part of multiple commit points, a replication client should specify what commit point it is copying. The server can then look up that commit point and extend the lease. This can be used for a very optimized index copy. I shall incorporate this also in the next patch. A few points stand out
I'm using Solr to build a search service for my company. From operation or maybe performance point view, we need to use java to replicate index.
From a very high level, my design is similar to what Noble mentioned here. It is like this:
1) First we have an active master, some standby masters and search slaves. The active master handles crawling data and updating the index; standby masters are redundant to the active master. If the active master goes away, one of the standbys will become active. Standby masters replicate the index from the active master to act as backup; search slaves only replicate the index from the active master.
2) On the active master, there is an index snapshot manager. Whenever there's an update, it takes a snapshot. On Windows, it uses copy (I should try fsutil) and on Linux it uses hard links. The snapshot manager also cleans up old snapshots. From time to time, I still get index corruption when committing an update. When that happens, the snapshot manager allows us to roll back to the previous good snapshot.
3) On the active master, there is a replication server component which listens on a specific port (the reason I did not use the HTTP port is that I do not use Solr as it is. I embed Solr in our application server, so going through HTTP would not be very efficient for us). Each standby and slave has a replication client component. The following is the protocol between the replication client and server:
Right now a client replicates the index from the active master every 3 minutes for a slowly changing datasource. It works fine because creating a new solr-core and warming up the cache take less than 3 minutes. We plan to use it for a fast-changing datasource, so creating a new solr-core and dumping all the caches is not feasible. Any suggestions?
bq: First we have an active master, some standby masters and search slaves
This looks like a good approach. In the current design I must allow users to specify multiple 'masterUrl' entries. This must take care of one or more standby masters. It can automatically fall back to another master if one fails.
How can I know if the index got corrupted? if I can know it the best way to implement that would be to add a command to ReplicationHandler to rollback to latest .
Plain socket communication is more work than relying on the simple HTTP protocol. The little extra efficiency you may achieve may not justify that (HTTP is not too slow either). In this case the servlet container provides you with sockets, threads, etc. Take a look at the current patch to see how efficiently it is done.
The current implementation is more or less like what you have done. For a compound file I am not sure if a diff based sync can be more efficient. Because it is hard to get the similar blocks in the file. I rely on checksums of whole file. If there is an efficient mechanism to obtain identical blocks, share the code I can incorporate that This patch relies on the IndexDeletionPolicy to identify files to be replicated. It also supports replication of conf files. No need to register any listeners/ QueryResponseWriters
The configuration is as follows.
On master, solrconfig.xml:
<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <lst name="master">
    <!--Replicate on 'optimize' it can also be 'commit' -->
    <str name="replicateAfter">commit</str>
    <!--Config files to be replicated-->
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
  </lst>
</requestHandler>
On slave, solrconfig.xml:
<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <lst name="slave">
    <str name="masterUrl">http://localhost:port/solr/corename/replication</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>
The replication strategy is changed as follows:
I love how easy this is to set up!
A couple of issues I noticed while testing:
What happens when the slave is replicating an index, and some of the files become missing on the master? Seems like the slave should simply abandon the current replication effort. Next time the master is polled, the new index version will be discovered and the process can start again as normal. What happens if replication takes a really long time? I assume that no new replications will be kicked off until the current one has finished? Thanks.
I guess Lucene must be cleaning it up because that is what the deletion policy says.
Good point. Will incorporate that.
Because the file names are unique, it did not matter if I used the index version (or does it?). Please clarify.
Yeah, it does that. If all the files are not copied completely, it aborts.
You are right. When the replication process starts, a lock is acquired. The lock is released only after the process completes.
OK, I found the bug that caused this one... Oh... and more internal code comments would be welcome (I don't know if it's practical to add them after the fact... I find myself adding them for my own notes/thoughts as I develop).
Good catch, but it is not obvious that the refCount was incremented. Should we not have a method to return the searcher without incrementing the refcount? Something like SolrCore#getSearcherNoIncRef(). Anyone who is not using the IndexSearcher for searching will need that.
Yes, file names are unique since Lucene doesn't change existing files once they are written. But, if I completely delete an index and start again, the same file name would be reused with different contents (and a different timestamp). But that's not the point I was trying to make...
Got the point. I assumed that the It is hard to drive it from the replication handler. The lease can be extended only when we get the onInit() onCommit() callback on SolrIndexDeletionPolicy. We can't reliably expect it to happen during the time of downloading I read the patch quickly. I noticed a small typo in SnapPuller.DFAULT_CHUNK_SIZE (should be DEFAULT).
I like the idea of configuration files replication (yeah, no more scp schema.xml everywhere). I usually replicate on optimize only but I wonder if people use the current ability to replicate on commit and on optimize. It doesn't seem to be possible with your current patch. Anyway, really nice work.
Technically it is possible: just add two entries for replicateAfter. The code is not handling it because NamedList did not have a getAll() method at that time. The next patch will take care of it.
Nice.
I'm thinking of a use case: if you have a lot of synonym/stopwords dictionaries for different languages and field types, it might be a bit awkward to specify each file. A synonyms_*.txt, stopwords_*.txt would be welcome. Furthermore, I wonder if we shouldn't explicitly disable the replication of solrconfig.xml. Any opinion?
Replication can be disabled by not registering the handler in solrconfig.xml, and an HTTP API call should be added to disable replication on master/slave.
Could you guys pull out all the changes to MultiCore, CoreDescriptor, SolrCore, etc (everything not related to replication) into a separate patch. I think that will help things get committed. Ryan also has a need to get the MultiCore and I think perhaps a getMultiCore() should just be added to the CoreDescriptor.
Sure Yonik, we shall separate the changes in core classes into separate issues.
Actually there are a handful of other HTTP methods which can be invoked over HTTP. These can be used to control the feature from admin interface
Just the changes required to the core
new patch that takes care of the refcount. this is a complete patch
I just committed
This patch is to
I haven't had a chance to check out the latest patch, but it sounds like "SolrCore.close() is done in a refcounted way" is a generic multi-core change that is potentially sticky enough that it deserves its own JIRA issue.
Yes, we may need another issue to track it. Directly calling Solr.close() can cause exceptions on in-flight requests.
The core reload functionality has to close the old core.
Comment about the solrconfig entry for replication on the master:
<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <lst name="master">
    <!--Replicate on 'optimize' it can also be 'commit' -->
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
  </lst>
</requestHandler>
Reading the above makes one think that it is the master that does the actual replication. In fact, the master only creates a snapshot of the index and other files after either commit or optimize. It is the slaves that copy the snapshots. So while we refer to the whole process as replication, I think the configuration elements' names should reflect the actual actions to ease understanding and avoid confusion. Concretely, I think "replicateAfter" should be called "snapshootAfter" or some such.
+1 for Hoss' suggestion to decouple scheduling from the handler that can replicate/copy on-demand
The new replication does not create snapshots for replication. The replication is done from/to a live index. Hence the change in name.
Full patch with:
1. Support for reserving commit point. Configurable with commitReserveDuration configuration in ReplicationHandler section.
<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
    <str name="commitReserveDuration">01:00:00</str>
  </lst>
</requestHandler>
2. Admin page for displaying replication details.
Why are files downloaded to a temp directory first? Since all index files are versioned, would it make sense to copy directly into the index dir (provided you copy segments_n last)?
If Solr crashes while downloading, that will leave unnecessary/incomplete files in the index directory. We did not want the index directory to be polluted. The files are 'moved' to the index directory after they are downloaded. The segments_n file is moved last from the temp directory to the index directory.
If we don't want to try and pick up from where we left off, it seems like Lucene's deletion policy can clean up old index files that are unreferenced.
If the files are not part of any IndexCommit (this is true if the segments_n file didn't get downloaded), will it still clean them up? And when Solr restarts, ReplicationHandler will have difficulty cleaning up those files if replication kicks off before Lucene cleans them up (if it actually does that).
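The ordering described above (download to a temp directory, move the data files, move segments_n last) can be sketched as follows. This is an illustration under assumptions, not the patch's code: the class and method names are hypothetical, and the sketch assumes that a file whose name starts with "segments_" is the commit file that makes the new index visible.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: moves downloaded files into the index directory,
// leaving the segments_N file until every other file is in place.
final class IndexInstaller {
    private IndexInstaller() {}

    private static boolean isSegmentsFile(Path p) {
        return p.getFileName().toString().startsWith("segments_");
    }

    /** Returns the names of the moved files, in the order they were moved. */
    static List<String> install(Path tmpDir, Path indexDir) throws IOException {
        List<Path> downloaded;
        try (Stream<Path> files = Files.list(tmpDir)) {
            downloaded = files.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<String> moved = new ArrayList<>();
        // First pass: data files. A crash here leaves no new commit visible.
        for (Path p : downloaded) {
            if (!isSegmentsFile(p)) {
                Files.move(p, indexDir.resolve(p.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                moved.add(p.getFileName().toString());
            }
        }
        // Second pass: the segments_N file goes last.
        for (Path p : downloaded) {
            if (isSegmentsFile(p)) {
                Files.move(p, indexDir.resolve(p.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                moved.add(p.getFileName().toString());
            }
        }
        return moved;
    }
}
```

If the process dies during the first pass, the index directory holds some extra unreferenced files but no commit that points at a partial file set, which is the property the two-step move is meant to preserve.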
Patch with following changes:
Thanks Akshay.
On a first glance, this is looking really good. I am planning to commit this in a few days. We can take up the enhancements or bug fixes through new issues. Updated patch with a couple of bug fixes related to closing connections and refcounted index searcher. Other cosmetic changes include code formatting and javadocs.
Noble has put up a wiki page at http://wiki.apache.org/solr/SolrReplication Patch with minor fixes related to the admin page.
Another iteration over Akshay's patch.
Again a minor fix in replication admin page
Committed revision 706565.
Thanks Noble, Yonik and Akshay! Snappuller should use getNewestSearcher() rather than getSearcher() to avoid pulling the same snapshot more than once if warming takes a long time.
I didn't catch earlier how reservations were done: currently, the commit point is reserved for a certain time when the file list is initially fetched. This requires that the user estimate how long a snap pull will last, and if they get it wrong things will fail. On the other side, setting the time high requires more free disk space.
It seems like renewing a lease (a short term reservation) whenever an access is done would solve both of these problems (and is what I initially had in mind). All requests should indicate what commit point is being copied so that the lease can be extended. Files are downloaded in one HTTP request... the response is read and written one chunk at a time. Has anyone tested this with large files (say 5G or more) to ensure that:
The first 3 go through servlet container code and thus should probably be tested with tomcat, jetty, and resin.
The servlet container usually has a small chunk size by default (~8KB in Tomcat). It keeps flushing the stream after that size is crossed.
This is a good idea. But when the index is large it tends to have one very large file and a few other smaller files. It is that very large file that takes a lot of time (in our case a 6GB file across data centers took around 2 hrs). So we may also need to call reserve while the download is going on.
Thanks for going through this Yonik.
The SnapPuller calls commit with waitSearcher=true, so the call will wait for the searcher to get registered and warmed. The reentrant lock in SnapPuller will be released only after the commit call returns. So it should be OK, right?
Since files are transferred in one go, the master knows about the access time but it does not know if the transfer has ended so the lease may expire in between the transfer leading to a failure. We'll need to track the transfers individually as well. If the slave dies in between the transfer, we'll need to track that as well and time-out the lease appropriately. If I compare the state of things to the old way of replication, not sure if this feature is worth the effort. What do you think?
We have been testing it with a large index (wikipedia articles, around 7-8GB index on disk) with Tomcat across networks (transfer rate between servers is around 700-800 KB/sec). We haven't seen any problem yet. We'll continue to test this with Tomcat and other containers and report performance numbers and problems, if any.
A commit could come from somewhere else though, or we could be starting up and no searcher is yet registered. It's always safe (and clearer) to just use the newest reader opened, right? Is there a reason that SnapPuller waits for the new searcher to be registered?
Right, as Noble pointed out, lease extension will need to be done periodically during the download (every N blocks written to the socket).
Each file request can optionally specify the commit point it is copying.
The lease is just the current reservation mechanism, but called more often and with a very short reservation (on the order of seconds, not minutes I would think), so I don't see a need to time them out.
Cool. Hopefully one of the test indexes contains a single file greater than 4G to test that we don't hit any 32 bit overflow in the stack. If not, re-doing your wikipedia test with compound index format and after an optimize should do the trick.
Yes, one of the files in the index is of size 6.3G, created on optimize.
The patch contains changes so that the reserve is set for 10 secs by default after every 5 packets (5 MB) are written.
The commitReserveDuration is now supposed to be a small value (default is 10 secs). If the network is particularly slow, the user can tweak it to set a bigger number. Every command for fetching file content has an extra attribute, indexversion, so that the master now knows which IndexCommit is being downloaded.
Thanks Noble, reviewing now...
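A rough sketch of the renewal scheme discussed here: the sender extends a short reservation on the commit point every N packets while streaming a file. Everything in it is an assumption made for illustration; the LeaseRenewer interface, the class name and the parameters are hypothetical and do not mirror the patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of "renew the lease every N packets" during a file transfer.
final class LeasedFileSender {
    /** Callback that extends the reservation for the commit point being copied. */
    interface LeaseRenewer {
        void renew(long indexVersion, long reserveMillis);
    }

    private LeasedFileSender() {}

    /** Streams the input to the output and returns the number of packets written. */
    static long send(InputStream in, OutputStream out, long indexVersion,
                     int packetSize, int packetsPerRenewal, long reserveMillis,
                     LeaseRenewer renewer) throws IOException {
        byte[] buf = new byte[packetSize];
        long packetsWritten = 0;
        // Reserve once before the first byte so the commit cannot vanish immediately.
        renewer.renew(indexVersion, reserveMillis);
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            packetsWritten++; // must be incremented, or the renewal check never fires
            if (packetsWritten % packetsPerRenewal == 0) {
                renewer.renew(indexVersion, reserveMillis);
            }
        }
        out.flush();
        return packetsWritten;
    }
}
```

Because the reservation is short and is refreshed only while bytes are flowing, a slave that dies mid-transfer simply stops renewing and the lease lapses on its own, so no separate time-out tracking is needed.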
Committed with 2 changes:
Silly me. The packetsWritten variable was not incremented.
Attaching some little thread safety fixes (mostly adding volatile to values modified and read from different threads).
Updated the fixes patch with more thread safety fixes.
Q: what is ReplicationHandler.getIndexVersion() supposed to return, and why? It currently returns the version of the visible index (registered). Should it be the most recent version of the index we have? Any reason it isn't using ReplicationHandler.indexCommitPoint? Also, I think we should all work at adding more comments to code as it is written. Lack of comments made this patch harder to review. I think there's an issue with SnapShooter in that it never does any reservations for the commit point it's trying to copy.
Here's an update to the "fixes" patch that fixes an issue with setReserveDuration when called with different reserveTimes. Previously, the new value overwrote the old, regardless of its value. The approach to fix is a basic spin loop (see below). Anyone see issues with this approach?
public void setReserveDuration(Long indexVersion, long reserveTime) {
  long timeToSet = System.currentTimeMillis() + reserveTime;
  for(;;) {
    Long previousTime = reserves.put(indexVersion, timeToSet);

    // this is the common success case: the older time didn't exist, or
    // came before the new time.
    if (previousTime == null || previousTime <= timeToSet) break;

    // At this point, we overwrote a longer reservation, so we want to restore the older one.
    // the problem is that an even longer reservation may come in concurrently
    // and we don't want to overwrite that one too. We simply keep retrying in a loop
    // with the maximum time value we have seen.
    timeToSet = previousTime;
  }
}
I think this is also a great example of where comments explaining how things work are really needed.
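For comparison only, the same "keep the latest expiry" rule can be written without a retry loop on Java 8 or later, where ConcurrentHashMap.merge applies the remapping function atomically per key. This is a sketch under that assumption; that API did not exist when the patch was written, and the wrapper class and the reservedUntil accessor are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the reservation bookkeeping; requires Java 8+.
final class CommitReservations {
    private final ConcurrentMap<Long, Long> reserves = new ConcurrentHashMap<>();

    /** Reserves the commit until now + reserveTime, never shortening an existing reservation. */
    void setReserveDuration(Long indexVersion, long reserveTime) {
        long timeToSet = System.currentTimeMillis() + reserveTime;
        // merge() stores timeToSet if the key is absent, otherwise the larger of old and new.
        reserves.merge(indexVersion, timeToSet, Long::max);
    }

    /** Returns the expiry timestamp in milliseconds, or null if nothing is reserved. */
    Long reservedUntil(Long indexVersion) {
        return reserves.get(indexVersion);
    }
}
```

The spin loop in the thread reaches the same end state; the merge form just moves the compare-and-retry into the map implementation.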
This is the method called by the slaves. they must only see the current "replicatable" index version. For instance, if the 'replicateAfter' is set to 'optimize' then the slave should not see the index version that is a commit. The getDetails() ( command=details) method gives the actual current index version
Right, SnapShooter has to reserve. The new setReserveDuration() looks right. The SnapShooter is not written right (thread safety). Soon after you commit the patch, I can give a patch. After I fix it I can update the wiki with proper documentation.
Snapshoot is not a very important feature in the current scheme of things. It is useful only if somebody wants to do periodic backups. Should we try an OS-specific copy?
Yonik, if you can commit this patch I can give a patch with comments. The code badly needs some comments.
Hi, i have a couple comments about the implementation, specifically SnapShooter.java just pulled from TRUNK:
-------------------------------------
lockFile = new File(snapDir, directoryName + ".lock");
... <1> ...
lockFile.createNewFile();
... <2> ...
if (lockFile != null) {
  lockFile.delete();
}
AFAIK, java.nio.channels.FileLock should be used for any type of file-based locking of this sort for cross-VM synchronization. If you are worried about in-VM synchronization, it might be best to just use j.u.c Locks or synchronized{} blocks. This would remove the possibility of junk .lock files if, say, the VM dies during <2>.
-------------------------------------
fis = new FileInputStream(file);
Am i crazy or are these real problems?
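A minimal sketch of the FileLock suggestion, assuming the goal is only to keep two processes from snapshooting into the same directory at once. The class and method names are hypothetical. Note that tryLock() throws OverlappingFileLockException if the same JVM already holds the lock, so in-VM callers would still need a j.u.c lock or a synchronized block in front of it.

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: runs the work only if an OS-level lock on lockFile is obtained.
final class SnapshotLock {
    private SnapshotLock() {}

    /** Returns true if the lock was acquired and the work ran, false if another process holds it. */
    static boolean runExclusively(File lockFile, Runnable work) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile.toPath(),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false; // held by another process
            }
            try {
                work.run();
                return true;
            } finally {
                lock.release();
            }
        }
    }
}
```

The lock belongs to the open channel, so if the VM dies the operating system releases it; a leftover lock file on disk no longer blocks the next run.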
Right, as Noble & I noted, there are still known problems with SnapShooter. Luckily, it's not necessary in the current replication scheme which no longer relies on snapshots. Gotcha, i will focus efforts elsewhere then
We need to clean up the SnapShooter. It was given low priority because snapshoot is not at all necessary in the new replication implementation. It is only useful for periodic backups.
I wonder if it might be useful to add copy throttle support to the replication. See SOLR-849 and the referenced email thread.
I am not a huge fan of PollInterval. It would be great to add an option to get the Index based on exact time: PollTime="*/15 * * * *" That would run at every 15 minutes based on the clock. i.e. 1:00pm, 1:15pm, 1:30pm, 1:45pm, etc. All my slaves are sync'd using NTP, so this would work better. Since each slave starts differently, we cannot set the PollInterval="00:15:00" since they would get different indexes based on when they start. The other option would be to suspend polling - and start - which would be very manual I guess. Setting the PollInterval to 10 seconds would be getting a new index when the old one is still warming up. Even 10 seconds interval would not be good, since we get so many updates, each server would have different indexes. With Snap we don't have this issue.
We get SOLR updates frequently and since they are large we cannot wait to do a commit at the 15 minute mark using cron. Optimize just takes too long. On our system we need to limit how often the slaves get the new index. We would like all slaves to get the index at the same time.
Bill
The default pollInterval can behave the way you want (so that the fetches are synchronized in time by the clock). Raise a separate issue and we can fix it.
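One way the clock-synchronized behaviour could work is to delay the first poll until the next multiple of the interval and then poll at a fixed rate. The sketch below only shows the delay arithmetic; the class and method names are hypothetical, and it assumes that boundaries are counted from the Unix epoch and that the slaves' clocks agree (for example via NTP, as mentioned above).

```java
// Hypothetical helper: computes how long to wait so that polls land on
// multiples of the interval (e.g. :00, :15, :30, :45 for a 15 minute interval).
final class AlignedSchedule {
    private AlignedSchedule() {}

    /** Milliseconds from nowMillis until the next interval boundary (0 if already on one). */
    static long initialDelayMillis(long nowMillis, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long sinceBoundary = Math.floorMod(nowMillis, intervalMillis);
        return sinceBoundary == 0 ? 0 : intervalMillis - sinceBoundary;
    }
}
```

The value could be used as the initial delay of ScheduledExecutorService.scheduleAtFixedRate with the poll interval as the period, so every slave fires at the same wall-clock instants regardless of when it was started.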
change component from scripts to java