Thank you for the extensive comments from both Rick and Narayanan. I have a few supplementary comments to those from Narayanan.
>>Looks like you have addressed issue (1). I see in your comments above, that you are in agreement about how to address issue (2), but I don't see this reflected in the new spec itself. I'm getting the impression that the answer to (3) and (4) is that the first rev of replication won't handle these issues; instead, they will be addressed in a later rev. Is that right?
>I interpret it that a manual startup is planned for now. Is an auto startup on the cards?
Re 2: It says so below the table of NetworkServerControl commands, but I will make it clearer in the next version of the spec.
Re 3 and 4: That's correct; in the first rev, there will be no automatic restart of replication when one of the instances has failed. The DB owner will have to manually restart replication. A later improvement may automate this step; it is a good candidate for extending the functionality.
>>5) A heads-up about the user/password options on the new server commands. There has been some discussion about authenticating server shutdown operations and general agreement that the current situation is confusing. DERBY-2109 intends to add credentials to the server shutdown command. I think that the same api should be used to specify username and password for all of our server commands--whatever that api turns out to be.
>Thank you for this pointer. I guess taking the same lines as 2109 is the thing to do here.
I agree. There is no reason why authentication for replication should differ from other commands. The NetworkServerControl commands I wrote in the func spec show what information is needed. I will modify the next version of the func spec to state that authentication is needed, and should be performed in the same manner as other NetworkServerControl commands.
>>6) I think it would be clearer if the url option were called slaveurl. Do we need a symmetric masterurl option for the startslave command? How does the slave know that it is receiving records from the correct master? What happens if two masters try to replicate to the same slave?
>This would be an issue I guess because the slave would assume both to be legitimate unless we send the database name each time.
>But what would happen if both use the same database also.
>Can this be eliminated by having a handshake phase before the actual log transfer occurs. So if the same url is being used for a second handshake we would reject this unless this is a reconnect attempt after the master has
We should only allow one connection to a slave database. A handshake sounds like a good idea.
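To make the handshake idea concrete, here is a minimal sketch of the slave-side rule we are discussing: accept the first master, treat a repeated handshake from the same URL as a reconnect attempt, and refuse any other master. The class and method names are hypothetical, not part of the spec.

```java
import java.util.Objects;

// Hypothetical sketch: a slave database accepts at most one master,
// but allows the same master URL to reconnect (e.g. after a network glitch).
public class SlaveHandshake {
    private String connectedMasterUrl; // null until a master has connected

    // Returns true if the handshake is accepted, false if it is rejected.
    public synchronized boolean handshake(String masterUrl) {
        if (connectedMasterUrl == null) {
            connectedMasterUrl = masterUrl; // first master: accept
            return true;
        }
        // Same URL again: treat as a reconnect attempt and accept.
        // A second, different master is refused.
        return Objects.equals(connectedMasterUrl, masterUrl);
    }
}
```

A real implementation would of course also have to authenticate the reconnecting master, not just compare URLs.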
>>7) Is the startmaster command restricted to a server running on the same machine as the master database? Similarly, is the startslave command restricted to a server on the slave database machine? What about failover and stop?
I think the start and failover commands need to be restricted to the machine where the database resides, but this depends on the NetworkServerControl security. Again, this should be equal to the policy for other NetworkServerControl commands. See 12) for how to stop replication.
>>8) I am confused about the startslave command. Does this create a new database? If so, how are the credentials enforced in the case that credentials are stored in the database? If not, what happens if there is already a database by that name? Is the database destroyed and replaced after authentication?
Since this has not been implemented yet, the solution may have to change later. However, the current intention is that the first thing that happens on the slave is that it receives the database 'x' from the master. When 'x' has been received, the slave starts the boot process of 'x'. So, the slave does not create 'x', even though it did not exist on the slave when the startslave command was issued.
We will have to check that a database with the same name does not exist on the slave. Furthermore, we should probably ensure that the owner of 'x' is allowed to create a database on the slave. Can you think of any other permissions we should check for? Maybe an allowedToReplicate credential would be needed?
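As a sketch of the two checks just mentioned, something along these lines could run on the slave before it accepts the database shipment. Both the method name and the allowedToReplicate-style flag are assumptions for discussion, not existing Derby properties.

```java
import java.io.File;

// Hypothetical sketch: before a slave accepts a replicated database 'x',
// the name must not clash with an existing database, and the database
// owner must be allowed to create a database on this slave.
public class SlavePreChecks {
    public static boolean mayReceiveDatabase(String derbyHome, String dbName,
                                             boolean ownerMayCreate) {
        File dbDir = new File(derbyHome, dbName);
        if (dbDir.exists()) {
            return false; // a database (or directory) with that name exists
        }
        return ownerMayCreate; // e.g. an allowedToReplicate-style check
    }
}
```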
>>9) If you have stopped replication, can you resume it later on?
>If stopping replication means that we will not archive logs anymore, I guess this will not be possible. If the logs are still archived, we can transmit from the log after replication has been stopped, the slave can still redo from there, and replication can continue. That is, we should not call the SYSCS_UTIL.SYSCS_DISABLE_LOG_ARCHIVE_MODE system procedure after stopping replication. Guess the user should be able to decide this.
I am not sure about this. If a failover was performed, the answer is definitely 'no', because the replication method assumes that the physical layout of the two databases is identical. A failover will not preserve this exactly equal physical layout, since the failover process will undo uncommitted transactions. If replication was simply turned off, Narayanan's suggestion of resuming log shipment from some defined log record will probably work.
However, I think we have to be restrictive in the first version of the functionality. For now, I think the answer will be 'no', i.e., you have to restart replication by first deleting the database on the slave and then sending the entire database to the slave again. Resuming replication is a good candidate for extending the functionality later.
>>10) What is the sequence of these commands? Do you first issue a startmaster and then issue a startslave? What happens if the commands occur out of sequence? Similarly for
>Since the startslave starts a listener this should be done first before startmaster.
It is correct that the slave will be listening for the master and therefore must be started before replication can start. However, I see no reason why the connection attempts should not be retried every now and then until the slave is ready to accept the connection.
Hence, I don't think we need a defined sequence of commands. When the slave starts, it does nothing until a master connects to it (except write some messages to derby.log). When the master is started, it continues as normal (also writes some messages to derby.log) until it is able to get a connection to the slave.
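The retry behaviour I have in mind for the master could look roughly like this. This is only a sketch; the names and the retry/sleep parameters are made up for illustration.

```java
// Hypothetical sketch: the master keeps retrying the slave connection
// until it succeeds, so startmaster does not have to be issued after
// startslave -- there is no required ordering of the two commands.
public class MasterConnectLoop {
    public interface SlaveProbe { boolean tryConnect(); } // one attempt

    // Retries up to maxAttempts; returns the number of attempts used,
    // or -1 if the slave never became reachable (or we were interrupted).
    public static int connectWithRetry(SlaveProbe probe, int maxAttempts,
                                       long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.tryConnect()) {
                return attempt;
            }
            try {
                Thread.sleep(sleepMillis); // wait before the next attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```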
>>11) It would be nice to understand how we insulate replication from man-in-the-middle attacks--even if we don't implement these protections in this first version.
That is a good point. For example, it would be possible to use a signature: the slave could send a hashed username to the master, and the master could respond by sending the hashed password. It should not be possible to "unhash" the username/password. But I am no security expert, so input on this issue is appreciated. And you are right; this will not be handled in the first version.
>>12) What happens if someone tries to connect to an active slave? What happens if someone tries to shutdown an active slave without first stopping replication at the master's end?
If someone tries to connect to a db 'x' that has the slave role in derby instance 'i', the connection is refused. Note that the derby instance 'i' may manage other databases at the same time. Making a connection to these other databases is unaffected by replication.
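The per-database nature of this refusal can be sketched like so: the role is tracked per database within the instance, and only databases currently in the slave role refuse client connections. All names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one Derby instance may host several databases;
// only those in the SLAVE role refuse client connections, the rest
// are unaffected by replication.
public class InstanceConnectionPolicy {
    public enum Role { NORMAL, SLAVE }

    private final Map<String, Role> databases = new HashMap<>();

    public void addDatabase(String name, Role role) {
        databases.put(name, role);
    }

    // Returns true if a client connection to 'name' is allowed.
    public boolean mayConnect(String name) {
        Role role = databases.get(name);
        return role != null && role != Role.SLAVE;
    }
}
```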
>A connect attempt from the master would fail and the master would report that the connection has been terminated due to the slave not being able to be reached or that a slave could not be found. Would this case be different from trying to connect to a Derby NetworkServer when it has been shutdown?
The initial plan was to allow shutdown at both ends. Now that you mention it, however, stopping replication from the master seems cleaner. Hence, I think the revised plan should be as follows: stopping replication is performed by issuing the stopreplication command at the master. The master then sends a stop-replication message over the network connection to the slave.
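The revised stop flow can be sketched in a few lines: the master flips its own state and forwards a stop message to the slave over the existing link. The message name and interfaces are placeholders, not spec.

```java
// Hypothetical sketch of the revised stop protocol: stopreplication is
// issued at the master, which stops locally and then tells the slave to
// stop over the existing replication connection.
public class StopProtocol {
    public interface SlaveLink { void send(String message); }

    private boolean replicating = true;

    public boolean isReplicating() { return replicating; }

    // Master side: stop locally, then notify the slave.
    public void stopReplication(SlaveLink link) {
        if (replicating) {
            replicating = false;
            link.send("STOP_REPLICATION"); // hypothetical message name
        }
    }
}
```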
>>13) What happens if the slave is shut down and then, later on, someone tries to boot the slave as an embedded database?
That will be allowed. In this case, the database will boot to a transaction-consistent state that includes all transactions that were committed (and sent, of course) before the shutdown.