The current code prints to stdout. We have an RMI service with a ZK server embedded in it. We do this so that we can start and stop ZK across platforms without having to write platform-specific scripts. In this server, we start a thread that periodically calls PurgeTxnlog.purge(). As you pointed out, we should have a -q flag that directs output to the log instead of stdout, to satisfy both approaches. I will make that change.
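To illustrate the periodic purge thread described above, here is a minimal sketch, not our actual service code. The class name PeriodicPurge and the purgeTask parameter are made up for this example; purgeTask stands in for whatever wraps the real purge call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a purge action on a background daemon thread.
class PeriodicPurge {

    // purgeTask is a placeholder for the real purge call (an assumption,
    // not an existing ZooKeeper API).
    static ScheduledExecutorService start(Runnable purgeTask, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "purge-scheduler");
                t.setDaemon(true); // do not keep the JVM alive just for purging
                return t;
            });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                purgeTask.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs,
                // so report it and carry on.
                System.err.println("purge failed: " + e);
            }
        }, 0, period, unit);
        return scheduler;
    }
}
```

The caller keeps the returned scheduler and calls shutdownNow() on it when the embedded server stops.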
We chose number 2 here because we think having only one backup will be enough. It is not clear to us under what conditions the additional backup will be useful.
Backups are useful in the following scenarios (correct me if I am wrong):
1. The current ZooKeeper transaction log and/or snapshot is corrupted, but the past snapshots and transaction logs are ok. Corruption can mean either disk file corruption or corruption of transaction entries in the log. We store ZooKeeper data on mirrored disks.
2. The application itself made some errors that require reverting to an older version.
For the first point, having one additional backup would suffice. The second point is really tricky. I am not sure how the application can decide which snapshot to revert to. I think in most cases it will be trial and error. It is not clear to me how to estimate the number of backups needed. Also, it is not clear how one would go back in time. I looked at the LogFormatter utility, and it does not help much in undoing the erroneous transactions for case 2 above. In general, I think it is good to require users to keep a minimum of one backup.
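To make the "current snapshot plus N backups" policy concrete, here is a rough sketch of choosing which snapshots to delete. It assumes snapshot file names of the form snapshot.<zxid in hex>; SnapshotRetention, zxidOf and toDelete are names invented for this example and are not part of the purge utility.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: decides which snapshot files fall outside the
// retention window. It only works on names and touches no files.
class SnapshotRetention {

    // Assumes names like "snapshot.1f", where the suffix is a hex zxid.
    static long zxidOf(String name) {
        return Long.parseLong(name.substring(name.indexOf('.') + 1), 16);
    }

    // Returns the names to delete, keeping the 'keep' most recent snapshots.
    // keep = 2 corresponds to the current snapshot plus one backup.
    static List<String> toDelete(List<String> snapshotNames, int keep) {
        if (keep < 1) {
            throw new IllegalArgumentException("must keep at least one snapshot");
        }
        List<String> sorted = new ArrayList<>(snapshotNames);
        Comparator<String> oldestFirst = Comparator.comparingLong(SnapshotRetention::zxidOf);
        sorted.sort(oldestFirst.reversed()); // newest first
        if (sorted.size() <= keep) {
            return new ArrayList<>();
        }
        return new ArrayList<>(sorted.subList(keep, sorted.size()));
    }
}
```

Transaction logs would need similar treatment, with the extra care that a log older than the oldest retained snapshot may still contain transactions that snapshot depends on.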
Related question: Is there a hash on the log files (or internal tree structures) that can tell the ZooKeeper server whether the logs are corrupted? If yes, the ZooKeeper server can verify the hash during startup and take some action based on that. For example, it can make sure that it never becomes a leader until it gets the correct snapshot from the existing leader (otherwise it may end up corrupting other servers' logs). "Corrupted" here refers to the case where the file is readable, but one or more transactions in the log are bad.
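To show the kind of check I have in mind, here is a small sketch of per-record checksum verification using java.util.zip.Adler32. This is only an illustration of the idea under the assumption that each record is stored together with a checksum; RecordChecksum and its methods are invented names, and I am not claiming this is how the server handles its logs.

```java
import java.util.zip.Adler32;

// Hypothetical helper: detects a readable-but-corrupted record by
// comparing a stored checksum with one recomputed from the payload.
class RecordChecksum {

    // Checksum computed when the record is written.
    static long checksum(byte[] payload) {
        Adler32 adler = new Adler32();
        adler.update(payload, 0, payload.length);
        return adler.getValue();
    }

    // At startup, recompute and compare; a mismatch means the record
    // should not be trusted.
    static boolean isIntact(byte[] payload, long storedChecksum) {
        return checksum(payload) == storedChecksum;
    }
}
```

A server that finds a mismatch during startup could then refuse to take part in leader election until it has synced from a healthy peer.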
I am not sure if there is a test for this. If I remember correctly, there is a bug that causes the purge() function to leave behind one additional log file. Please refer to my question above about findNRecentSnapshots(). I can add a test or modify the purge utility once we have concluded this discussion.