Ok I managed to figure out what is going on here, at least one thing that could be wrong.
Basically, when a follower starts up, it reads the latest snap file and sets the dataTree.lastProcessedZxid to be the Zxid specified by that file. This is not the right thing to do. When you call snapLog.restore (before you connect to the leader), you both deserialize the latest snap log, AND play through the transaction logs. While playing through the transaction logs, you find the actual last transaction you have recorded which could be less than the snapLog's zxid (will explain below). That zxid is returned by the call to restore, but we don't actually use it, you always use the Zxid from the chose snap file (see FileSnap.deserialize).
So, follower crashes at zxid N-5, starts back up, replays state to N-5, and connects to the leader.
When it connects to the leader, it gets the leader's zxid of N. In Learner.syncWithLeader, it sees that it will get a diff, and sets its lastProcessedZxid to N, then acks the leader and starts up. The leader queues up the transactions N-4...N and sends those to the follower, followed finally by the UPTODATE message.
So what does the Follower now see? Well it sees the transactions in order, PROPOSAL, then COMMIT, for all of N-4 to N, followed by the UPTODATE. However, the Follower processes the PROPOSAL and COMMIT packets in separate threads, so it pulls those messages off the wire fast, queues them to be processed by the Sync/Commit processor threads, and keeps going until it hits the UPTODATE message. At this point, it writes a snapshot with Zxid N. We have NO GUARANTEE that we have actually processed transactions N-4 to N, and as Vishal has seen, sometimes the thread that writes the snapshot gets executed before any or all of the commit processor threads have actually run.
But generally, this is ok, because you are also writing a transaction log of transactions N-4 to N, so you still process them, write them to your transaction log, and as long as you have processed all of them before your Follower goes down again when you recover they will be applied to the snapshot and you will be fine. HOWEVER, if you kill the Follower after it has written snapshot.N and before it has processed transactions N-4 to N and written them to its log, when you restore the Follower it will believe that it is at Zxid N, it won't ever see those transactions, and it will never delete those nodes.
I believe one solution might be, when you are restoring the database on startup to set the lastProcessedZxid to the zxid of the last transaction in your actual tx log, NOT the zxid of the snapshot file. However, that might make restoring from copied snapshots break, so I think we should consider other solution options. But setting the zxid of the follower to be the server's zxid before it processes all the missing transactions is definitely the wrong thing to do.