1. this assumes a DN goes down with the client (either in tandem, or on the same box) and that the NN initiates lease recovery later, correct?
Really, this applies any time that recovery is initiated after the node has come back to life. The most likely case is a hard lease expiry like you suggest above, since it gives a full hour for the DN to restart, but it could be a manually triggered recovery as well.
2. the idea here is that RBW should have lengths longer than RWR, but both will have the same genstamp?
yep, s/should/could/ though (in many cases, RWR will have the right length)
If so, why aren't we just taking the replica with the longest length? Is there a reason to
In a normal pipeline failure, it's likely that the earlier DNs in the pipeline will have longer lengths than the later ones, right? So if we always just took the longest length, we'd usually recover to a pipeline of length 1 even when other replicas are available that satisfy correct semantics. At least, I assume this is the reasoning. In this patch I was just trying to maintain the semantics used elsewhere.
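To make the trade-off concrete, here is a rough sketch of the two policies. It is not the patch itself: `ReplicaSelectionSketch`, `Replica`, `State`, and the method names are all hypothetical, and it assumes recovery prefers RBW replicas over RWR ones and then truncates to the shortest length among the preferred set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are the real HDFS classes.
class ReplicaSelectionSketch {
    // Declared from most to least trusted, so a lower ordinal is a better state.
    enum State { RBW, RWR }

    static final class Replica {
        final String datanode;
        final State state;
        final long length;

        Replica(String datanode, State state, long length) {
            this.datanode = datanode;
            this.state = state;
            this.length = length;
        }
    }

    // Keep only the replicas in the best state present (assumption: RBW beats RWR).
    static List<Replica> candidates(List<Replica> replicas) {
        State best = null;
        for (Replica r : replicas) {
            if (best == null || r.state.compareTo(best) < 0) {
                best = r.state;
            }
        }
        List<Replica> out = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.state == best) {
                out.add(r);
            }
        }
        return out;
    }

    // Policy A: shortest length among the candidates, so every candidate can stay.
    // Returns Long.MAX_VALUE for an empty list.
    static long recoveryLength(List<Replica> replicas) {
        long min = Long.MAX_VALUE;
        for (Replica r : candidates(replicas)) {
            min = Math.min(min, r.length);
        }
        return min;
    }

    // Policy B: the naive "just take the longest replica" alternative.
    static long longestLength(List<Replica> replicas) {
        long max = 0;
        for (Replica r : replicas) {
            max = Math.max(max, r.length);
        }
        return max;
    }

    // How many candidates hold at least `length` bytes and could join the recovered block.
    static int survivors(List<Replica> replicas, long length) {
        int n = 0;
        for (Replica r : candidates(replicas)) {
            if (r.length >= length) {
                n++;
            }
        }
        return n;
    }
}
```

With three RBW replicas of lengths 1000, 900, and 800, the longest-length policy recovers to 1000 and keeps one replica, while the shortest-among-candidates policy recovers to 800 and keeps all three. If one of the three is instead an RWR replica, it drops out of the candidate set whenever an RBW replica exists.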
3. if sync() did not complete, there is no violation. do I follow? I agree we can try to recover more data if it's there, but I just want to make sure I'm on the same page
The issue here is that sync() could complete, but the post-power-failure replica could still be truncated. Recall that hflush() doesn't actually fsync to disk, so after a real power failure of the local node, the node will usually come back with a truncated replica after ext3 journal replay.
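The same distinction exists in plain Java file I/O, which may make the failure mode easier to picture. This is only an analogy using the standard `java.io` API, not HDFS code; `FlushVsSyncSketch` and `writeTemp` are hypothetical names.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Analogy only: local file I/O, not the HDFS write path.
class FlushVsSyncSketch {
    // Writes data to a temp file and returns the length visible afterwards.
    // With sync == false the bytes are only handed to the OS; with sync == true
    // we also ask the OS to push them to the storage device.
    static long writeTemp(byte[] data, boolean sync) {
        try {
            File f = File.createTempFile("flush-vs-sync", ".bin");
            try (FileOutputStream out = new FileOutputStream(f)) {
                out.write(data);      // bytes go to the OS, possibly only its page cache
                out.flush();          // a no-op for FileOutputStream; shown for symmetry
                if (sync) {
                    out.getFD().sync(); // returns once the OS reports the data is on the device
                }
                return f.length();
            } finally {
                f.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both calls report the same length while the machine stays up. The difference only shows after a power loss, when the unsynced bytes may be missing from the file on restart.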