Trying to catch up...
I think it's fine to use INode in the lease map. However, in the meanwhile, I'm not sure if this will increase the possibility of memory leak due to failing to update the lease map (maybe because of buggy code).
I agree-- we might get some really nasty bugs by having a bunch of references to inodes that somehow are "really" deleted here. It would only require a minor mistake in the code. The memory savings doesn't seem worth it (and, of course, we're already reducing memory consumption by switching away from strings)
Yes, but minor bugs anywhere can have devastating results. The block manager doesn't make attempts to add indirection just in case there's a future bug. Even if there was a leak, lease expiration will eventually clean it up?
My thinking is that even though the lease manager is buggy, NN should continue to function when replay edit logs / saving namespaces instead of crashing. Though not ideal, but it allows NN to tolerate the bugs and to treat them as lease recoveries. [...] I'm updating my patch to print out warnings to help developers to catch these bugs.
I can agree to saving the namespace being tolerant. We would have been in a world of hurt after the recent edit log corruption bugs if we couldn't save the namespace on the active.
I'm not comfortable with being tolerant of buggy edits. If the NN cannot re-do what it thinks it did, that's pretty serious, and should continue to be a stop the world event. The only reason we noticed the recent edit log corruptions is because the standby died. Sadly, nobody is going to look for or notice log messages until a major service interruption has occurred.
The idea that the NN should be tolerant of its own bugs will inevitably lead to namespace and/or data corruption...
Jing Zhao's made a good point about how we don't want to remove leases for open files just because we deleted an under-construction inode inside a snapshot.
Perhaps I'm understanding. What is the use case for allowing a client to continue writing to a deleted file and have the results reflected in a snapshot? The lease is effectively for the mutable/non-snapshot path. If it's deleted, why wouldn't we want to revoke the lease? It's already "bad" enough that snapshotting a UC file does not record the visible length so the file in the snapshot continue to mutate, but continuing to modify a deleted file in a snapshot?
Imagine that a client gets SIGSTOP, and then its hard lease on its open-for-write file expires and is removed. The client can be resumed later, come back and then try to close. So not being able to find a lease during a close is certainly worthy of a nasty log message, but I don't think we can call it a bug.
I thought the close fails if the lease is expired? If yes, my question is whether replaying edit logs should tolerate a close with no lease if the live NN does not allow it.
You're right that the new lease manager won't (and doesn't need to) update the lease during the renames, but I'm not sure that I fully understand your comments.
My mild concern/question is let's say this change is in 2.6. We deploy it, run a week, find a horrible problem and back up to 2.5 because it has the same layout. If 2.5 replays the edits, will it be in a sane state due to the missing lease reassigns? I wish I had time to study the code and think about it, but I don't, so I'm posing the question.