Ok, I need some input here. I'm attacking the symptom of a problem encountered on a production cluster that encountered a series of complex issues that we can't piece together. Long story short, we wound up with leases to non-existent files and the NN couldn't start. When believed the dangling leases were the result of deletions but the cause is ambiguous.
My original intent was to make dangling leases non-fatal. It's arguably reasonable because nothing will be lost - that's not already lost (the file). However that masks whatever bug may have caused the issue in the first place.
- Looking at the code, the inodes are deleted, then the corresponding leases are deleted, and then the edit log is synced outside the write lock so perhaps it's plausible that the edits were not atomically written? I don't think I can write a test that might induce that possible bug.
- The other possibility is that LeaseManager#removeLeaseWithPathPrefix is malfunctioning. It takes a SortedSet, makes another SortedSet backed by the former set, makes a List of Map.Entry from the set of a set, then iterates and the list and performs operations that will affect the original set. Now the javadocs for Map.Entry state:
Map.Entry objects are valid only for the duration of the iteration; more formally, the behavior of a map entry is undefined if the backing map has been modified after the entry was returned by the iterator, except through the setValue operation on the map entry.
It sounds like #2 is relying on undefined behavior twofold: using the entries after the iteration is complete, and using the entries while modifying their set. I can easily fix that but I can't write a test that can reliably induce the former undefined behavior to occur.
So I'm at the crossroads of:
- As a safety measure, make dangling leases non-fatal
- Fix removeLeaseWithPrefix to avoid undefined behavior, and hope it solves the problem