The new config keys are missing documentation in hdfs-default.xml.
requestBlockReportLeaseId: there is an empty catch block for the unregistered-node case; we could add some more informative logging there rather than relying on the warn below.
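Something along these lines (a sketch only; the exception type, method shape, and the lease-id value are illustrative, not the actual HDFS code):

```java
// Sketch only: log the unregistered-node case instead of swallowing it in
// an empty catch block. The exception and method names are illustrative.
public class LeaseRequestSketch {
    static class UnregisteredNodeException extends Exception {
        UnregisteredNodeException(String msg) { super(msg); }
    }

    // Returns the granted lease id, or 0 if no lease could be granted.
    static long requestBlockReportLeaseId(boolean nodeRegistered) {
        try {
            if (!nodeRegistered) {
                throw new UnregisteredNodeException("node is not registered");
            }
            return 42L;  // stand-in for a real lease id
        } catch (UnregisteredNodeException e) {
            // Informative message at the point of failure, rather than
            // relying on a later warn elsewhere.
            System.err.println("Cannot grant block report lease: "
                + e.getMessage());
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(requestBlockReportLeaseId(true));   // 42
        System.out.println(requestBlockReportLeaseId(false));  // 0
    }
}
```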
I discussed the NodeData structure with Colin offline, wondering why we didn't use a standard Collection. Colin's reason was reducing garbage, which seems valid. I think we should consider implementing IntrusiveCollection, though, rather than writing another intrusive list.
Yes, there will be quite a few of these requests coming in at any given point. IntrusiveCollection is an interface rather than an implementation, so I don't think it would help here (it's most useful when an element needs to be in multiple lists at once, and when you need fancy operations like finding the list from the element).
I also asked about putting NodeData into DatanodeDescriptor. I'm not sure what the conclusion was on this; it might reduce garbage, since we wouldn't need a separate NodeData object.
The locking is easier to understand if all the lease data is inside BlockReportLeaseManager.
I prefer Preconditions checks for invalid configuration values at startup, so there aren't any surprises for the user later. Not everyone reads the messages on startup.
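The idea is to fail fast with a clear message rather than log and limp along. A minimal sketch (the validated key and its bound are illustrative; the real code would use Guava's Preconditions for the same effect):

```java
// Sketch only: reject an invalid config value at startup with a clear
// message. The specific key and bound are illustrative.
public class ConfigCheckSketch {
    static long validateLeaseLengthMs(long value) {
        if (value <= 0) {
            throw new IllegalArgumentException(
                "block report lease length must be positive, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(validateLeaseLengthMs(300_000L));  // 300000
    }
}
```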
requestLease has a check for isTraceEnabled but then logs at debug level; the guard and the log call should use the same level.
In offerService, we ignore the new lease ID if we already have one. On the NN, though, a new request wipes out the old lease ID, and processReport checks based on lease ID rather than node. This kind of bug makes me wonder whether we really need the lease ID at all; why not just attach a boolean to the node, or track whether it's in the deferred vs. pending list?
It's safer for the NameNode to wipe the old lease ID every time there is a new request. It avoids problems where the DN went down while holding a lease, and then came back up. We could potentially also avoid those problems by being very careful with node (un)registration, but why make things more complicated than they need to be? I do think that the DN should overwrite its old lease ID if the NN gives it a new one, for the same reason. Let me change it to do that... Of course this code path should never happen since the NN should never give a new lease ID when none was requested. So calling this a "bug" seems like a bit of a stretch.
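The DN-side change described above could look roughly like this (a sketch; the field and method names are illustrative, not the actual offerService code):

```java
// Sketch only of the DN-side change: adopt whatever non-zero lease id the
// NN returns instead of keeping the old one. Names are illustrative.
public class OfferServiceSketch {
    long fullBlockReportLeaseId = 0;

    // Called with the lease id carried in a heartbeat response.
    void updateLeaseId(long newLeaseId) {
        if (newLeaseId != 0) {
            // Overwrite unconditionally: if the NN issued a fresh lease
            // (e.g. after this DN restarted), the old id is stale.
            fullBlockReportLeaseId = newLeaseId;
        }
    }

    public static void main(String[] args) {
        OfferServiceSketch dn = new OfferServiceSketch();
        dn.updateLeaseId(7L);
        dn.updateLeaseId(9L);  // new id from the NN replaces the old one
        dn.updateLeaseId(0L);  // 0 means "no new lease"; keep the current id
        System.out.println(dn.fullBlockReportLeaseId);  // 9
    }
}
```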
I prefer IDs to simply checking against the datanode UUID, because lease IDs allow us to match up the NN granting a lease with the DN accepting and using it, which is very useful for debugging or understanding what is happening in production. It also makes it very obvious whether a DN is "cheating" by sending a block report with leaseID = 0 to disable rate-limiting. This is a use-case we want to support but we also want to know when it is going on.
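The "supported but visible" escape hatch could be sketched like so (illustrative names only; the real processReport logic is more involved):

```java
// Sketch only: lease id 0 bypasses rate limiting, but is logged so we can
// see when a DN uses it. Method and variable names are illustrative.
public class LeaseCheckSketch {
    static boolean checkLease(long reportedLeaseId, long currentLeaseId,
                              String nodeId) {
        if (reportedLeaseId == 0) {
            // Rate limiting disabled by the DN; allowed, but made visible.
            System.err.println("Processing unthrottled block report from "
                + nodeId);
            return true;
        }
        // Matching ids pair the NN's grant with the DN's use of it.
        return reportedLeaseId == currentLeaseId;
    }

    public static void main(String[] args) {
        System.out.println(checkLease(0L, 5L, "dn-1"));  // true, but logged
        System.out.println(checkLease(5L, 5L, "dn-1"));  // true
        System.out.println(checkLease(4L, 5L, "dn-1"));  // false
    }
}
```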
Can we fix the javadoc for scheduleBlockReport to mention the randomness, rather than "send...at the next heartbeat"? It's incorrect right now.
I looked pretty far back into the history of this code. It seems to go back to at least 2009. The underlying ideas seem to be:
1. the first full block report can have a configurable delay in seconds expressed by dfs.blockreport.initialDelay
2. the second full block report gets a random delay between 0 and dfs.blockreport.intervalMsec
3. all other block reports get an interval of dfs.blockreport.intervalMsec, unless the previous block report had a longer interval than expected; in that case, the next one gets a shorter interval.
We can keep behavior #1... it's simple to implement and may be useful for testing (although I think this patch makes it no longer necessary).
Behavior #2 seems like a workaround for the lack of congestion control in the past. In a world where the NN rate-limits full block reports, we don't need this behavior to prevent FBRs from "clumping". They will just naturally not overly clump because we are rate-limiting them.
Behavior #3 just seems incorrect, even without this patch. By definition, a full block report contains all the information the NN needs to understand the DN state. The fact that block report interval N was longer than expected is no reason to shorten interval N+1. In fact, this behavior seems like it could lead to congestion collapse: if the NN gets overloaded and can't handle block reports for some time, a bunch of DNs will shorten the time between the current block report and the next one, further increasing total NN load. Not good. Not good at all.
I replaced this with a simple "randomize the first block report time within 0 and dfs.blockreport.initialDelay, then try to do all other block reports after dfs.blockreport.intervalMsec ms" policy. If the full block report interval was more than 2x what was configured, we whine about it in the log file (this should only happen if the NN is under extreme load).
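The simplified schedule boils down to something like this (a sketch under illustrative names, not the actual patch):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch only of the simplified schedule: the first FBR lands at a random
// point within the initial delay, each later FBR is one fixed interval
// away, and we only complain when the observed gap exceeds 2x the
// configured interval. Names are illustrative.
public class BlockReportSchedulerSketch {
    // Time of the first block report, in ms relative to startup.
    static long scheduleFirst(long initialDelayMs) {
        return initialDelayMs == 0
            ? 0
            : ThreadLocalRandom.current().nextLong(initialDelayMs);
    }

    // Next report is one interval from now; never shortened to "catch up".
    static long scheduleNext(long intervalMs, long lastReportMs, long nowMs) {
        if (nowMs - lastReportMs > 2 * intervalMs) {
            System.err.println("Last full block report was "
                + (nowMs - lastReportMs) + " ms ago; expected about "
                + intervalMs + " ms (NameNode under extreme load?)");
        }
        return nowMs + intervalMs;
    }

    public static void main(String[] args) {
        System.out.println(scheduleFirst(0));                // 0
        System.out.println(scheduleNext(1000L, 0L, 500L));   // 1500
        System.out.println(scheduleNext(1000L, 0L, 5000L));  // 6000, warns
    }
}
```

Note that an overlong previous interval only produces a log message; it never shrinks the next interval, which is what avoids the congestion-collapse feedback loop described above.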
Have you thought about moving the BR scheduler to the NN side? We still rely on the DNs to jitter themselves and do the initial delay, but we could have the NN handle all of this. It would also let the NN trigger FBRs whenever it wants, and we could do better than random scheduling, e.g. stride it rather than jitter. It's incompatible, so we probably won't, but it's fun to think about.
Yeah, we could do more on the NN to ensure fairness and so forth. I think it's not really a big problem since datanodes aren't evil, and the existing method of configuring BR period on the datanode side seems to be working well. We also tend to assume uniform cluster load in HDFS, an assumption which makes complex BR scheduling less interesting. But maybe some day...
Could we do the BRLManager register/unregister in addDatanode and removeDatanode? I think this is safe, since on a DN restart it'll provide a lease ID of 0 and send an FBR, even without a reg/unreg.
Seems reasonable. I moved the BRLM register/unregister calls from registerDatanode into addDatanode / removeDatanode.