Presently #getLiveDatanodeStorageReport() is called for every file, and the computation is done on its result. This Jira sub-task is to discuss and implement a cache mechanism that reduces the number of these calls. We could also define a configurable refresh interval and periodically refresh the DN cache by fetching the latest #getLiveDatanodeStorageReport() on that interval.
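As a starting point for the discussion, here is a minimal sketch of a time-based cache. It is not HDFS code: CachedReport, its loader and the injected clock are hypothetical names, and the loader stands in for whatever call actually fetches the live datanode storage report. It refreshes lazily on read once the interval has elapsed, rather than from a background thread.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical time-based cache (not part of HDFS). The loader stands in
 * for the expensive report fetch; the clock is injected so the refresh
 * behaviour can be tested without sleeping.
 */
final class CachedReport<T> {
  private final Supplier<T> loader;
  private final long refreshIntervalMs;
  private final LongSupplier clockMs;

  private T cached;
  private long lastRefreshMs;
  private boolean loaded;

  CachedReport(Supplier<T> loader, long refreshIntervalMs, LongSupplier clockMs) {
    this.loader = loader;
    this.refreshIntervalMs = refreshIntervalMs;
    this.clockMs = clockMs;
  }

  /** Returns the cached value, reloading it only when the interval has elapsed. */
  synchronized T get() {
    long now = clockMs.getAsLong();
    if (!loaded || now - lastRefreshMs >= refreshIntervalMs) {
      cached = loader.get();   // the only place the expensive call happens
      lastRefreshMs = now;
      loaded = true;
    }
    return cached;
  }
}
```

With this shape, per-file processing reads from get() and the expensive fetch happens at most once per interval, no matter how many files are processed in between. The interval would come from a new configuration key, which still needs to be named.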
Adding getDatanodeStorageReport is concerning. getDatanodeListForReport is already a very bad method that should be avoided for anything but jmx, and even then it’s a concern. I eliminated calls to it years ago. All it takes is an nscd/dns hiccup and you’re left holding the fsn lock for an excessive length of time. Beyond that, the response is going to be pretty large, and tagging all the storage reports is not going to be cheap.
Does verifyTargetDatanodeHasSpaceForScheduling really need the namesystem lock? Can’t DatanodeDescriptor#chooseStorage4Block synchronize on its storageMap?
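To make the question concrete, here is a simplified sketch of per-node locking. NodeSketch is a hypothetical stand-in, not the real DatanodeDescriptor: its storageMap maps a storage ID to remaining bytes, and both the update and the space check synchronize on that map only, so no global lock is involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a datanode descriptor (not the HDFS class). */
final class NodeSketch {
  // storage ID -> remaining bytes; guarded by synchronizing on the map itself
  private final Map<String, Long> storageMap = new HashMap<>();

  /** Records the latest remaining space for one storage. */
  void updateStorage(String storageId, long remainingBytes) {
    synchronized (storageMap) {
      storageMap.put(storageId, remainingBytes);
    }
  }

  /** Returns the ID of a storage with room for the block, or null if none has room. */
  String chooseStorageForBlock(long blockSize) {
    synchronized (storageMap) {
      for (Map.Entry<String, Long> e : storageMap.entrySet()) {
        if (e.getValue() >= blockSize) {
          return e.getKey();
        }
      }
    }
    return null;
  }
}
```

Whether this is sufficient in the real code depends on what else the caller reads under the namesystem lock. The sketch only shows that the space check itself needs nothing beyond the node's own map.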
It appears to be calling getLiveDatanodeStorageReport for every file. As mentioned earlier, this is NOT cheap. The SPS should be able to operate on a fuzzy/cached state of the world. It then fetches another datanode report just to determine the number of live nodes and decide whether it should sleep before processing the next path. The node count from the prior cached view of the world should suffice.