Took a while to load back the NM code into my memory
Had a comprehensive look at the patch. Mostly looks good. Some comments:
General: Please wrap around lines longer than 80 chars (you can look at the patch itself to figure out the lines).
- handle() method already takes a writeLock, doesn't need to be synchronized.
- RM doesn't kill containers using stopContainer(), it instead sends a FINISH_CONTAINER in the heartbeat. So the error message should only refer to ApplicationMaster.
RELEASE_CONTAINER_RESOURCES is always sent along with CLEANUP_CONTAINER_RESOURCES event. I think we should just merge these into CLEANUP_CONTAINER_RESOURCES event itself. This will also be inline with the fact that creation of container-dirs and the localization of files both happen as part of a single event INIT_CONTAINER_RESOURCES, so cleanup should also be a single event. We can send a Map<LocalResourceVisibility, Collection<LocalResourceRequest>> as the event payload. To be symmetric, we should probably also merge the multiple INIT_CONTAINER_RESOURCES calls one for each LocalResourceVisibility to be a single event. Thoughts?
- Shall we rename ContainerRelMatcher to ResourcesReleasedMatcher? Similarly ContainerReqMatcher to ResourcesRequestedMatcher?
- Most of the new static methods and classes added can be private.
- Good that you've refactored out a WrappedContainer!
- When we use this WrappedContainer, we always want to drain events immediately. So, I think we should drop passing a drain boolean for every method and always drain the events. We can rename maybeDrain(boolean) method to be drainDispatcherEvents() and use the same. Improves readability IMO.
- testLocalizationCleanup() can be renamed to testResourcesRelease() as we are only verifying the ref-counts here?
- We can try different refCount values for different resource types (all of the types use a ref-count 1 in your tests).
The tests are getting a bit difficult to read because of the mocks. In both the above tests, I think we can use the real objects particularly for records like ContainerId, ApplicationId using the BuilderUtil APIs. We don't really need to mock these, it just adds more code and confuses as to what fields are mocked and what aren't. We seem to mocking everything as opposed to mocking only those objects that need modified behaviour. Some of these records:
- getMockRsrc(), createResource(), getMockContainerId(),
- creating appId in testLocalizationCleanup()
- creating records in getMockContainer()
- getPath(), getMockedResource(), getAppMockedResource(), getPublicMockedResource() and getPrivateMockedResource()
It doesn't look like an LRU cleanup is implemented, which i believe was added recently to 20security - that would be a separate jira.
LRU was put into trunk, but it was on MRV1.
There is existing code for purging of cache under disk pressure - See ResourceLocalization.CacheCleanup and ResourceRetentionSet. (We need tests for this though, will file a ticket) This only deletes files that aren't in use at all. By LRU, do you mean selective deletion of these files based on their usage? Can you please point me to the relevant MRV1 JIRA? Thanks!