  1. Hadoop YARN
  2. YARN-5767

Fix the order that resources are cleaned up from the local Public/Private caches


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0, 2.7.0, 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 3.0.0-alpha2
    • None
    • Reviewed
    • This issue fixes a bug in how resources are evicted from the PUBLIC and PRIVATE YARN local caches that the NodeManager uses for resource localization. In summary, the caches are now cleaned according to an LRU policy that spans both the public and private caches.

    Description

      If you look at ResourceLocalizationService#handleCacheCleanup, you can see that public resources are added to the ResourceRetentionSet first, followed by private resources, one user's tracker at a time:

      private void handleCacheCleanup(LocalizationEvent event) {
        ResourceRetentionSet retain =
          new ResourceRetentionSet(delService, cacheTargetSize);
        retain.addResources(publicRsrc);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Resource cleanup (public) " + retain);
        }
        for (LocalResourcesTracker t : privateRsrc.values()) {
          retain.addResources(t);
          if (LOG.isDebugEnabled()) {
            LOG.debug("Resource cleanup " + t.getUser() + ":" + retain);
          }
        }
        //TODO Check if appRsrcs should also be added to the retention set.
      }
      

      Unfortunately, ResourceRetentionSet#addResources runs its eviction loop at the end of every call, not once after all trackers have been added. This means public resources are deleted first, until the target cache size is met:

      public void addResources(LocalResourcesTracker newTracker) {
        for (LocalizedResource resource : newTracker) {
          currentSize += resource.getSize();
          if (resource.getRefCount() > 0) {
            // always retain resources in use
            continue;
          }
          retain.put(resource, newTracker);
        }
        for (Iterator<Map.Entry<LocalizedResource,LocalResourcesTracker>> i =
               retain.entrySet().iterator();
             currentSize - delSize > targetSize && i.hasNext();) {
          Map.Entry<LocalizedResource,LocalResourcesTracker> rsrc = i.next();
          LocalizedResource resource = rsrc.getKey();
          LocalResourcesTracker tracker = rsrc.getValue();
          if (tracker.remove(resource, delService)) {
            delSize += resource.getSize();
            i.remove();
          }
        }
      }
      

      As a result, resources in the private cache are deleted only when:

      1. The cache size is larger than the target cache size and the public cache is empty.
      2. The cache size is larger than the target cache size and everything in the public cache is being used by a running container.

      For clusters that primarily use the public cache (i.e. make use of the shared cache), this means that the most commonly used resources can be deleted before old resources in the private cache. Furthermore, the private cache can continue to grow over time, causing more and more churn in the public cache.
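The behavior described above can be reproduced with a small stand-alone model. This is only a sketch, not the Hadoop code: `Rsrc` is a hypothetical stand-in for LocalizedResource, trackers are plain lists, and, because the excerpt does not show how `retain` orders its entries, the sketch assumes they are visited in the order they were added, which is what the description implies. Sizes are in MB and the target is 100.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerTrackerEvictionSketch {
  /** Hypothetical stand-in for LocalizedResource (not the Hadoop class). */
  static final class Rsrc {
    final String name;
    final long size;
    final int refCount;

    Rsrc(String name, long size, int refCount) {
      this.name = name;
      this.size = size;
      this.refCount = refCount;
    }
  }

  // Assumption: eviction candidates are iterated in insertion order.
  private final Map<Rsrc, String> retain = new LinkedHashMap<>();
  private final List<String> deleted = new ArrayList<>();
  private final long targetSize;
  private long currentSize;
  private long delSize;

  PerTrackerEvictionSketch(long targetSize) {
    this.targetSize = targetSize;
  }

  /** Same shape as the excerpt: add one tracker, then evict immediately. */
  void addResources(String trackerName, List<Rsrc> tracker) {
    for (Rsrc r : tracker) {
      currentSize += r.size;
      if (r.refCount > 0) {
        continue; // resources in use are never eviction candidates
      }
      retain.put(r, trackerName);
    }
    Iterator<Map.Entry<Rsrc, String>> i = retain.entrySet().iterator();
    while (currentSize - delSize > targetSize && i.hasNext()) {
      Rsrc r = i.next().getKey();
      delSize += r.size;
      deleted.add(r.name);
      i.remove();
    }
  }

  List<String> deleted() {
    return deleted;
  }

  public static void main(String[] args) {
    // Public cache alone exceeds the target, so it is trimmed before the
    // private tracker has even been looked at.
    PerTrackerEvictionSketch a = new PerTrackerEvictionSketch(100);
    a.addResources("public",
        List.of(new Rsrc("pub-hot-1", 60, 0), new Rsrc("pub-hot-2", 60, 0)));
    a.addResources("user1", List.of(new Rsrc("priv-stale", 30, 0)));
    System.out.println(a.deleted()); // prints [pub-hot-1]

    // Case 2 from the list: every public resource is in use, so the private
    // resource is the only candidate left.
    PerTrackerEvictionSketch b = new PerTrackerEvictionSketch(100);
    b.addResources("public",
        List.of(new Rsrc("pub-hot-1", 60, 1), new Rsrc("pub-hot-2", 60, 1)));
    b.addResources("user1", List.of(new Rsrc("priv-stale", 30, 0)));
    System.out.println(b.deleted()); // prints [priv-stale]
  }
}
```

In the first scenario the stale private resource survives while a public resource is removed, purely because the public tracker was processed first.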

      Additionally, the same problem exists within the private cache. Since resources are added to the retention set on a user-by-user basis, resources get cleaned up one user at a time, in the order that privateRsrc.values() returns the LocalResourcesTracker instances. So if user1 has 10MB in their cache, user2 has 100MB in their cache, and the target size of the cache is 50MB, user1 could potentially have their entire cache removed before anything is deleted from the user2 cache.
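For contrast, here is a minimal sketch of an eviction pass that considers all trackers together and removes the least recently used resources first, which is the policy the release note describes. It is not the attached patch: `Rsrc`, `lastUsed`, and `selectForDeletion` are hypothetical names introduced for illustration. The sample data uses the user1 (10MB) and user2 (100MB) figures from the example above with a 50MB target, and assumes user2's resources are the older ones.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlobalLruEvictionSketch {
  /** Hypothetical stand-in for LocalizedResource (not the Hadoop class). */
  static final class Rsrc {
    final String name;
    final long sizeMb;
    final long lastUsed; // larger value = used more recently
    final int refCount;

    Rsrc(String name, long sizeMb, long lastUsed, int refCount) {
      this.name = name;
      this.sizeMb = sizeMb;
      this.lastUsed = lastUsed;
      this.refCount = refCount;
    }
  }

  /**
   * Collects candidates from every cache first, then evicts oldest-first
   * until the total size fits the target. Returns names in deletion order.
   */
  static List<String> selectForDeletion(List<Rsrc> all, long targetMb) {
    long currentSize = 0;
    List<Rsrc> candidates = new ArrayList<>();
    for (Rsrc r : all) {
      currentSize += r.sizeMb;
      if (r.refCount == 0) {
        candidates.add(r); // resources in use are always retained
      }
    }
    candidates.sort(Comparator.comparingLong((Rsrc r) -> r.lastUsed));

    List<String> toDelete = new ArrayList<>();
    for (Rsrc r : candidates) {
      if (currentSize <= targetMb) {
        break;
      }
      toDelete.add(r.name);
      currentSize -= r.sizeMb;
    }
    return toDelete;
  }

  public static void main(String[] args) {
    List<Rsrc> all = List.of(
        // user1: 10MB total, recently used
        new Rsrc("user1-a", 5, 90, 0),
        new Rsrc("user1-b", 5, 95, 0),
        // user2: 100MB total, older
        new Rsrc("user2-a", 25, 10, 0),
        new Rsrc("user2-b", 25, 20, 0),
        new Rsrc("user2-c", 25, 30, 0),
        new Rsrc("user2-d", 25, 40, 0));
    System.out.println(selectForDeletion(all, 50));
    // prints [user2-a, user2-b, user2-c]
  }
}
```

Because the candidates are ordered only after every tracker has contributed, the order in which users are visited no longer affects which resources are removed.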

      Attachments

        1. YARN-5767-trunk-v4.patch
          32 kB
          Chris Trezzo
        2. YARN-5767-trunk-v3.patch
          31 kB
          Chris Trezzo
        3. YARN-5767-trunk-v2.patch
          31 kB
          Chris Trezzo
        4. YARN-5767-trunk-v1.patch
          30 kB
          Chris Trezzo

            People

              Assignee: Chris Trezzo (ctrezzo)
              Reporter: Chris Trezzo (ctrezzo)
              Votes: 0
              Watchers: 6
