Hadoop Map/Reduce
MAPREDUCE-1914

TrackerDistributedCacheManager never cleans its input directories

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When we localize a file into a node's cache, it is installed in a directory whose subroot is named with a random long. These long-named directories all sit in a single flat directory [per disk, per cluster node]. When the cached file is no longer needed, its reference count drops to zero in a tracking data structure. The file then becomes eligible for deletion once the total space occupied by cached files exceeds 10G [by default] or the total number of such files exceeds 10K.
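The eligibility rule described above can be sketched as follows. This is a minimal illustration, not Hadoop's code: `CacheEntry`, `CacheEvictionSketch`, and `selectForDeletion` are hypothetical names, and reading "10G" as 10 GiB is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a tracked cache entry; not Hadoop's CacheStatus. */
final class CacheEntry {
    final String uniqueDir;   // name of the per-file directory under the flat directory
    final long sizeBytes;     // space occupied by the localized file
    final int refCount;       // zero once no task needs the file

    CacheEntry(String uniqueDir, long sizeBytes, int refCount) {
        this.uniqueDir = uniqueDir;
        this.sizeBytes = sizeBytes;
        this.refCount = refCount;
    }
}

final class CacheEvictionSketch {
    // Defaults taken from the description; the byte interpretation is an assumption.
    static final long DEFAULT_MAX_BYTES = 10L * 1024 * 1024 * 1024;
    static final long DEFAULT_MAX_ENTRIES = 10_000L;

    /**
     * Returns the unreferenced entries, but only when the cache is over
     * either the size limit or the entry-count limit.
     */
    static List<CacheEntry> selectForDeletion(List<CacheEntry> entries,
                                              long maxBytes, long maxEntries) {
        long totalBytes = 0;
        for (CacheEntry e : entries) {
            totalBytes += e.sizeBytes;
        }
        List<CacheEntry> toDelete = new ArrayList<>();
        if (totalBytes <= maxBytes && entries.size() <= maxEntries) {
            return toDelete; // under both limits: nothing is deleted yet
        }
        for (CacheEntry e : entries) {
            if (e.refCount == 0) {
                toDelete.add(e);
            }
        }
        return toDelete;
    }
}
```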

      However, when we delete a cached file, we don't delete the directory that contains it. In particular, the entries of the flat directory are never removed, so they accumulate until they reach a system limit (32K in some cases), at which point the node stops working.

      We need to delete the corresponding entry of the flat directory when we delete the localized cache file it contains.
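A minimal sketch of the requested behavior, using plain `java.nio.file` rather than Hadoop's `FileSystem` or its asynchronous disk service. The class and method names are hypothetical; the point is only that the whole per-file directory is removed, not just the file inside it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class UniqueDirCleanup {
    /**
     * Recursively deletes uniqueDir, which must be a direct child of baseDir
     * (the flat directory). The parent check guards against deleting the
     * flat directory itself or anything outside it.
     */
    static void deleteUniqueDir(Path baseDir, Path uniqueDir) throws IOException {
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = uniqueDir.toAbsolutePath().normalize();
        if (!base.equals(target.getParent())) {
            throw new IllegalArgumentException(target + " is not a direct child of " + base);
        }
        if (!Files.exists(target)) {
            return; // already gone
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(target)) {
            // Reverse order puts children before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Deleting the directory rather than the file keeps the number of entries in the flat directory proportional to the number of live cache entries.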

          Activity

          Amareshwari Sriramadasu added a comment -

          I see that this has been fixed by MAPREDUCE-1538 with the following code change:

          @@ -268,18 +282,9 @@
               for (CacheStatus lcacheStatus : deleteSet) {
                 synchronized (lcacheStatus) {
                   deleteLocalPath(asyncDiskService,
          -            FileSystem.getLocal(conf), lcacheStatus.localizedLoadPath);
          -        // decrement the size of the cache from baseDirSize
          -        synchronized (baseDirSize) {
          -          Long dirSize = baseDirSize.get(lcacheStatus.localizedBaseDir);
          -          if ( dirSize != null ) {
          -            dirSize -= lcacheStatus.size;
          -            baseDirSize.put(lcacheStatus.localizedBaseDir, dirSize);
          -          } else {
          -            LOG.warn("Cannot find record of the baseDir: " + 
          -                     lcacheStatus.localizedBaseDir + " during delete!");
          -          }
          -        }
          +            FileSystem.getLocal(conf), lcacheStatus.getLocalizedUniqueDir());
          +        // Update the maps baseDirSize and baseDirNumberSubDir
          +        deleteCacheInfoUpdate(lcacheStatus);
          

          The above code always deletes lcacheStatus.getLocalizedUniqueDir(), which is the directory this issue reports as never being cleaned up. Am I missing something?

          Dick King added a comment -

          This patch makes two improvements over the code in MAPREDUCE-1538:

          • It tests the deletion of the directories.
          • It improves the loop of potentially 10K iterations that runs with a lock held [see MAPREDUCE-1909] by:
            • reordering the if clause in the loop
            • using cachedArchives.values()

          The full regression test is in progress. I don't expect any problems, since I've done the relevant unit tests.

          Dick King added a comment -

          The regression passes, except for the usual TestTaskTrackerLocalization.

          Dick King added a comment -

          Forgot to mention: this patch also guarantees uniqueness of the local cache directory names. Each name is the concatenation of the creation time of the cache manager and an incremented AtomicLong.
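A minimal sketch of such a naming scheme. `UniqueDirNamer` is a hypothetical class, not the one in the patch, and the underscore separator is an assumption added so that the two parts cannot run together.

```java
import java.util.concurrent.atomic.AtomicLong;

final class UniqueDirNamer {
    private final long creationTimeMillis;               // fixed when the manager is created
    private final AtomicLong counter = new AtomicLong(); // shared, thread-safe sequence

    UniqueDirNamer(long creationTimeMillis) {
        this.creationTimeMillis = creationTimeMillis;
    }

    /** Returns a name that is never repeated by this instance. */
    String next() {
        return creationTimeMillis + "_" + counter.incrementAndGet();
    }
}
```

Within one manager instance the counter alone rules out collisions; the creation-time prefix separates names produced by different instances, provided those instances were not created in the same millisecond.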

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12450923/MAPREDUCE-1914--2010-07-30--1336.patch
          against trunk revision 1074251.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/76//testReport/
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/76//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Sorry to come in late, the patch has gone stale. Can you please rebase? Thanks.

          Given that this is not an issue with MRv2, should we still commit this? I'm happy to, but I'm not sure it's useful. Thanks.


            People

            • Assignee: Dick King
            • Reporter: Dick King
            • Votes: 0
            • Watchers: 2
