Hadoop Map/Reduce / MAPREDUCE-1538

TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels:
      None

      Description

      TrackerDistributedCacheManager deletes cached files once their total size reaches a configured limit.
      However, there is no such limit on the number of subdirectories, so the count can grow until it exceeds the system limit.
      When that happens, the TaskTracker (TT) cannot create a directory in getLocalCache, and the tasks fail.

          Activity

          Scott Chen added a comment -

          When this happens, the log actually shows the following:

          2010-02-25 12:45:41,022 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201002230013_22452_m_003831_0 on tracker_hadoop0143.snc3.facebook.com.:localhost.localdomain/127.0.0.1:37489: java.io.FileNotFoundException: /mnt/d3/SILVER/local/taskTracker/jobcache/job_201002230013_22452/attempt_201002230013_22452_m_003831_0/output/file.out (No space left on device)

          But when we run df on the machine, disk space is not the issue. The failure occurs because the number of subdirectories is too high.
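Since df does not reveal this condition, one way to check it is to count the immediate subdirectories of the cache directory. The following is a minimal sketch using java.nio.file; the class and method names are hypothetical and are not part of Hadoop or of the patch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class SubdirCounter {
  /** Counts the immediate subdirectories of dir (not recursive). */
  static long countSubdirs(Path dir) throws IOException {
    long count = 0;
    // The stream is closed automatically by try-with-resources.
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (Files.isDirectory(entry)) {
          count++;
        }
      }
    }
    return count;
  }
}
```

Comparing the returned count against the limit of the local file system would show whether the directory is close to the point where new subdirectories can no longer be created.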

          Zheng Shao added a comment -

          +1

          Scott Chen added a comment -

          In the patch, we track the number of subdirectories, and when it reaches a threshold, we delete the released cache entries.
          A similar mechanism already exists for the total size, so we adapted it.
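The mechanism described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the class, field, and method names are hypothetical, a reference count of zero stands in for "released", and the check uses "greater than" the limit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks total size and subdirectory count, and evicts
// released entries once either total exceeds its limit.
class ReleasedCacheEvictor {
  static class CacheEntry {
    final long size;     // bytes used by this entry
    final long subdirs;  // subdirectories created for this entry
    int refCount;        // 0 means no task is using the entry (released)

    CacheEntry(long size, long subdirs, int refCount) {
      this.size = size;
      this.subdirs = subdirs;
      this.refCount = refCount;
    }
  }

  private final long allowedSize;
  private final long allowedSubdirs;
  private long totalSize;
  private long totalSubdirs;
  private final Map<String, CacheEntry> entries = new LinkedHashMap<>();

  ReleasedCacheEvictor(long allowedSize, long allowedSubdirs) {
    this.allowedSize = allowedSize;
    this.allowedSubdirs = allowedSubdirs;
  }

  void add(String path, CacheEntry entry) {
    CacheEntry old = entries.put(path, entry);
    if (old != null) {  // keep the totals consistent if a path is replaced
      totalSize -= old.size;
      totalSubdirs -= old.subdirs;
    }
    totalSize += entry.size;
    totalSubdirs += entry.subdirs;
  }

  /** Returns the paths the caller should delete from disk; empty if under both limits. */
  List<String> evictIfOverLimit() {
    List<String> evicted = new ArrayList<>();
    if (totalSize <= allowedSize && totalSubdirs <= allowedSubdirs) {
      return evicted;
    }
    Iterator<Map.Entry<String, CacheEntry>> it = entries.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, CacheEntry> e = it.next();
      CacheEntry c = e.getValue();
      if (c.refCount == 0) {  // only released entries are eligible
        totalSize -= c.size;
        totalSubdirs -= c.subdirs;
        evicted.add(e.getKey());
        it.remove();
      }
    }
    return evicted;
  }
}
```

Entries still in use are never evicted in this sketch, so the totals can remain above a limit until those entries are released.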

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12437300/MAPREDUCE-1538.patch
          against trunk revision 918037.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/13/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/13/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/13/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/13/console

          This message is automatically generated.

          Scott Chen added a comment -

          The failure is caused by MAPREDUCE-1520.
          It is not related to this patch.

          dhruba borthakur added a comment -

          The code looks good. I will commit this patch.

          Scott Chen added a comment -

          Thanks for the help, Dhruba

          dhruba borthakur added a comment -

          resubmit for HadoopQA

          Arun C Murthy added a comment -

          One comment:

          +      long allowedNumberSubDir = conf.getLong(
          +          TTConfig.TT_LOCAL_CACHE_SUBDIRS_LIMIT, DEFAULT_CACHE_SUBDIR_LIMIT);
          

          We should read this value from the conf once and save it, rather than reading it from the conf each time.
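A minimal sketch of that suggestion: read the limits once in a constructor and keep them in final fields. To stay self-contained, java.util.Properties stands in for the Hadoop configuration object, and the key strings and the subdirectory default are hypothetical placeholders; only the 10GB size default comes from this thread.

```java
import java.util.Properties;

// Hypothetical sketch: limits are read once at construction time.
class CacheLimits {
  // Placeholder keys, not the real TTConfig values.
  static final String SIZE_KEY = "example.cache.size.limit";
  static final String SUBDIRS_KEY = "example.cache.subdirs.limit";
  static final long DEFAULT_CACHE_SIZE = 10L * 1024 * 1024 * 1024;  // 10GB
  static final long DEFAULT_CACHE_SUBDIR_LIMIT = 10000;             // assumed value

  private final long allowedSize;
  private final long allowedSubdirs;

  CacheLimits(Properties conf) {
    // Read each value exactly once; later lookups use the saved fields.
    this.allowedSize = getLong(conf, SIZE_KEY, DEFAULT_CACHE_SIZE);
    this.allowedSubdirs = getLong(conf, SUBDIRS_KEY, DEFAULT_CACHE_SUBDIR_LIMIT);
  }

  long allowedSize() {
    return allowedSize;
  }

  long allowedSubdirs() {
    return allowedSubdirs;
  }

  private static long getLong(Properties conf, String key, long defaultValue) {
    String value = conf.getProperty(key);
    return value == null ? defaultValue : Long.parseLong(value.trim());
  }
}
```

One consequence of this design is that later changes to the configuration object are not picked up, which is the usual trade-off when a value is cached at construction time.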

          Arun C Murthy added a comment -

          Also, I'd propose we have a single data-structure to track sizes and #sub-dirs for a given basedir rather than 2 separate maps. Something along the lines of:

          static class CachedDir {
            long size;
            long subdirs;
          }
          
          Scott Chen added a comment -

          Arun: Both of your suggestions are really good. I will make the change. getLocalCache() also contains:

                // setting the cache size to a default of 10GB
                long allowedSize = conf.getLong(TTConfig.TT_LOCAL_CACHE_SIZE,
                    DEFAULT_CACHE_SIZE);

          I will save this variable as well, following your suggestion.

          Scott Chen added a comment -

          1. Move reading config to the constructor
          2. Use a class to store cache directory properties
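The second change might look roughly like the sketch below: a single map from base directory to one object holding both totals, instead of two parallel maps. The CachedDir fields follow the snippet proposed earlier in this thread; the surrounding class and its method names are hypothetical and do not reflect the committed code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one map entry per base directory holds both totals.
class BaseDirTracker {
  static class CachedDir {
    long size;     // total bytes cached under this base directory
    long subdirs;  // total subdirectories under this base directory
  }

  private final Map<String, CachedDir> baseDirs = new HashMap<>();

  void add(String baseDir, long size, long subdirs) {
    CachedDir dir = baseDirs.computeIfAbsent(baseDir, k -> new CachedDir());
    dir.size += size;
    dir.subdirs += subdirs;
  }

  void remove(String baseDir, long size, long subdirs) {
    CachedDir dir = baseDirs.get(baseDir);
    if (dir != null) {
      dir.size -= size;
      dir.subdirs -= subdirs;
    }
  }

  /** True if the base directory exceeds either limit; unknown directories are never over. */
  boolean overLimit(String baseDir, long allowedSize, long allowedSubdirs) {
    CachedDir dir = baseDirs.get(baseDir);
    return dir != null && (dir.size > allowedSize || dir.subdirs > allowedSubdirs);
  }
}
```

Keeping both counters in one object means a single lookup updates them together, so they cannot drift apart the way two separately maintained maps could.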

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12441877/MAPREDUCE-1538-v2.txt
          against trunk revision 933441.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/114/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/114/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/114/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/114/console

          This message is automatically generated.

          Scott Chen added a comment -

          Got 105 test failures with the following message:

          Error Message
          
          Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.
          Stacktrace
          
          junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.
          
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12441877/MAPREDUCE-1538-v2.txt
          against trunk revision 933441.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/116/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/116/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/116/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/116/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          I just committed this. Thanks Scott!

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #289 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/289/)
          MAPREDUCE-1538. TrackerDistributedCacheManager manages the
          number of files. (Scott Chen via dhruba)


            People

            • Assignee: Scott Chen
            • Reporter: Scott Chen
            • Votes: 0
            • Watchers: 6
