Hadoop Map/Reduce
  MAPREDUCE-1140

Per cache-file refcount can become negative when tasks release distributed-cache files

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2, 0.21.0, 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a bug in DistributedCache, to not decrement reference counts for unreferenced files in error conditions.
    1. patch-1140.txt
      7 kB
      Amareshwari Sriramadasu
    2. patch-1140-1.txt
      8 kB
      Amareshwari Sriramadasu
    3. patch-1140-ydist.txt
      3 kB
      Amareshwari Sriramadasu
    4. patch-1140-2.txt
      14 kB
      Amareshwari Sriramadasu
    5. patch-1140-ydist.txt
      6 kB
      Amareshwari Sriramadasu
    6. patch-1140-3.txt
      14 kB
      Amareshwari Sriramadasu
    7. patch-1140-2-ydist.txt
      6 kB
      Amareshwari Sriramadasu
    8. patch-1140-3-y20.txt
      16 kB
      Hemanth Yamijala

      Activity

      Vinod Kumar Vavilapalli created issue -
      Vinod Kumar Vavilapalli added a comment -

      Here's the pseudocode that each Task runs w.r.t. distributed-cache files:

      TaskSetup() {
        for(URI uri: list of cache uris) {
          localize(URI)
        }
      }
      
      TaskFinish() {
        for(URI uri: list of cache uris) {
          release(URI)
        }
      }
      

      The problem happens when, out of N URIs to be localized, the first n succeed and localization fails for the (n+1)th URI. In this case, the task goes ahead and does a release on every URI, even though URIs n+2 through N were never localized!

      This may result in negative refcounts for some URIs, and in deletion of files that are actually in use once a refcount reaches zero.
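      The failure mode above can be sketched with a minimal, self-contained model. The class, the map-based refcount, and the failure simulation are all illustrative stand-ins, not the actual TaskTracker code:

```java
import java.util.*;

// Hypothetical minimal model of per-file reference counting in the
// distributed cache; names are illustrative, not the real Hadoop API.
class CacheRefCounts {
    static Map<String, Integer> refCount = new HashMap<>();

    static void localize(String uri) throws Exception {
        refCount.merge(uri, 1, Integer::sum);     // increment on localization
        if (uri.contains("bad")) {                // simulate a localization failure
            throw new Exception("localization failed for " + uri);
        }
    }

    static void release(String uri) {
        refCount.merge(uri, -1, Integer::sum);    // decrement on release
    }

    // Buggy task lifecycle described above: on failure mid-way through
    // setup, TaskFinish still releases every URI, including ones that
    // were never localized.
    static void runTaskBuggy(List<String> uris) {
        try {
            for (String u : uris) localize(u);
        } catch (Exception e) {
            // task setup fails here; localization stops
        }
        for (String u : uris) release(u);         // BUG: releases all N URIs
    }
}
```

Running this with URIs ("file1", "bad-file", "file3") drives file3's refcount to -1, because file3 was released without ever being localized.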

      Amareshwari Sriramadasu made changes -
      Field Original Value New Value
      Assignee Amareshwari Sriramadasu [ amareshwari ]
      Amareshwari Sriramadasu made changes -
      Attachment patch-1140.txt [ 12424018 ]
      Amareshwari Sriramadasu added a comment -

      Simple patch fixing the bug. Added testcase. Testcase fails without the patch and passes with the patch.

      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12424018/patch-1140.txt
      against trunk revision 832362.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      -1 contrib tests. The patch failed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/223/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/223/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/223/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/223/console

      This message is automatically generated.

      Vinod Kumar Vavilapalli added a comment -

      I don't think this patch will work. If two different attempts use the same set of URIs, with the first succeeding in localizing all files whereas the second fails on some URI, the second attempt will decrement refcounts for URIs that it has not itself localized!

      We need to maintain a map of attempts to the corresponding
      localized-files.

      Also, a URI has to be added to the map only after its localization is done, i.e., after getLocalCache(), not before as is done in this patch.

      Amareshwari Sriramadasu added a comment -

      We need to maintain a map of attempts to the corresponding localized-files.

      The list of localizedCacheFiles is maintained in TaskDistributedCacheManager, which is one per attempt.

      a URI has to be added to the map only after its localization is done, i.e., after getLocalCache(), not before as is done in this patch.

      This is done because getLocalCache increments referenceCount first and then localizes. The reference count should be decremented for the one that just failed as well, so it should be added to the list before the getLocalCache call.

      Vinod Kumar Vavilapalli added a comment -

      Thanks for the explanations. Sounds correct. I take back my complaints.

      Minor: Can you please comment that getReferenceCount() method is intended only for testing?

      Amareshwari Sriramadasu added a comment -

      Cancelling the patch to fix TestTrackerDistributedCacheManagerWithLinuxTaskController

      Amareshwari Sriramadasu made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Amareshwari Sriramadasu added a comment -

      Fixed the testcase. Also added the comment that Vinod suggested.

      Amareshwari Sriramadasu made changes -
      Attachment patch-1140-1.txt [ 12424097 ]
      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12424097/patch-1140-1.txt
      against trunk revision 833006.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 6 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed core unit tests.

      -1 contrib tests. The patch failed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/226/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/226/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/226/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/226/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Hudson could not run the contrib tests because of a network problem. Resubmitting.

      Amareshwari Sriramadasu made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Hadoop QA added a comment -

      +1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12424097/patch-1140-1.txt
      against trunk revision 833275.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 6 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/129/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/129/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/129/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/129/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Patch for Yahoo! distribution branch 0.20

      Amareshwari Sriramadasu made changes -
      Attachment patch-1140-ydist.txt [ 12424359 ]
      Jothi Padmanabhan added a comment -

      Patch looks good to me. +1

      Hemanth Yamijala added a comment -

      This is done because getLocalCache increments referenceCount first and then localizes. The reference count should be decremented for the one that just failed as well, so it should be added to the list before the getLocalCache call.

      Umm. But (at least theoretically) it is still possible that a call to getLocalCache fails before referenceCount is incremented. For example, makeRelative throws IOException, as does getLocalCacheForWrite. We would then record a file as localized (by storing it in localizedCacheFiles) even though its reference count was never incremented, and releaseCache would still have the bug this JIRA is about.

      One more point I am slightly uncomfortable about is the duplication of state because of the new list localizedCacheFiles.

      Here's an alternate proposal:

      • Modify CacheFile to have a boolean isLocalized. By default this is false; it is set to true only if distributedCacheManager.getLocalCache returns successfully.
      • To handle the case you have mentioned above, where a failure can happen after referenceCount is incremented in getLocalCache, I would suggest we catch exceptions inside getLocalCache and, on an exception, decrement the referenceCount and re-throw. This seems right to me, because if getLocalCache doesn't complete, shouldn't we be consistent by decrementing the reference count?

      Would this work?
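      A minimal sketch of this proposal, assuming a map-based refcount and a simulated failure (the class and field names follow the comment above, but the internals are illustrative, not the real TrackerDistributedCacheManager):

```java
import java.util.*;

// Illustrative model: release only files actually marked localized,
// and roll back the refcount inside getLocalCache on failure.
class CacheManagerSketch {
    static Map<String, Integer> refCount = new HashMap<>();

    static class CacheFile {
        final String uri;
        boolean isLocalized = false;  // set only after getLocalCache succeeds
        CacheFile(String uri) { this.uri = uri; }
    }

    static void getLocalCache(String uri) throws Exception {
        refCount.merge(uri, 1, Integer::sum);     // incremented up front, as in trunk
        try {
            if (uri.contains("bad")) {            // simulate a localization failure
                throw new Exception("localization failed for " + uri);
            }
        } catch (Exception e) {
            refCount.merge(uri, -1, Integer::sum); // roll back and re-throw
            throw e;
        }
    }

    static void setup(List<CacheFile> files) {
        for (CacheFile f : files) {
            try {
                getLocalCache(f.uri);
                f.isLocalized = true;
            } catch (Exception e) {
                break;                             // setup fails; stop localizing
            }
        }
    }

    static void releaseAll(List<CacheFile> files) {
        for (CacheFile f : files) {
            if (f.isLocalized) {                   // never touch unlocalized files
                refCount.merge(f.uri, -1, Integer::sum);
            }
        }
    }
}
```

With three files where the second fails, release leaves every refcount at zero instead of driving the third one negative.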

      Hemanth Yamijala added a comment -

      Some comments on the test case:

      • Can we refactor TestTrackerDistributedCacheManager and its subclass for LinuxTaskController by introducing an API like canRun in the base class? Let it return true there, and override it in TestTrackerDistributedCacheManagerWithLinuxTaskController to return false if the platform is not Linux. This way, I suppose, we don't have to keep overriding every testcase in the base class every time we make a change.
      • I would enhance testReferenceCount to add 3 files instead of 2, the first file should be localized correctly, the second should fail after the reference count is incremented, and the third will not be localized at all. Then we can verify that the reference count for all files is 0.
      Jothi Padmanabhan added a comment -

      In getLocalCache, if the file is already present, only the reference count is incremented, so no IOException is thrown. If the file is not present, cachedArchives.put is done only after the necessary processing (the methods that can throw IOExceptions) is complete. So I do not think this patch will lead to any inconsistency.

      I do agree, however, that refactoring the test case to avoid duplication is a good idea.

      Hemanth Yamijala added a comment -

      Jothi, thanks for that clarification. I do agree that what I thought of originally will not happen.

      So, just from a correctness perspective, this patch seems right. I still maintain that, from a design perspective, duplicating the state makes me uncomfortable, as does the somewhat counter-intuitive assumption that a cache file's reference count has been incremented (implicitly, by storing it in localizedCacheFiles) before we actually increment it. My specific concern is that it might allow errors to creep in later.

      I am open to thoughts from folks about this.

      Jothi Padmanabhan added a comment -

      Let us go with your proposal; I agree it is less vulnerable to future code changes than the current patch.

      Amareshwari Sriramadasu made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Amareshwari Sriramadasu added a comment -

      Patch with review comments incorporated.

      Amareshwari Sriramadasu made changes -
      Attachment patch-1140-2.txt [ 12425043 ]
      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Hadoop QA added a comment -

      +1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12425043/patch-1140-2.txt
      against trunk revision 836063.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 6 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/245/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/245/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/245/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/245/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Patch for Yahoo! distribution branch 0.20

      Amareshwari Sriramadasu made changes -
      Attachment patch-1140-ydist.txt [ 12425315 ]
      Amareshwari Sriramadasu added a comment -

      test-patch result for the Y!20 patch:

           [exec] -1 overall.
           [exec]
           [exec]     +1 @author.  The patch does not contain any @author tags.
           [exec]
           [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
           [exec]                         Please justify why no tests are needed for this patch.
           [exec]
           [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
           [exec]
           [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
      

      -1 tests included. Please justify why no tests are needed for this patch.

      It is difficult to write a unit test in branch 0.20. Since DistributedCache there consists entirely of static methods, fake objects cannot be used either.

      All unit tests passed on my machine except TestHdfsProxy.

      V.Karthikeyan added a comment -

      Test Scenario:

      Configure local.cache.size to 4GB.

      Ran Job1 with cache files file1 and file2 - Job succeeded.
      Ran Job2 with cache files file3 and file1. While file3 was being localized, removed file3 from dfs - Job2 failed.
      Ran Job3 with cache files file1, file1 (again), and file4. file4 is huge (say 5GB), larger than local.cache.size.
      Without the patch, the bug could be reproduced and verified by validating the logs (file1, file2, and file3 got deleted when the cache was full).

      Test Scenario: (With the patch)

      Configure local.cache.size to 4GB.

      Ran Job1 with cache files file1 and file2 - Job succeeded.
      Ran Job2 with cache files file3 and file1. While file3 was being localized, removed file3 from dfs - Job2 failed.
      Ran Job3 with cache files file1, file1 (again), and file4. file4 is huge (say 5GB), larger than local.cache.size.
      Job3 succeeded, which was also validated using the logs (file2 and file3 got deleted, while file1 was never deleted even though the cache was not large enough to localize file4).

      Hemanth Yamijala added a comment -

      This seems fine to me, except for small points:

      • We are changing the exception thrown from getLocalCache and I don't know what kind of side effects that has. Can we instead do something like this:
        getLocalCache() {
          // get the lcacheStatus
        
          boolean initSuccessful = false;
          try {
            // localize, etc
            initSuccessful = true;
            return localizedPath;
          } finally {
            if (!initSuccessful) {
              synchronized (cachedArchives) {
                lcacheStatus.refcount--;
              }
            }
          }
        }
        

      This way, we do not modify the exception state at all, while still achieving the purpose.

      • Another minor nit is that in the test case, testReferenceCount, we should change the comment "// Configures another job with two regular files." to say three files, per the change you made w.r.t. my other review comment.
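      Made concrete, the try/finally rollback sketched above might look like this. It is a hedged sketch under stated assumptions: the map-based refcount, the failure simulation, and the returned path are stand-ins for the real getLocalCache internals, not the actual Hadoop code:

```java
import java.util.*;

// Illustrative model of the rollback-in-finally pattern: the refcount
// is incremented up front and decremented in a finally block whenever
// localization does not complete, without modifying the exception.
class GetLocalCacheSketch {
    static Map<String, Integer> refCount = new HashMap<>();

    static String getLocalCache(String uri) throws Exception {
        refCount.merge(uri, 1, Integer::sum);    // incremented first, as in trunk
        boolean initSuccessful = false;
        try {
            if (uri.contains("bad")) {           // simulate a localization failure
                throw new Exception("localization failed for " + uri);
            }
            String localizedPath = "/local/cache/" + uri;  // stand-in for the real path
            initSuccessful = true;
            return localizedPath;
        } finally {
            if (!initSuccessful) {
                // roll back; the original exception propagates unchanged
                refCount.merge(uri, -1, Integer::sum);
            }
        }
    }
}
```

A failed call leaves the refcount at zero and re-raises the original exception; a successful call leaves it at one, which is the invariant releaseCache relies on.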
      Amareshwari Sriramadasu added a comment -

      Patch incorporating review comments.

      I would enhance testReferenceCount to add 3 files instead of 2, the first file should be localized correctly, the second should fail after the reference count is incremented, and the third will not be localized at all. Then we can verify that the reference count for all files is 0.

I enhanced the test to add three files, where the first file fails in localization, the second file is already localized by a different job, and the third file is never localized. The path verifying successful localization is already covered.

      Amareshwari Sriramadasu made changes -
      Attachment patch-1140-3.txt [ 12426275 ]
      Amareshwari Sriramadasu made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Amareshwari Sriramadasu made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Amareshwari Sriramadasu added a comment -

      re-submitting for hudson

      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Hemanth Yamijala added a comment -

      I looked at the Yahoo! hadoop 0.20 patch, and have the same comment with respect to throwing the exception in DistributedCache.getLocalCache. Other than that, it looks fine.

Regarding the tests, I spoke offline to Amarsri to understand the scenario executed by Karthikeyan in the comment above. It was not very clear why file1 was being added twice. Some more details on the configuration, such as that it was run on a single node and that max failures was set to 1, should be documented for better understanding.

In order to match the regression tests in trunk, I would suggest we also have in Job 2 a file, say file5, which we should verify is not even localized (because file3 fails localization). Then we can include file5 in Job 3 and make sure localization happens successfully.

      Hemanth Yamijala added a comment -

      +1 for the latest patch on trunk.

      Amareshwari Sriramadasu added a comment -

      Patch for Yahoo! distribution

      Amareshwari Sriramadasu made changes -
      Attachment patch-1140-2-ydist.txt [ 12426383 ]
      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12426275/patch-1140-3.txt
      against trunk revision 884832.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 6 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      -1 contrib tests. The patch failed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/278/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/278/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/278/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/278/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Test failures are unrelated.

      -1 contrib tests

      Is due to MAPREDUCE-1245.

      -1 core tests

TestJobRetire timed out.
Looking at the logs from the console output, I see that the TaskTracker could not come up because of HADOOP-4744, and the job could not complete.
      Here are the logs:

           [exec]     [junit] 2009-11-30 04:48:46,445 INFO  mapred.JobTracker (JobTracker.java:addNewTracker(2148)) - Adding tracker tracker_host0.foo.com:localhost/127.0.0.1:39365 to host host0.foo.com
           [exec]     [junit] 2009-11-30 04:48:46,448 ERROR mapred.TaskTracker (TaskTracker.java:offerService(1313)) - Caught exception: java.io.IOException: Jetty problem. Jetty didn't bind to a valid port
           [exec]     [junit] 	at org.apache.hadoop.mapred.TaskTracker.checkJettyPort(TaskTracker.java:1142)
           [exec]     [junit] 	at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1291)
           [exec]     [junit] 	at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1933)
           [exec]     [junit] 	at org.apache.hadoop.mapred.MiniMRCluster$TaskTrackerRunner.run(MiniMRCluster.java:207)
           [exec]     [junit] 	at java.lang.Thread.run(Thread.java:619)
           [exec]     [junit] 
           [exec]     [junit] 2009-11-30 04:48:46,457 INFO  mapred.TaskTracker (TaskTracker.java:run(731)) - Shutting down: Map-events fetcher for all reduce tasks on tracker_host0.foo.com:localhost/127.0.0.1:39365
      

      The same test passed on my machine.

      Amareshwari Sriramadasu added a comment -

      test-patch result for 0.20 patch :

           [exec] -1 overall.
           [exec]
           [exec]     +1 @author.  The patch does not contain any @author tags.
           [exec]
           [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
           [exec]                         Please justify why no tests are needed for this patch.
           [exec]
           [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
           [exec]
           [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
      

-1 tests included. It is difficult to port the test case to 0.20 because the code in 0.20 is all static methods.

All unit tests passed on my machine except TestHdfsProxy.

      Hemanth Yamijala added a comment -

      I verified TestJobRetire passes on my machine with the patch as well.

      Hence, I committed this to trunk. Thanks, Amareshwari !

      Hemanth Yamijala made changes -
      Status Patch Available [ 10002 ] Resolved [ 5 ]
      Hadoop Flags [Reviewed]
      Fix Version/s 0.22.0 [ 12314184 ]
      Resolution Fixed [ 1 ]
      Iyappan Srinivasan added a comment -
      Regarding the tests, I spoke offline to Amarsri to understand the scenario executed by Karthikeyan in the comment above. It was not very clear why file1 was being added twice. Some more details on configuration - that it was run on a single node, max failures was set to 1 should be documented for better understanding.
      • Configuration :
        The tests run on a single node running JT/NN and another node running TT/DN.
  map.max.attempts is set to 1, reduce.max.attempts is also set to 1, local.cache.size is set to 4 GB, and mapred.local.dir is set to only one spindle, not all the spindles.
  This is done to force the TT to localize in the same path and try deleting the local cache files when the size exceeds 4 GB.

      The idea of the above testcase is :
      Ran Job1 with cache files file1 and file2 - Job succeeded.
      Ran Job2 with cache files file3 and file1. When file3 is getting localized, removed file3 from dfs - Job2 failed.

• Here, since file3 is deleted, the reference count of file1 should not be decremented twice (once during setup and once during cleanup). That's the objective of this scenario.

      Ran Job3 with cache files file1, file1(again) and file4. file4 is huge (say 5GB), larger than local.cache.size.

• To make sure that the decrement happened properly, file1 is added twice. When file4 is added, which
  is larger than the local cache size, other files like file2 and file3 (which were used in the previous jobs)
  get deleted, but not file1 (because its reference count was correct).
      In order to match the regressions tests in trunk, I would suggest we also have in Job 2 a file, say file5, which we should verify is not even localized (because file3 fails localization). Then we can include file 5 in Job 3 and make sure localization happens successfully.

      This scenario is tested and the localization happens successfully.
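The refcount bookkeeping behind this scenario can be sketched in a few lines. This is a simplified simulation, not the TaskTracker code: the increment for a file that fails localization is rolled back immediately, files after the failure are never touched, and cleanup releases only the files that actually localized, so no count can go negative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified simulation of per-file refcounts across task setup and cleanup.
class RefCountSim {
    final Map<String, Integer> refcounts = new HashMap<>();

    // Localizes uris in order and returns the ones actually localized.
    // 'failAt' names the uri whose localization fails (null for none).
    List<String> setup(List<String> uris, String failAt) {
        List<String> localized = new ArrayList<>();
        for (String uri : uris) {
            refcounts.merge(uri, 1, Integer::sum);
            if (uri.equals(failAt)) {
                // roll back the increment for the failed file, as the fix does
                refcounts.merge(uri, -1, Integer::sum);
                break; // uris after the failure are never localized
            }
            localized.add(uri);
        }
        return localized;
    }

    // Cleanup releases only the files that were actually localized.
    void cleanup(List<String> localized) {
        for (String uri : localized) {
            refcounts.merge(uri, -1, Integer::sum);
        }
    }
}
```

Running the Job2 step of the scenario (file3 fails, file1 never reached) through this simulation leaves every count at zero, which is what the patch guarantees.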

      Hudson added a comment -

      Integrated in Hadoop-Mapreduce-trunk-Commit #138 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/138/)

      Hudson added a comment -

      Integrated in Hadoop-Mapreduce-trunk #162 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/162/)

      Amareshwari Sriramadasu made changes -
      Release Note Fixed a bug in DistributedCache, to not decrement reference counts for unreferenced files in error conditions.
      Hemanth Yamijala added a comment -

      More recent patch for earlier version of Hadoop. Not for commit here.

      Hemanth Yamijala made changes -
      Attachment patch-1140-3-y20.txt [ 12431213 ]
      Tom White made changes -
      Fix Version/s 0.21.0 [ 12314045 ]
      Fix Version/s 0.22.0 [ 12314184 ]
      Tom White made changes -
      Status Resolved [ 5 ] Closed [ 6 ]

        People

        • Assignee:
          Amareshwari Sriramadasu
          Reporter:
          Vinod Kumar Vavilapalli
        • Votes:
          0
          Watchers:
          4
