Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1098

Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels:
      None
    • Release Note:
      Fixed the distributed cache's localizeCache to lock only the uri it is localizing.

      Description

      Currently org.apache.hadoop.filecache.DistributedCache.getLocalCache(URI, Configuration, Path, FileStatus, boolean, long, Path, boolean) allows only one TaskRunner thread in TT to localize DistributedCache across jobs. Current way of synchronization is across baseDir this has to be changed to lock on the same baseDir.

      1. patch-1098-7-y20.txt
        32 kB
        Hemanth Yamijala
      2. patch-1098-ydist.txt
        26 kB
        Arun C Murthy
      3. patch-1098-7.txt
        32 kB
        Arun C Murthy
      4. patch-1098-7.txt
        31 kB
        Amareshwari Sriramadasu
      5. patch-1098-ydist.txt
        25 kB
        Amareshwari Sriramadasu
      6. patch-1098-6.txt
        31 kB
        Amareshwari Sriramadasu
      7. patch-1098-5.txt
        31 kB
        Amareshwari Sriramadasu
      8. patch-1098-4.txt
        31 kB
        Amareshwari Sriramadasu
      9. patch-1098-3.txt
        29 kB
        Amareshwari Sriramadasu
      10. MAPREDUCE-1098.patch
        16 kB
        Arun C Murthy
      11. MAPREDUCE-1098.patch
        5 kB
        Arun C Murthy
      12. MAPREDUCE-1098.patch
        5 kB
        Arun C Murthy
      13. patch-1098-ydist.txt
        4 kB
        Amareshwari Sriramadasu
      14. patch-1098-2.txt
        4 kB
        Amareshwari Sriramadasu
      15. patch-1098-0.20.txt
        3 kB
        Amareshwari Sriramadasu
      16. patch-1098-1.txt
        4 kB
        Amareshwari Sriramadasu
      17. patch-1098.txt
        1.0 kB
        Amareshwari Sriramadasu

        Issue Links

          Activity

          Hide
          Amareshwari Sriramadasu added a comment -

          Straight forward patch fixing the synchronization. Testing in progress

          Show
          Amareshwari Sriramadasu added a comment - Straight forward patch fixing the synchronization. Testing in progress
          Hide
          Arun C Murthy added a comment -

          -1. This patch incorrectly changes synchronization constructs... you still need to fix deleteCache etc. ?

          Show
          Arun C Murthy added a comment - -1. This patch incorrectly changes synchronization constructs... you still need to fix deleteCache etc. ?
          Hide
          Arun C Murthy added a comment -

          Let me elaborate on my concerns:

          1. This patch doesn't fix TrackerDistributedCacheManager.deleteCache which holds the same global lock and deletes stale cache-files, a big hole.
          2. This patch changes the locking (i.e. lock cachedArchives followed by lcacheStatus) in one and only place in the whole system while keeping it the same everywhere else! This, again, is potentially a bad call. At the very least you need to fix TrackerDistributedCacheManager.deleteCache and justify how changing the locking order is ok!
          Show
          Arun C Murthy added a comment - Let me elaborate on my concerns: This patch doesn't fix TrackerDistributedCacheManager.deleteCache which holds the same global lock and deletes stale cache-files, a big hole. This patch changes the locking (i.e. lock cachedArchives followed by lcacheStatus) in one and only place in the whole system while keeping it the same everywhere else! This, again, is potentially a bad call. At the very least you need to fix TrackerDistributedCacheManager.deleteCache and justify how changing the locking order is ok !
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch moves localizing and deletion of a cache file from of the global lock.
          Verified the locking order for two threads simultaneously localizing and two threads simultaneosly deleting and interesection of them (i.e. partial addition and deletion)
          Manually tested that two jobs could localize files simultaneously, by adding sleep to lcoalize.

          Show
          Amareshwari Sriramadasu added a comment - Patch moves localizing and deletion of a cache file from of the global lock. Verified the locking order for two threads simultaneously localizing and two threads simultaneosly deleting and interesection of them (i.e. partial addition and deletion) Manually tested that two jobs could localize files simultaneously, by adding sleep to lcoalize.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422215/patch-1098-1.txt
          against trunk revision 825465.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12422215/patch-1098-1.txt against trunk revision 825465. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/77/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch for Yahoo! Hadoop branch 0.20 distribution

          Show
          Amareshwari Sriramadasu added a comment - Patch for Yahoo! Hadoop branch 0.20 distribution
          Hide
          Vinod Kumar Vavilapalli added a comment -

          The locking order looks ok to me. Some comments about the patch:

          • in getLocalCache(), refcount should be incremented only after localizeCache(). This was the earlier behavior and we should retain it.
          • Similarly, decrement of baseDirSize should be done only after the filesystem delete.
          Show
          Vinod Kumar Vavilapalli added a comment - The locking order looks ok to me. Some comments about the patch: in getLocalCache(), refcount should be incremented only after localizeCache(). This was the earlier behavior and we should retain it. Similarly, decrement of baseDirSize should be done only after the filesystem delete.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          What about automatic tests? I know it's difficult being a case of testing synchronization of concurrent threads. But do we even have any idea, like some framework for testing concurrency? Thoughts, anyone?

          Show
          Vinod Kumar Vavilapalli added a comment - What about automatic tests? I know it's difficult being a case of testing synchronization of concurrent threads. But do we even have any idea, like some framework for testing concurrency? Thoughts, anyone?
          Hide
          Philip Zeyliger added a comment -

          Todd recently ran JCarder on some parts of JobTracker, which, as far as I understand it, instruments lock acquisition during runtime, and then afterward analyzes it for possible locks and races. It ought to be possible to systematically run a system like that on a fixed set of programs to check that they're behaving. I'll ask him about it.

          Show
          Philip Zeyliger added a comment - Todd recently ran JCarder on some parts of JobTracker, which, as far as I understand it, instruments lock acquisition during runtime, and then afterward analyzes it for possible locks and races. It ought to be possible to systematically run a system like that on a fixed set of programs to check that they're behaving. I'll ask him about it.
          Hide
          Amareshwari Sriramadasu added a comment -

          * in getLocalCache(), refcount should be incremented only after localizeCache(). This was the earlier behavior and we should retain it.

          This can not be moved to the other lock, because if delete happens inbetween the "marking for use" and "localizing", and if reference count is zero, delete will go ahead and delete it. So, reference count should be increment before.

          * Similarly, decrement of baseDirSize should be done only after the filesystem delete.

          Incorporated in the patch

          Show
          Amareshwari Sriramadasu added a comment - * in getLocalCache(), refcount should be incremented only after localizeCache(). This was the earlier behavior and we should retain it. This can not be moved to the other lock, because if delete happens inbetween the "marking for use" and "localizing", and if reference count is zero, delete will go ahead and delete it. So, reference count should be increment before. * Similarly, decrement of baseDirSize should be done only after the filesystem delete. Incorporated in the patch
          Hide
          Vinod Kumar Vavilapalli added a comment -

          The patch looks good to me. +1.

          @Arun, can you quickly look through the patch and see if works for you too?

          Show
          Vinod Kumar Vavilapalli added a comment - The patch looks good to me. +1. @Arun, can you quickly look through the patch and see if works for you too?
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Iyappan already has a setup for manually testing this issue, will find out to see if he can test it.

          Show
          Vinod Kumar Vavilapalli added a comment - Iyappan already has a setup for manually testing this issue, will find out to see if he can test it.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422535/patch-1098-2.txt
          against trunk revision 826565.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12422535/patch-1098-2.txt against trunk revision 826565. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/182/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          -1 tests included.

          It is difficult to write unit test for this change.

          Show
          Amareshwari Sriramadasu added a comment - -1 tests included. It is difficult to write unit test for this change.
          Hide
          Iyappan Srinivasan added a comment -

          +1 from QA for patch-1098-0.20.txt

          1) Brought up cluster, made sure that the file uploaded is around 2 GB (using -files option). Submitted two jobs which acceses these files.
          Before patch, the first job finished uploading the file and then only the second job file's uploading starts, as
          clearly seen from logs. After patch, both upload starts independently, as seen from logs.

          2) Ran sleep jobs and also streaming jobs to test this behaviour.

          3) Ran with one slave cluster and made sure that two jobs access same file/ different file using -files and -cacheFile. In all
          cases it went fine.
          After patch, when different files are given with -files option, then uploading happens independently. When same files are
          provided with -files option, still it happens independently because jt places them on different directories for each job, as seen from the conf file of the job.
          with -cacheFile and with only one TT, the first file is localized by the first job and the second job just access this localized file, as
          soon as the lock over that file is removed.

          Show
          Iyappan Srinivasan added a comment - +1 from QA for patch-1098-0.20.txt 1) Brought up cluster, made sure that the file uploaded is around 2 GB (using -files option). Submitted two jobs which acceses these files. Before patch, the first job finished uploading the file and then only the second job file's uploading starts, as clearly seen from logs. After patch, both upload starts independently, as seen from logs. 2) Ran sleep jobs and also streaming jobs to test this behaviour. 3) Ran with one slave cluster and made sure that two jobs access same file/ different file using -files and -cacheFile. In all cases it went fine. After patch, when different files are given with -files option, then uploading happens independently. When same files are provided with -files option, still it happens independently because jt places them on different directories for each job, as seen from the conf file of the job. with -cacheFile and with only one TT, the first file is localized by the first job and the second job just access this localized file, as soon as the lock over that file is removed.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch for Yahoo! distribution for branch 0.20

          Show
          Amareshwari Sriramadasu added a comment - Patch for Yahoo! distribution for branch 0.20
          Hide
          Arun C Murthy added a comment -

          This patch has a corner-case synchronization bug: it relies on CacheStatus.markForDeletion flag, however getLocalCache and deleteCache could silently corrupt the distributed-cache since they are looking at different CacheStatus objects - there-by rendering the checks based on the CacheStatus.markForDeletion useless.


          The above problems arise since the DistributedCache is currently structured to share the same underlying local file-system path across all CacheStatus objects. Effectively there is a 1-1 mapping between between files on HDFS and their localized counterparts.

          I'm thinking a slightly different solution to the problem exhibited by this patch is to break the 1-1 mapping between files on HDFS and the localized files and get the CacheStatus objects to own the unique localized paths. The proposal is to have a unique CacheStatus.localLoadPath per object and to initialize them via copies from HDFS from src files to unique localized files. Thus we can then continue to keep the current (correct) structure for deleteCache and put smarts in getLocalCache to copy on init of CacheStatus.

          Show
          Arun C Murthy added a comment - This patch has a corner-case synchronization bug: it relies on CacheStatus.markForDeletion flag, however getLocalCache and deleteCache could silently corrupt the distributed-cache since they are looking at different CacheStatus objects - there-by rendering the checks based on the CacheStatus.markForDeletion useless. The above problems arise since the DistributedCache is currently structured to share the same underlying local file-system path across all CacheStatus objects. Effectively there is a 1-1 mapping between between files on HDFS and their localized counterparts. I'm thinking a slightly different solution to the problem exhibited by this patch is to break the 1-1 mapping between files on HDFS and the localized files and get the CacheStatus objects to own the unique localized paths. The proposal is to have a unique CacheStatus.localLoadPath per object and to initialize them via copies from HDFS from src files to unique localized files. Thus we can then continue to keep the current (correct) structure for deleteCache and put smarts in getLocalCache to copy on init of CacheStatus.
          Hide
          Arun C Murthy added a comment -

          Illustrative patch for my earlier comments...

          Show
          Arun C Murthy added a comment - Illustrative patch for my earlier comments...
          Hide
          Hemanth Yamijala added a comment -

          Arun, this may not work as well.

          Basically, the localization code is like this:

          synchronized (cachedArchives) {
            get lcacheStatus
            synchronized (lcacheStatus) {
              increment reference count
            }
          }
          synchronized (lcacheStatus) {
            localize cache
          }
          

          The delete cache code is like this:

          synchronized (cachedArchives) {
            for each lcacheStatus {
              synchronized (lcacheStatus) {
                if (lcacheStatus.refCount == 0) {
                  //
                }
              }
            }
          }
          

          The problem is when iterating to delete, if a localizing thread is localizing a cache file for a particular cache object, the delete thread will wait to acquire the lock on the cache object after acquiring the global lock. Since the localization could take a long time, other threads will be blocked, in effect not solving the problem we trying to.

          Does this make sense ?

          It seems like a very correct approach should not require to lock any object when doing a costly operation like a DFS download. Other threads should wait for a download complete event notification, or some such. But those are sweeping changes.

          One solution Amarsri and I discussed was to see if making the reference count an AtomicInteger would help. Then maybe, its value can be read without having to acquire a lock on the cache status object. So, the delete code will be something like this:

          synchronized (cachedArchives) {
            for each lcacheStatus {
              if (lcacheStatus.atomicReferenceCount.get() == 0) {
                synchronized (lcacheStatus) {
                  // continue operation as in your patch.
                }
              }
            }
          }
          

          Since we are guaranteed that code that's localizing a path will have the reference count as non-zero, it will never try and proceed to the delete operation.

          Could this work ?

          Show
          Hemanth Yamijala added a comment - Arun, this may not work as well. Basically, the localization code is like this: synchronized (cachedArchives) { get lcacheStatus synchronized (lcacheStatus) { increment reference count } } synchronized (lcacheStatus) { localize cache } The delete cache code is like this: synchronized (cachedArchives) { for each lcacheStatus { synchronized (lcacheStatus) { if (lcacheStatus.refCount == 0) { // } } } } The problem is when iterating to delete, if a localizing thread is localizing a cache file for a particular cache object, the delete thread will wait to acquire the lock on the cache object after acquiring the global lock. Since the localization could take a long time, other threads will be blocked, in effect not solving the problem we trying to. Does this make sense ? It seems like a very correct approach should not require to lock any object when doing a costly operation like a DFS download. Other threads should wait for a download complete event notification, or some such. But those are sweeping changes. One solution Amarsri and I discussed was to see if making the reference count an AtomicInteger would help. Then maybe, its value can be read without having to acquire a lock on the cache status object. So, the delete code will be something like this: synchronized (cachedArchives) { for each lcacheStatus { if (lcacheStatus.atomicReferenceCount.get() == 0) { synchronized (lcacheStatus) { // continue operation as in your patch. } } } } Since we are guaranteed that code that's localizing a path will have the reference count as non-zero, it will never try and proceed to the delete operation. Could this work ?
          Hide
          Arun C Murthy added a comment -

          Oh, sure. I think making the reference count atomic is reasonable.

          Show
          Arun C Murthy added a comment - Oh, sure. I think making the reference count atomic is reasonable.
          Hide
          Arun C Murthy added a comment -

          Oh, sure. I think making the reference count atomic is reasonable.

          Owen pointed out today that we don't need to synchronize on CacheStatus to incremement/decrement/check the CacheStatus.refcount since we are synchronized on the global TrackerDistributedCacheManager.cachedArchives lock anyway... we should fix localizeCache to never use CacheStatus.refcount.

          Show
          Arun C Murthy added a comment - Oh, sure. I think making the reference count atomic is reasonable. Owen pointed out today that we don't need to synchronize on CacheStatus to incremement/decrement/check the CacheStatus.refcount since we are synchronized on the global TrackerDistributedCacheManager.cachedArchives lock anyway... we should fix localizeCache to never use CacheStatus.refcount.
          Hide
          Arun C Murthy added a comment -

          Also, seems to me that CacheStatus.currentStatus is fairly useless once we go to the new proposal...

          Show
          Arun C Murthy added a comment - Also, seems to me that CacheStatus.currentStatus is fairly useless once we go to the new proposal...
          Hide
          Arun C Murthy added a comment -

          I was bored enough to hack this a bit more.... an important related point is that we should make sure 'releaseCache' is called on exceptions from getLocalCache (in TaskRunner.run) etc.

          Show
          Arun C Murthy added a comment - I was bored enough to hack this a bit more.... an important related point is that we should make sure 'releaseCache' is called on exceptions from getLocalCache (in TaskRunner.run) etc.
          Hide
          Amareshwari Sriramadasu added a comment -

          Updated patch from Arun. Patch does the following:
          1. changes the key into cacheArchives map to use both uri and modification-time of the file.
          2. changes the localizedPath to be unique for a CacheStatus object.
          3. Added a testcase to test the freshness of the file on FileSystem. If the file gets modified while the job is running, it would fail. If a second job sees different modification time of the file, it will be localized in a different path.

          Show
          Amareshwari Sriramadasu added a comment - Updated patch from Arun. Patch does the following: 1. changes the key into cacheArchives map to use both uri and modification-time of the file. 2. changes the localizedPath to be unique for a CacheStatus object. 3. Added a testcase to test the freshness of the file on FileSystem. If the file gets modified while the job is running, it would fail. If a second job sees different modification time of the file, it will be localized in a different path.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423017/patch-1098-3.txt
          against trunk revision 828979.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423017/patch-1098-3.txt against trunk revision 828979. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/204/console This message is automatically generated.
          Hide
          Hemanth Yamijala added a comment -

          Hmm. Seems like the test failure (though introduced by this patch) is not really the fault of this patch. TestTrackerDistributedCacheManagerWithLinuxTaskController should really do a better job at not running tests if not under the task controller environment without having to add an overridden test case for every test of TestTrackerDistributedCacheManager. That seems a little odd to me. Keeping them in sync is obviously going to be maintenance overhead - as this patch's test failure clearly shows.

          Show
          Hemanth Yamijala added a comment - Hmm. Seems like the test failure (though introduced by this patch) is not really the fault of this patch. TestTrackerDistributedCacheManagerWithLinuxTaskController should really do a better job at not running tests if not under the task controller environment without having to add an overridden test case for every test of TestTrackerDistributedCacheManager. That seems a little odd to me. Keeping them in sync is obviously going to be maintenance overhead - as this patch's test failure clearly shows.
          Hide
          Arun C Murthy added a comment -

          This is looking good, some comments:

          1. The old o.a.h.mapreduce.filecache.DistributedCache.releaseCache calls the new api with 0L as the mtime - that is guaranteed to fail! We should either fix it to use the right mtime, or always use 0 as the mtime (i.e. override getKey()) or throw an exception. Not decrementing the cache silently is bad.
          2. I don't get why TrackerDistributedCacheManager.localizeCache actually deletes the file first... we should never have localized the file. Deleting it might hid a bug.
          3. We have forever had a stupid problem where we localize /foo on HDFS to <jobcache>/foo/foo in the local-fs. This patch continues that by doing some extra work (by calling cacheFilePath(cacheStatus.localLoadPath)). Shouldn't we just fix it? Adding more code for continuing old bad-practices is not useful.
          Show
          Arun C Murthy added a comment - This is looking good, some comments: The old o.a.h.mapreduce.filecache.DistributedCache.releaseCache calls the new api with 0L as the mtime - that is guaranteed to fail! We should either fix it to use the right mtime, or always use 0 as the mtime (i.e. override getKey()) or throw an exception. Not decrementing the cache silently is bad. I don't get why TrackerDistributedCacheManager.localizeCache actually deletes the file first... we should never have localized the file. Deleting it might hid a bug. We have forever had a stupid problem where we localize /foo on HDFS to <jobcache>/foo/foo in the local-fs. This patch continues that by doing some extra work (by calling cacheFilePath(cacheStatus.localLoadPath)). Shouldn't we just fix it? Adding more code for continuing old bad-practices is not useful.
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch with review comments incorporated

          Show
          Amareshwari Sriramadasu added a comment - Patch with review comments incorporated
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423098/patch-1098-4.txt
          against trunk revision 829281.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423098/patch-1098-4.txt against trunk revision 829281. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/208/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          changes the localizedPath to be unique for a CacheStatus object.

          Changed this to use random number instead of CacheStatus object's hashcode, as suggested by Arun.

          Manually tested that two jobs could localize files simultaneously, by adding sleep to localize.

          Show
          Amareshwari Sriramadasu added a comment - changes the localizedPath to be unique for a CacheStatus object. Changed this to use random number instead of CacheStatus object's hashcode, as suggested by Arun. Manually tested that two jobs could localize files simultaneously, by adding sleep to localize.
          Hide
          Arun C Murthy added a comment -

          +1.
          Contingent to hudson's +1 this patch is ready to commit.

          Show
          Arun C Murthy added a comment - +1. Contingent to hudson's +1 this patch is ready to commit.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423109/patch-1098-5.txt
          against trunk revision 829312.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423109/patch-1098-5.txt against trunk revision 829312. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/209/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          Fixed findbugs warning

          Show
          Amareshwari Sriramadasu added a comment - Fixed findbugs warning
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423115/patch-1098-6.txt
          against trunk revision 829312.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423115/patch-1098-6.txt against trunk revision 829312. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/210/console This message is automatically generated.
          Hide
          Amareshwari Sriramadasu added a comment -

          Oops! Uploaded patch has my debug sleep .

          Show
          Amareshwari Sriramadasu added a comment - Oops! Uploaded patch has my debug sleep .
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch for Y!20 distribution.
          test-patch and ant ant test passed with Y!20 patch

          Show
          Amareshwari Sriramadasu added a comment - Patch for Y!20 distribution. test-patch and ant ant test passed with Y!20 patch
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch removing the debug sleep

          Show
          Amareshwari Sriramadasu added a comment - Patch removing the debug sleep
          Hide
          Amareshwari Sriramadasu added a comment -

          test-patch result:

              [exec] 
               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec] 
          
          Show
          Amareshwari Sriramadasu added a comment - test-patch result: [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423126/patch-1098-7.txt
          against trunk revision 829312.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423126/patch-1098-7.txt against trunk revision 829312. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/211/console This message is automatically generated.
          Hide
          Arun C Murthy added a comment -

          Same patch, added a couple of LOG.warn to go with the "<something> != null" checks...

          Show
          Arun C Murthy added a comment - Same patch, added a couple of LOG.warn to go with the "<something> != null" checks...
          Hide
          Arun C Murthy added a comment -

          I just committed this. Thanks, Amareshwari!

          Show
          Arun C Murthy added a comment - I just committed this. Thanks, Amareshwari!
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #98 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/98/)
          . Fixed the distributed-cache to not do i/o while holding a global lock. Contributed by Amareshwari Sriramadasu.

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #98 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/98/ ) . Fixed the distributed-cache to not do i/o while holding a global lock. Contributed by Amareshwari Sriramadasu.
          Hide
          Arun C Murthy added a comment -

          +1 for the y20 patch.

          (I just added the same LOG.warn statements I put in for the trunk patch.)

          Show
          Arun C Murthy added a comment - +1 for the y20 patch. (I just added the same LOG.warn statements I put in for the trunk patch.)
          Hide
          Amareshwari Sriramadasu added a comment -

          Repeated the manual testing with sleeps for Y!20 patch. Ran streaming jobs with -files, -libjars and -archives options. All work fine.

          Show
          Amareshwari Sriramadasu added a comment - Repeated the manual testing with sleeps for Y!20 patch. Ran streaming jobs with -files, -libjars and -archives options. All work fine.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #127 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/127/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #127 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/127/ )
          Hide
          Amareshwari Sriramadasu added a comment -

          This is committed only to trunk.

          Show
          Amareshwari Sriramadasu added a comment - This is committed only to trunk.
          Hide
          Hemanth Yamijala added a comment -

          A more updated version of the patch for earlier hadoop release. Not for commit.

          Show
          Hemanth Yamijala added a comment - A more updated version of the patch for earlier hadoop release. Not for commit.

            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Sreekanth Ramakrishnan
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development