Hadoop Common
  1. Hadoop Common
  2. HADOOP-4041

IsolationRunner does not work as documented

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.21.0
    • Component/s: documentation
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a bug in IsolationRunner to make it work for map tasks.

      Description

      IsolationRunner does not work as documented in the tutorial.

      The tutorial says "To use the IsolationRunner, first set keep.failed.tasks.files to true (also see keep.tasks.files.pattern)."

      Should be:
      keep.failed.task.files (not tasks)

      After the above was set (quoted from my message on hadoop-core):
      > After the task
      > hung, I failed it via the web interface. Then I went to the node that was
      > running this task
      >
      > $ cd ...local/taskTracker/jobcache/job_200808071645_0001/work
      > (this path is already different from the tutorial's)
      >
      > $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
      > Exception in thread "main" java.lang.NullPointerException
      > at
      > org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:164)
      >
      > Looking at IsolationRunner code, I see this:
      >
      > 164 File workDirName = new File(lDirAlloc.getLocalPathToRead(
      > 165 TaskTracker.getJobCacheSubdir()
      > 166 + Path.SEPARATOR + taskId.getJobID()
      > 167 + Path.SEPARATOR + taskId
      > 168 + Path.SEPARATOR + "work",
      > 169 conf). toString());
      >
      > I.e. it assumes there is supposed to be a taskID subdirectory under the job
      > dir, but:
      > $ pwd
      > ...mapred/local/taskTracker/jobcache/job_200808071645_0001
      > $ ls
      > jars job.xml work
      >
      > – it's not there.

      1. HADOOP-4041-v4-y20.patch
        39 kB
        Hemanth Yamijala
      2. org.apache.hadoop.fs.LocalDirAllocator.html
        6 kB
        Philip Zeyliger
      3. HADOOP-4041-v4.patch
        39 kB
        Philip Zeyliger
      4. HADOOP-4041-v3.patch
        38 kB
        Philip Zeyliger
      5. HADOOP-4041-v2.patch
        39 kB
        Philip Zeyliger
      6. hadoop-4041.patch
        10 kB
        Tom White

        Issue Links

          Activity

          Hide
          Owen O'Malley added a comment -

          Sigh We really need a unit test that tests the isolation runner to catch when it breaks. It looks like it was broken when the task directories moved around. If you fix it, please create a patch and submit it. Thanks!

          Show
          Owen O'Malley added a comment - Sigh We really need a unit test that tests the isolation runner to catch when it breaks. It looks like it was broken when the task directories moved around. If you fix it, please create a patch and submit it. Thanks!
          Hide
          Yuri Pradkin added a comment -

          OK, tutorial documentation is still bungled, but at least IsolationRunner is not throwing exceptions now (-r699451).

          What I see now is:

            cd mapred/local/taskTracker/jobcache/job_.../attempt_..._r_../work
            hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
          

          After that it's going into what seems an endless loop trying to fetch the data (which presumably is already there in ../output/map_0.out:

            08/09/26 12:55:13 INFO mapred.IsolationRunner: Task attempt_200809261208_0003_r_000001_1_1222456110621 making progress to 0.0 and state of reduce > copy >
            ...
          

          Is someone using I.R. successfully? Can you please comment?

          Show
          Yuri Pradkin added a comment - OK, tutorial documentation is still bungled, but at least IsolationRunner is not throwing exceptions now (-r699451). What I see now is: cd mapred/local/taskTracker/jobcache/job_.../attempt_..._r_../work hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml After that it's going into what seems an endless loop trying to fetch the data (which presumably is already there in ../output/map_0.out: 08/09/26 12:55:13 INFO mapred.IsolationRunner: Task attempt_200809261208_0003_r_000001_1_1222456110621 making progress to 0.0 and state of reduce > copy > ... Is someone using I.R. successfully? Can you please comment?
          Hide
          Tom White added a comment -

          I've been trying out IsolationRunner and I think it is broken in several ways.

          1. It doesn't construct a valid classpath due to changes in the directory layout as Owen mentioned.
          2. For reduce tasks, it doesn't fill in the missing map outputs (throws a DiskErrorException).
          3. For reduce tasks, even if map outputs are there the reduce task never exits as Yuri observed.

          I've produced a test for IsolationRunner which tests 2. and 3., and a fix for 1. and 2. (So the test just hangs at the moment, which exposes 3.) See the attached patch.

          I'm not sure how to fix 3. The task spawns MapOutputCopiers which go into a wait state forever. Should IsolationRunner even be copying map outputs from map nodes, or should it just be using the files it has locally? Any ideas how to fix this?

          As a future improvement it would be better if IsolationRunner could share code with TaskRunner so its behaviour is as faithful as possible.

          Show
          Tom White added a comment - I've been trying out IsolationRunner and I think it is broken in several ways. 1. It doesn't construct a valid classpath due to changes in the directory layout as Owen mentioned. 2. For reduce tasks, it doesn't fill in the missing map outputs (throws a DiskErrorException). 3. For reduce tasks, even if map outputs are there the reduce task never exits as Yuri observed. I've produced a test for IsolationRunner which tests 2. and 3., and a fix for 1. and 2. (So the test just hangs at the moment, which exposes 3.) See the attached patch. I'm not sure how to fix 3. The task spawns MapOutputCopiers which go into a wait state forever. Should IsolationRunner even be copying map outputs from map nodes, or should it just be using the files it has locally? Any ideas how to fix this? As a future improvement it would be better if IsolationRunner could share code with TaskRunner so its behaviour is as faithful as possible.
          Hide
          Philip Zeyliger added a comment -

          I'm going to take a stab at this this week.

          Show
          Philip Zeyliger added a comment - I'm going to take a stab at this this week.
          Hide
          Philip Zeyliger added a comment -

          Attaching a patch.

          I updated Tom's previous patch a little bit to get IsolationRunner to work for map tasks. TestIsolationRunner passes. I'm still running the other tests.

          I've also been testing this manually:

          $ bin/hadoop jar build/hadoop-0.21.0-dev-examples.jar fail -D keep.failed.task.files=true -failMappers
          [lots of noise]
          $ bin/hadoop org.apache.hadoop.mapred.IsolationRunner /tmp/hadoop-philip-trunk/mapred/local/taskTracker/jobcache/job_200905251539_0001/attempt_200905251539_0001_m_000000_0/job.xml
          09/05/25 15:41:26 INFO mapred.MapTask: io.sort.mb = 100
          09/05/25 15:41:26 INFO mapred.MapTask: data buffer = 79691776/99614720
          09/05/25 15:41:26 INFO mapred.MapTask: record buffer = 262144/327680
          Exception in thread "main" java.lang.RuntimeException: Intentional map failure
          	at org.apache.hadoop.examples.FailJob$FailMapper.map(FailJob.java:53)
          	at org.apache.hadoop.examples.FailJob$FailMapper.map(FailJob.java:48)
          	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
          	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:528)
          	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:310)
          	at org.apache.hadoop.mapred.IsolationRunner.run(IsolationRunner.java:190)
          	at org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:202)
          

          (The failure when re-run is what I'd expect, since the map always fails. This is much better than, say, a ClassNotFound exception of some sort, which would indicate IsolationRunner not working.)

          I had to rejigger TaskRunner a bit to be able to share code for generation of the classpath. I suspect that there's still some funny business not happening for users of the DistributedCache. I haven't dug in deeply there.

          I'd like to propose that we open a separate JIRA for IsolationRunner for reduce tasks. Reducers have to contact mappers to get the intermediate data, and, frankly, that's quite messy. I believe it requires interacting with the job tracker, and that seems like a lot of dependencies for a tool that in theory runs in isolation. So I'd like to get this fixed for mappers first and then tackle reducers separately.

          – Philip

          Show
          Philip Zeyliger added a comment - Attaching a patch. I updated Tom's previous patch a little bit to get IsolationRunner to work for map tasks. TestIsolationRunner passes. I'm still running the other tests. I've also been testing this manually: $ bin/hadoop jar build/hadoop-0.21.0-dev-examples.jar fail -D keep.failed.task.files=true -failMappers [lots of noise] $ bin/hadoop org.apache.hadoop.mapred.IsolationRunner /tmp/hadoop-philip-trunk/mapred/local/taskTracker/jobcache/job_200905251539_0001/attempt_200905251539_0001_m_000000_0/job.xml 09/05/25 15:41:26 INFO mapred.MapTask: io.sort.mb = 100 09/05/25 15:41:26 INFO mapred.MapTask: data buffer = 79691776/99614720 09/05/25 15:41:26 INFO mapred.MapTask: record buffer = 262144/327680 Exception in thread "main" java.lang.RuntimeException: Intentional map failure at org.apache.hadoop.examples.FailJob$FailMapper.map(FailJob.java:53) at org.apache.hadoop.examples.FailJob$FailMapper.map(FailJob.java:48) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:528) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:310) at org.apache.hadoop.mapred.IsolationRunner.run(IsolationRunner.java:190) at org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:202) (The failure when re-run is what I'd expect, since the map always fails. This is much better than, say, a ClassNotFound exception of some sort, which would indicate IsolationRunner not working.) I had to rejigger TaskRunner a bit to be able to share code for generation of the classpath. I suspect that there's still some funny business not happening for users of the DistributedCache. I haven't dug in deeply there. I'd like to propose that we open a separate JIRA for IsolationRunner for reduce tasks. Reducers have to contact mappers to get the intermediate data, and, frankly, that's quite messy. I believe it requires interacting with the job tracker, and that seems like a lot of dependencies for a tool that in theory runs in isolation. So I'd like to get this fixed for mappers first and then tackle reducers separately. – Philip
          Hide
          Tom White added a comment -

          +1

          I agree that we should commit this and have a separate Jira for reducers. Having a unit test for mappers is a step forward from where we are today and will prevent regressions in IsolationRunner (or at least those to do with classpath).

          Thanks for picking this up Philip.

          Show
          Tom White added a comment - +1 I agree that we should commit this and have a separate Jira for reducers. Having a unit test for mappers is a step forward from where we are today and will prevent regressions in IsolationRunner (or at least those to do with classpath). Thanks for picking this up Philip.
          Hide
          Philip Zeyliger added a comment -

          Attaching new patch: fixed a javadoc warning and made an inner class static (findbugs warning).

          Note that for readability, I introduced LocalDirAllocator.SIZE_UNKNOWN == 1 (a final int), so that some lines wouldn't be as opaque.

          • Philip
          Show
          Philip Zeyliger added a comment - Attaching new patch: fixed a javadoc warning and made an inner class static (findbugs warning). Note that for readability, I introduced LocalDirAllocator.SIZE_UNKNOWN == 1 (a final int), so that some lines wouldn't be as opaque. Philip
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409182/HADOOP-4041-v3.patch
          against trunk revision 779944.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 496 release audit warnings (more than the trunk's current 494 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409182/HADOOP-4041-v3.patch against trunk revision 779944. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 release audit. The applied patch generated 496 release audit warnings (more than the trunk's current 494 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/artifact/trunk/current/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/427/console This message is automatically generated.
          Hide
          Philip Zeyliger added a comment -

          Looks like nothing major from the build bot. TestQueueCapacities.testMultipleQueues is failing on trunk as well: http://hudson.zones.apache.org/hudson/view/Hadoop/job/Hadoop-trunk/849/ . TestCombineFileInputFormat also throws "java.lang.ClassNotFoundException: org.apache.hadoop.mapred.lib.TestCombineFileInputFormat" with a clean build as well, so I think trunk is broken there. Recent commits moved that around, so probably a missing reference.

          I've fixed (I believe) the release audit warnings (forgot Apache headers in a new test file) and the javadoc warnings. Uploading a new patch shortly.

          Show
          Philip Zeyliger added a comment - Looks like nothing major from the build bot. TestQueueCapacities.testMultipleQueues is failing on trunk as well: http://hudson.zones.apache.org/hudson/view/Hadoop/job/Hadoop-trunk/849/ . TestCombineFileInputFormat also throws "java.lang.ClassNotFoundException: org.apache.hadoop.mapred.lib.TestCombineFileInputFormat" with a clean build as well, so I think trunk is broken there. Recent commits moved that around, so probably a missing reference. I've fixed (I believe) the release audit warnings (forgot Apache headers in a new test file) and the javadoc warnings. Uploading a new patch shortly.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409431/HADOOP-4041-v4.patch
          against trunk revision 780114.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 496 release audit warnings (more than the trunk's current 495 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409431/HADOOP-4041-v4.patch against trunk revision 780114. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 release audit. The applied patch generated 496 release audit warnings (more than the trunk's current 495 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/current/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/console This message is automatically generated.
          Show
          Giridharan Kesavan added a comment - use this link to see the relaseaudit warnings on the patch build: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/441/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt tnx
          Hide
          Philip Zeyliger added a comment -

          I'm a bit puzzled as to what

          122d121
          <      [java]  !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-780114_HADOOP-4041_PATCH-12409431/docs/jdiff/changes/org.apache.hadoop.fs.LocalDirAllocator.html
          

          actually means?

          I've attached LocalDirAllocator.html, if that's of use.

          Show
          Philip Zeyliger added a comment - I'm a bit puzzled as to what 122d121 < [java] !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-780114_HADOOP-4041_PATCH-12409431/docs/jdiff/changes/org.apache.hadoop.fs.LocalDirAllocator.html actually means? I've attached LocalDirAllocator.html, if that's of use.
          Hide
          Tom White added a comment -

          I think this is saying that the HTML file for LocalDirAllocator generated by JDiff doesn't have an Apache license. It probably makes sense to exclude the JDiff files from the release audit tool.

          Otherwise this looks good. I'd like to commit it soon.

          Show
          Tom White added a comment - I think this is saying that the HTML file for LocalDirAllocator generated by JDiff doesn't have an Apache license. It probably makes sense to exclude the JDiff files from the release audit tool. Otherwise this looks good. I'd like to commit it soon.
          Hide
          Philip Zeyliger added a comment -

          Should this get committed?

          Show
          Philip Zeyliger added a comment - Should this get committed?
          Hide
          Tom White added a comment -

          I've just committed this. Thanks Philip!

          Show
          Tom White added a comment - I've just committed this. Thanks Philip!
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #869 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/869/ )
          Hide
          Hemanth Yamijala added a comment -

          Patch against Hadoop 0.20 version (not for commit)

          Show
          Hemanth Yamijala added a comment - Patch against Hadoop 0.20 version (not for commit)

            People

            • Assignee:
              Philip Zeyliger
              Reporter:
              Yuri Pradkin
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development