Hadoop Common / HADOOP-5376

JobInProgress.obtainTaskCleanupTask() throws an ArrayIndexOutOfBoundsException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    1. patch-5376.txt
      5 kB
      Amareshwari Sriramadasu
    2. patch-5376-0.19.txt
      14 kB
      Amareshwari Sriramadasu
    3. patch-5376-0.20.txt
      13 kB
      Amareshwari Sriramadasu
    4. patch-5376-1.txt
      14 kB
      Amareshwari Sriramadasu

      Activity

      Hudson added a comment -

      Integrated in Hadoop-trunk #778 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/)
      Hemanth Yamijala added a comment -

      I committed this to trunk and to branches 0.20 and 0.19. Thanks, Amareshwari!

      Vinod Kumar Vavilapalli added a comment -

      Looked at the patches. +1 for all three.

      Amareshwari Sriramadasu added a comment -

      Patch for branch 0.19

      Amareshwari Sriramadasu added a comment -

      The patch applies to branch 0.19 as well if we ignore the testcase changes (the test is not present in branch 0.19).

      Amareshwari Sriramadasu added a comment -

      Patch for branch 0.20

      Eric Yang added a comment -

      Resubmitting the patch to Hudson; the trunk test was broken by HADOOP-5409.

      Amareshwari Sriramadasu added a comment -

      test-patch result :

           [exec]
           [exec] +1 overall.
           [exec]
           [exec]     +1 @author.  The patch does not contain any @author tags.
           [exec]
           [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
           [exec]
           [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
           [exec]
           [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
           [exec]
           [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
           [exec]
           [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
           [exec]
           [exec]
      

      ant test passed on my machine.

      Amareshwari Sriramadasu added a comment -

      Patch with Vinod's comments incorporated.

      The patch also fixes the wrong state displayed for setup/cleanup tasks on a command-line kill. Added testcases for both the command-line kill and the lost tracker.

      Vinod Kumar Vavilapalli added a comment -

      Looked at the patch. A few comments:

      • I think the correct place for the setup/cleanup task check is the lostTaskTracker() method itself. TIP.isRunningTask() doesn't look like the correct method, or at least not the correct method name.
      • The original TestSetupAndCleanupFailure didn't have much javadoc. Can you add it in this patch? In particular, for the TestSetupAndCleanupFailure class itself and the testWithDFS() method.
      • We can write a test that simulates the exact problem in this bug using a lost tracker. This test can live in the same testcase, TestSetupAndCleanupFailure.
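
A rough sketch of the first suggestion, placing the setup/cleanup check in the lost-tracker handling itself. This is not the committed patch: the class `LostTrackerGuardSketch`, the `Task` record and the method `onTrackerLost` are hypothetical names made up for illustration, and the real lostTaskTracker() method does considerably more than this.

```java
import java.util.ArrayList;
import java.util.List;

public class LostTrackerGuardSketch {
    /** Hypothetical task record; the real TaskInProgress carries far more state. */
    static final class Task {
        final int id;
        final boolean setupOrCleanup;

        Task(int id, boolean setupOrCleanup) {
            this.id = id;
            this.setupOrCleanup = setupOrCleanup;
        }
    }

    /**
     * Hypothetical stand-in for the lost-tracker handling: returns the ids of
     * tasks that should get a task-cleanup attempt once their tracker is lost.
     * Setup/cleanup tasks are skipped here, at the point where the tracker is
     * declared lost, instead of being filtered by a helper on the task itself.
     */
    static List<Integer> onTrackerLost(List<Task> runningOnTracker) {
        List<Integer> needsCleanup = new ArrayList<>();
        for (Task task : runningOnTracker) {
            if (task.setupOrCleanup) {
                continue; // never queue a setup/cleanup task for task cleanup
            }
            needsCleanup.add(task.id);
        }
        return needsCleanup;
    }

    public static void main(String[] args) {
        List<Task> running = new ArrayList<>();
        running.add(new Task(5, false));    // an ordinary map task
        running.add(new Task(7800, true));  // the job-cleanup task
        System.out.println("queued for cleanup: " + onTrackerLost(running));
    }
}
```

A lost-tracker test along the lines of the third comment could then assert that the setup/cleanup task never shows up in the queued list.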
      Amareshwari Sriramadasu added a comment -

      Patch fixing the bug.

      Vinod Kumar Vavilapalli added a comment -

      In an offline discussion, Amareshwari explained that this issue occurs when a setup/cleanup task is running on a TT that subsequently becomes lost, and the task moves to the KILLED_UNCLEAN state. This causes the setup/cleanup task to be incorrectly added to the list of tasks that need cleanup.

      Confirmed the same from the logs.

      2009-02-28 23:05:41,986 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200902261046_9662_m_007800_0' to tip task_200902261046_9662_m_007800, for tracker '<tracker_host:port>'
      2009-02-28 23:17:14,800 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200902261046_9662_m_007800_0: Lost task tracker: <tracker_host:port>
      

      As a result, the job's cleanup task got stuck: it is shown as pending on the JT UI, and no subsequent attempts are launched for it, so the job hangs. I tried killing the cleanup attempt from the client command line, thinking it might get rescheduled, but that failed with the message "Could not kill task attempt_200902261046_9662_m_007800_0". Even killing the job didn't work.

      Vinod Kumar Vavilapalli added a comment -

      Here's the trace:

      2009-02-28 23:17:15,029 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 52000, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@243ff467, false, false, true
       7584) from <ip:port>:error: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 7800
       java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 7800
               at org.apache.hadoop.mapred.JobInProgress.obtainTaskCleanupTask(JobInProgress.java:1001)
               at org.apache.hadoop.mapred.JobTracker.getSetupAndCleanupTasks(JobTracker.java:2622)
               at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2307)
               at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
               at java.lang.reflect.Method.invoke(Method.java:597)
               at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
               at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
               at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
               at java.security.AccessController.doPrivileged(Native Method)
               at javax.security.auth.Subject.doAs(Subject.java:396)
               at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
      

      I saw this while running jobs with 7800 maps. Regular map tasks are numbered 0 through 7799, so the ID 7800 corresponds to a job-cleanup task.

      In an offline discussion, Amareshwari explained that this issue occurs when a setup/cleanup task is running on a TT that subsequently becomes lost, and the task moves to the KILLED_UNCLEAN state. This causes the setup/cleanup task to be incorrectly added to the list of tasks that need cleanup.
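
The failure mode described above can be reproduced in miniature. The following is a simplified model, not the JobInProgress code: the class `CleanupIndexSketch`, its `Job` type and all method names are hypothetical, and it assumes that the job-cleanup task takes the id immediately after the last map and that the cleanup lookup indexes an array holding only the regular map tasks, which is what the trace suggests.

```java
import java.util.ArrayList;
import java.util.List;

public class CleanupIndexSketch {
    /** Hypothetical stand-in for a job: holds only the regular map tasks. */
    static final class Job {
        final String[] maps; // valid indices are 0 .. numMaps - 1
        final List<Integer> needsCleanup = new ArrayList<>();

        Job(int numMaps) {
            maps = new String[numMaps];
            for (int i = 0; i < numMaps; i++) {
                maps[i] = "map-" + i;
            }
        }

        /** Assumed numbering: the job-cleanup task gets the id right after the last map. */
        int jobCleanupId() {
            return maps.length;
        }

        /** Models the buggy path: a lost tracker queues whatever task was running on it. */
        void trackerLost(int taskId) {
            needsCleanup.add(taskId);
        }

        /** Models the lookup: resolves the queued id against the map array. */
        String obtainTaskCleanupTask() {
            int id = needsCleanup.remove(0);
            return maps[id]; // out of bounds when id == maps.length
        }
    }

    /** Returns true if resolving the queued id fails with an out-of-bounds error. */
    static boolean lookupFails(int numMaps, int lostTaskId) {
        Job job = new Job(numMaps);
        job.trackerLost(lostTaskId);
        try {
            job.obtainTaskCleanupTask();
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Job job = new Job(7800);
        System.out.println("regular map lost: fails=" + lookupFails(7800, 42));
        System.out.println("job-cleanup lost: fails=" + lookupFails(7800, job.jobCleanupId()));
    }
}
```

In this model a lost regular map resolves normally, while the job-cleanup id falls one past the end of the array, matching the index reported in the exception.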


        People

        • Assignee: Amareshwari Sriramadasu
        • Reporter: Vinod Kumar Vavilapalli
        • Votes: 0
        • Watchers: 2
