Hadoop Common / HADOOP-5376

JobInProgress.obtainTaskCleanupTask() throws an ArrayIndexOutOfBoundsException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    1. patch-5376-0.19.txt (14 kB, Amareshwari Sriramadasu)
    2. patch-5376-0.20.txt (13 kB, Amareshwari Sriramadasu)
    3. patch-5376-1.txt (14 kB, Amareshwari Sriramadasu)
    4. patch-5376.txt (5 kB, Amareshwari Sriramadasu)

      Activity

      Transition                  Time In Source Status  Execution Times  Last Executer     Last Execution Date
      Patch Available -> Open     10h 26m                1                Eric Yang         05/Mar/09 21:33
      Open -> Patch Available     2d 23h 28m             2                Eric Yang         05/Mar/09 21:33
      Patch Available -> Resolved 14h 5m                 1                Hemanth Yamijala  06/Mar/09 11:38
      Resolved -> Closed          48d 7h 39m             1                Nigel Daley       23/Apr/09 20:18
      Owen O'Malley made changes -
      Component/s mapred [ 12310690 ]
      Nigel Daley made changes -
      Status Resolved [ 5 ] Closed [ 6 ]
      Hudson added a comment -

      Integrated in Hadoop-trunk #778 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/)

      Hemanth Yamijala made changes -
      Status Patch Available [ 10002 ] Resolved [ 5 ]
      Hadoop Flags [Reviewed]
      Resolution Fixed [ 1 ]
      Hemanth Yamijala added a comment -

      I committed this to trunk and to branches 0.20 and 0.19. Thanks, Amareshwari!

      Vinod Kumar Vavilapalli added a comment -

      Looked at the patches. +1 for all three.

      Amareshwari Sriramadasu made changes -
      Attachment patch-5376-0.19.txt [ 12401601 ]
      Amareshwari Sriramadasu added a comment -

      Patch for branch 0.19

      Amareshwari Sriramadasu added a comment -

      The patch also applies to branch 0.19 if we ignore the test case changes (the test is not present in branch 0.19).

      Amareshwari Sriramadasu made changes -
      Fix Version/s 0.19.2 [ 12313650 ]
      Amareshwari Sriramadasu made changes -
      Attachment patch-5376-0.20.txt [ 12401585 ]
      Amareshwari Sriramadasu added a comment -

      Patch for branch 0.20

      Eric Yang made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Eric Yang made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Eric Yang added a comment -

      Resubmitting the patch to Hudson; the trunk test was broken by HADOOP-5409.

      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Amareshwari Sriramadasu added a comment -

      test-patch result :

           [exec]
           [exec] +1 overall.
           [exec]
           [exec]     +1 @author.  The patch does not contain any @author tags.
           [exec]
           [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
           [exec]
           [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
           [exec]
           [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
           [exec]
           [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
           [exec]
           [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
           [exec]
           [exec]
      

      ant test passed on my machine

      Amareshwari Sriramadasu made changes -
      Attachment patch-5376-1.txt [ 12401510 ]
      Amareshwari Sriramadasu added a comment -

      Patch with Vinod's comments incorporated.

      The patch also fixes the wrong state displayed for setup/cleanup tasks on command-line kill. Added test cases for both command-line kill and lostTracker.

      Hemanth Yamijala made changes -
      Fix Version/s 0.20.0 [ 12313438 ]
      Vinod Kumar Vavilapalli added a comment -

      Looked at the patch. A few comments:

      • I think the correct place for the setup/cleanup task check is the lostTaskTracker() method itself. TIP.isRunningTask() doesn't look like the correct method, or at least not the correct method name.
      • The original TestSetupAndCleanupFailure didn't have much javadoc; can you add it in this patch? In particular, for the TestSetupAndCleanupFailure class itself and the testWithDFS() method.
      • We can write a test simulating the exact problem in this bug using a lost tracker. We can have this test in the same test case, TestSetupAndCleanupFailure.
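
The first review point can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed patch: the class `LostTrackerStateSketch`, the method `killStateOnLostTracker`, and the two-value `State` enum are hypothetical stand-ins, and `KILLED` is assumed to be the state used when no task cleanup is required. The point is that the lost-tracker handler picks the state itself, so a job setup/cleanup attempt is never marked KILLED_UNCLEAN.

```java
class LostTrackerStateSketch {
    /** Hypothetical two-value stand-in for the task states discussed in this issue. */
    enum State { KILLED, KILLED_UNCLEAN }

    /**
     * Picks the state for an attempt whose tracker was lost.
     * Only a running, regular (non setup/cleanup) attempt needs a
     * follow-up task-cleanup attempt, which KILLED_UNCLEAN signals here.
     */
    static State killStateOnLostTracker(boolean wasRunning,
                                        boolean isJobSetup,
                                        boolean isJobCleanup) {
        boolean needsTaskCleanup = wasRunning && !isJobSetup && !isJobCleanup;
        return needsTaskCleanup ? State.KILLED_UNCLEAN : State.KILLED;
    }

    public static void main(String[] args) {
        // Regular running map/reduce attempt on the lost tracker.
        System.out.println(killStateOnLostTracker(true, false, false)); // KILLED_UNCLEAN
        // Job-cleanup attempt running on the lost tracker.
        System.out.println(killStateOnLostTracker(true, false, true));  // KILLED
        // Attempt that was not running when the tracker was lost.
        System.out.println(killStateOnLostTracker(false, false, false)); // KILLED
    }
}
```

Keeping the decision in one small method also makes it easy to cover in a unit test alongside the lost-tracker simulation suggested in the third point.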
      Amareshwari Sriramadasu made changes -
      Attachment patch-5376.txt [ 12401380 ]
      Amareshwari Sriramadasu added a comment -

      Patch fixing the bug.

      Jothi Padmanabhan made changes -
      Priority Major [ 3 ] Blocker [ 1 ]
      Amareshwari Sriramadasu made changes -
      Field Original Value New Value
      Assignee Amareshwari Sriramadasu [ amareshwari ]
      Vinod Kumar Vavilapalli added a comment -

      In an offline discussion, Amareshwari explained that this issue occurs if a setup/cleanup task is running on a TT that subsequently becomes lost, and the task moves to the KILLED_UNCLEAN state. This causes the setup/cleanup task to be incorrectly added to the list of tasks that need cleanup.

      Confirmed the same from the logs.

      2009-02-28 23:05:41,986 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200902261046_9662_m_007800_0' to tip task_200902261046_9662_m_007800, for tracker '<tracker_host:port>'
      2009-02-28 23:17:14,800 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200902261046_9662_m_007800_0: Lost task tracker: <tracker_host:port>
      

      The result is that the job's cleanup task got stuck; it is shown in the pending state on the JT UI. No subsequent attempts are launched for the cleanup task, and the job hangs. I tried killing the cleanup attempt from the client command line, thinking it might get rescheduled, but that fails with the message "Could not kill task attempt_200902261046_9662_m_007800_0". Even killing the job didn't work.

      Vinod Kumar Vavilapalli added a comment -

      Here's the trace:

      2009-02-28 23:17:15,029 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 52000, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@243ff467, false, false, true
       7584) from <ip:port>:error: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 7800
       java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 7800
               at org.apache.hadoop.mapred.JobInProgress.obtainTaskCleanupTask(JobInProgress.java:1001)
               at org.apache.hadoop.mapred.JobTracker.getSetupAndCleanupTasks(JobTracker.java:2622)
               at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2307)
               at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
               at java.lang.reflect.Method.invoke(Method.java:597)
               at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
               at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
               at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
               at java.security.AccessController.doPrivileged(Native Method)
               at javax.security.auth.Subject.doAs(Subject.java:396)
               at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
      

      I saw this while running jobs with 7800 maps. The id 7800 thus corresponds to the job-cleanup task.

      In an offline discussion, Amareshwari explained that this issue occurs if a setup/cleanup task is running on a TT that subsequently becomes lost, and the task moves to the KILLED_UNCLEAN state. This causes the setup/cleanup task to be incorrectly added to the list of tasks that need cleanup.
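
The failure mode described above can be reproduced in isolation. The sketch below is a simplified model under stated assumptions, not Hadoop code: `CleanupIndexSketch`, `Tip`, `queueForTaskCleanup`, and `lookup` are hypothetical names, and the model assumes regular map tasks occupy indices 0 to N-1 of an array while the job-cleanup task carries id N. Queuing the job-cleanup task for task cleanup then leads to an out-of-range lookup, and a guard at queue time avoids it.

```java
import java.util.ArrayList;
import java.util.List;

class CleanupIndexSketch {
    /** Hypothetical stand-in for a task-in-progress; not a Hadoop class. */
    static final class Tip {
        final int id;
        final boolean jobSetupOrCleanup;

        Tip(int id, boolean jobSetupOrCleanup) {
            this.id = id;
            this.jobSetupOrCleanup = jobSetupOrCleanup;
        }
    }

    /** Queues a task id for task cleanup, skipping job setup/cleanup tasks. */
    static boolean queueForTaskCleanup(Tip tip, List<Integer> pendingCleanupIds) {
        if (tip.jobSetupOrCleanup) {
            return false; // not stored in the regular task array, so never queue it
        }
        pendingCleanupIds.add(tip.id);
        return true;
    }

    /** Resolves a queued id against the regular map-task array; throws if out of range. */
    static Tip lookup(Tip[] maps, int id) {
        return maps[id];
    }

    public static void main(String[] args) {
        int numMaps = 7800;
        Tip[] maps = new Tip[numMaps];
        for (int i = 0; i < numMaps; i++) {
            maps[i] = new Tip(i, false);
        }
        // Assumption of this model: the job-cleanup task gets id == numMaps.
        Tip jobCleanup = new Tip(numMaps, true);

        // Without the guard, the queued id is later used as an array index.
        try {
            lookup(maps, jobCleanup.id);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed for id " + jobCleanup.id);
        }

        // With the guard, the job-cleanup task is never queued.
        List<Integer> pending = new ArrayList<>();
        System.out.println("cleanup task queued: " + queueForTaskCleanup(jobCleanup, pending));
        System.out.println("map task queued: " + queueForTaskCleanup(maps[42], pending));
    }
}
```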

      Vinod Kumar Vavilapalli created issue -

        People

        • Assignee: Amareshwari Sriramadasu
        • Reporter: Vinod Kumar Vavilapalli
        • Votes: 0
        • Watchers: 2
