Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5198

Race condition in cleanup during task tracker renint with LinuxTaskController

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: tasktracker
    • Labels:
      None

      Description

      This was noticed when job tracker would be restarted while jobs were running and would ask the task tracker to reinitialize.

      Tasktracker would fail with an error like

      013-04-27 20:19:09,627 INFO org.apache.hadoop.mapred.TaskTracker: Good mapred local directories are: /grid/0/hdp/mapred/local,/grid/1/hdp/mapred/local,/grid/2/hdp/mapred/local,/grid/3/hdp/mapred/local,/grid/4/hdp/mapred/local,/grid/5/hdp/mapred/local
      2013-04-27 20:19:09,628 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 42075 caught: java.nio.channels.ClosedChannelException
      	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
      	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
      	at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1717)
      	at org.apache.hadoop.ipc.Server.access$2000(Server.java:98)
      	at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:744)
      	at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:808)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1433)
      
      2013-04-27 20:19:09,628 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 42075: exiting
      2013-04-27 20:19:10,414 ERROR org.apache.hadoop.mapred.TaskTracker: Got fatal exception while reinitializing TaskTracker: org.apache.hadoop.util.Shell$ExitCodeException: 
      	at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
      	at org.apache.hadoop.util.Shell.run(Shell.java:182)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
      	at org.apache.hadoop.mapred.LinuxTaskController.deleteAsUser(LinuxTaskController.java:281)
      	at org.apache.hadoop.mapred.TaskTracker.deleteUserDirectories(TaskTracker.java:779)
      	at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:816)
      	at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2704)
      	at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3934)
      

        Activity

        Hide
        Arun C Murthy added a comment -

        Good catch Arpit Gupta!

        Show
        Arun C Murthy added a comment - Good catch Arpit Gupta !
        Hide
        Arun C Murthy added a comment -

        I think we should just make a simple change to TT so that it doesn't get TIP to cleanup when it's re-init'ing. That should fix the race.

        Show
        Arun C Murthy added a comment - I think we should just make a simple change to TT so that it doesn't get TIP to cleanup when it's re-init'ing. That should fix the race.
        Hide
        Arpit Gupta added a comment -

        Arun C Murthy

        Thanks for the suggestion. Patch attached.

        Also made changes to LinuxTaskController.java to log when there is a failure in delete.

        We will now see logs like

        id/4/hdp/mapred/local,/grid/5/hdp/mapred/local
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.LinuxTaskController: deleteAsUser: [/usr/lib/hadoop/libexec/../bin/task-controller, hrt_qa, /grid/0/hdp/mapred/local,/grid/1/hdp/mapred/local,/grid/2/hdp/mapred/local,/grid/3/hdp/mapred/local,/grid/4/hdp/mapred/local,/grid/5/hdp/mapred/local, 3, ]
        2013-04-30 23:05:37,033 WARN org.apache.hadoop.mapred.LinuxTaskController: Exit code from task is : 255
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.LinuxTaskController: Output from deleteAsUser LinuxTaskController:
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Reading task controller config from /etc/hadoop/conf.empty/taskcontroller.cfg
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: main : command provided 3
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: main : user is hrt_qa
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Good mapred-local-dirs are /grid/0/hdp/mapred/local,/grid/1/hdp/mapred/local,/grid/2/hdp/mapred/local,/grid/3/hdp/mapred/local,/grid/4/hdp/mapred/local,/grid/5/hdp/mapred/local
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Unreadable directory /grid/4/hdp/mapred/local/taskTracker/hrt_qa/jobcache/job_201304302257_0002/attempt_201304302257_0002_m_000022_0. Skipping..
        2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Couldn't delete directory /grid/5/hdp/mapred/local/taskTracker/hrt_qa/jobcache/job_201304302257_0002/attempt_201304302257_0002_m_000022_0/output - No such file or directory
        2013-04-30 23:05:37,034 INFO org.apache.hadoop.mapred.TaskController: rmdir of /grid/5/hdp/mapred/local/taskTracker/hrt_qa/ failed - Directory not empty
        2013-04-30 23:05:37,034 ERROR org.apache.hadoop.mapred.TaskTracker: Got fatal exception while reinitializing TaskTracker: org.apache.hadoop.util.Shell$ExitCodeException:
                at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
                at org.apache.hadoop.util.Shell.run(Shell.java:182)
                at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
                at org.apache.hadoop.mapred.LinuxTaskController.deleteAsUser(LinuxTaskController.java:281)
                at org.apache.hadoop.mapred.TaskTracker.deleteUserDirectories(TaskTracker.java:779)
                at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:816)
                at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2704)
                at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3934)
        
        Show
        Arpit Gupta added a comment - Arun C Murthy Thanks for the suggestion. Patch attached. Also made changes to LinuxTaskController.java to log when there is a failure in delete. We will now see logs like id/4/hdp/mapred/local,/grid/5/hdp/mapred/local 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.LinuxTaskController: deleteAsUser: [/usr/lib/hadoop/libexec/../bin/task-controller, hrt_qa, /grid/0/hdp/mapred/local,/grid/1/hdp/mapred/local,/grid/2/hdp/mapred/local,/grid/3/hdp/mapred/local,/grid/4/hdp/mapred/local,/grid/5/hdp/mapred/local, 3, ] 2013-04-30 23:05:37,033 WARN org.apache.hadoop.mapred.LinuxTaskController: Exit code from task is : 255 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.LinuxTaskController: Output from deleteAsUser LinuxTaskController: 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Reading task controller config from /etc/hadoop/conf.empty/taskcontroller.cfg 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: main : command provided 3 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: main : user is hrt_qa 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Good mapred-local-dirs are /grid/0/hdp/mapred/local,/grid/1/hdp/mapred/local,/grid/2/hdp/mapred/local,/grid/3/hdp/mapred/local,/grid/4/hdp/mapred/local,/grid/5/hdp/mapred/local 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Unreadable directory /grid/4/hdp/mapred/local/taskTracker/hrt_qa/jobcache/job_201304302257_0002/attempt_201304302257_0002_m_000022_0. Skipping.. 2013-04-30 23:05:37,033 INFO org.apache.hadoop.mapred.TaskController: Couldn't delete directory /grid/5/hdp/mapred/local/taskTracker/hrt_qa/jobcache/job_201304302257_0002/attempt_201304302257_0002_m_000022_0/output - No such file or directory 2013-04-30 23:05:37,034 INFO org.apache.hadoop.mapred.TaskController: rmdir of /grid/5/hdp/mapred/local/taskTracker/hrt_qa/ failed - Directory not empty 2013-04-30 23:05:37,034 ERROR org.apache.hadoop.mapred.TaskTracker: Got fatal exception while reinitializing TaskTracker: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:255) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.mapred.LinuxTaskController.deleteAsUser(LinuxTaskController.java:281) at org.apache.hadoop.mapred.TaskTracker.deleteUserDirectories(TaskTracker.java:779) at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:816) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2704) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3934)
        Hide
        Arpit Gupta added a comment -

        I ran tests with this patch and it did not run into the race. Did not add unit tests as not easy to reproduce.

        Show
        Arpit Gupta added a comment - I ran tests with this patch and it did not run into the race. Did not add unit tests as not easy to reproduce.
        Hide
        Karthik Kambatla (Inactive) added a comment -

        Thanks Arpit. Is there a way to reproduce the issue, may be by adding explicit sleeps etc.?

        Otherwise, the patch looks mostly good:

        Nit:

        1. LTC#deleteAsUser() LTC#deleteLogAsUser() have unrelated (guess auto-formatting) changes
        Show
        Karthik Kambatla (Inactive) added a comment - Thanks Arpit. Is there a way to reproduce the issue, may be by adding explicit sleeps etc.? Otherwise, the patch looks mostly good: Nit: LTC#deleteAsUser() LTC#deleteLogAsUser() have unrelated (guess auto-formatting) changes
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks Arpit!

        PS: I added a javadoc to the new ttReInit param for TT.TIP.jobHasFinished during the commit.

        Show
        Arun C Murthy added a comment - I just committed this. Thanks Arpit! PS: I added a javadoc to the new ttReInit param for TT.TIP.jobHasFinished during the commit.
        Hide
        Arun C Murthy added a comment -

        Karthik Kambatla, sorry - I committed before seeing your comments. Let's file a follow up. I agree with Arpit that reproducing this is super hard in a unit test since we also need LTC. Thanks.

        Show
        Arun C Murthy added a comment - Karthik Kambatla , sorry - I committed before seeing your comments. Let's file a follow up. I agree with Arpit that reproducing this is super hard in a unit test since we also need LTC. Thanks.
        Hide
        Arpit Gupta added a comment -

        Karthik Kambatla thanks for the review.

        Nit:

        LTC#deleteAsUser() LTC#deleteLogAsUser() have unrelated (guess auto-formatting) changes

        Are you talking about

        String[] command = new String[] { taskControllerExe, user,
                localStorage.getDirsString(),
                Integer.toString(Commands.DELETE_LOG_AS_USER.getValue()), subDir };
        

        The rest of the changes were related to improving the logging when there was a failure deleteAsUser and deleteLogAsUser.

        Show
        Arpit Gupta added a comment - Karthik Kambatla thanks for the review. Nit: LTC#deleteAsUser() LTC#deleteLogAsUser() have unrelated (guess auto-formatting) changes Are you talking about String [] command = new String [] { taskControllerExe, user, localStorage.getDirsString(), Integer .toString(Commands.DELETE_LOG_AS_USER.getValue()), subDir }; The rest of the changes were related to improving the logging when there was a failure deleteAsUser and deleteLogAsUser.
        Hide
        Karthik Kambatla (Inactive) added a comment -

        That is right, Arpit. Anyways, I guess we can let it be now that it has been committed No worries, Arun.

        However, I am just curious to know if there is a way to reproduce this. What were the circumstances you ran into this? I understand it being hard to write a test for.

        Show
        Karthik Kambatla (Inactive) added a comment - That is right, Arpit. Anyways, I guess we can let it be now that it has been committed No worries, Arun. However, I am just curious to know if there is a way to reproduce this. What were the circumstances you ran into this? I understand it being hard to write a test for.
        Hide
        Arpit Gupta added a comment -

        However, I am just curious to know if there is a way to reproduce this. What were the circumstances you ran into this? I understand it being hard to write a test for.

        We noticed this while running tests for job recovery and we noticed on a secure cluster number of task trackers in the cluster would go down eventually to 0 causing tests to hang. These tests are running mr, pig, hive oozie etc and in the back ground jobtracker would be restarted.

        Show
        Arpit Gupta added a comment - However, I am just curious to know if there is a way to reproduce this. What were the circumstances you ran into this? I understand it being hard to write a test for. We noticed this while running tests for job recovery and we noticed on a secure cluster number of task trackers in the cluster would go down eventually to 0 causing tests to hang. These tests are running mr, pig, hive oozie etc and in the back ground jobtracker would be restarted.
        Hide
        Karthik Kambatla (Inactive) added a comment -

        Thanks Arpit.

        Show
        Karthik Kambatla (Inactive) added a comment - Thanks Arpit.
        Hide
        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.

        Show
        Matt Foley added a comment - Closed upon release of Hadoop 1.2.0.

          People

          • Assignee:
            Arpit Gupta
            Reporter:
            Arpit Gupta
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development