Hadoop Common / HADOOP-5285

JobTracker hangs for long periods of time

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.19.1
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      On one of the larger clusters of 2000 nodes, the JT hung quite often, sometimes for 10-15 minutes at a time and once for one and a half hours. The stack trace shows that JobInProgress.obtainTaskCleanupTask() is waiting for the lock on the JobInProgress object, which JobInProgress.initTasks() holds for a long time while waiting on DFS operations.
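
      A minimal sketch of the locking pattern being described (illustrative only, not the actual Hadoop 0.19 source; the Task type is a placeholder):

      import java.io.IOException;

      class Task {}  // placeholder for org.apache.hadoop.mapred.Task

      class JobInProgress {

        // Runs in the job-initialization thread. While this method holds the
        // JobInProgress monitor, any DFS stall (for example while reading
        // job.split) keeps the lock held for the full duration of the stall.
        public synchronized void initTasks() throws IOException {
          // ... read job.split and other job files from DFS; this can block
          // for minutes if a datanode is slow or unreachable ...
        }

        // Per the description above, this is called while the caller already
        // holds the JobTracker lock. Because it synchronizes on the same
        // JobInProgress object, it blocks behind initTasks(), and the whole
        // JT appears hung for as long as the DFS operation takes.
        public synchronized Task obtainTaskCleanupTask() {
          // ... hand out a pending task-cleanup task, if any ...
          return null;
        }
      }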

      1. 5285.1.patch
        13 kB
        Vinod Kumar Vavilapalli
      2. 5285.patch
        13 kB
        Devaraj Das
      3. trace.txt
        255 kB
        Vinod Kumar Vavilapalli


          Activity

          Vinod Kumar Vavilapalli added a comment -

          In an offline discussion with Devaraj, Amareshwari and Hemanth, the suggested fix is to make JobInProgress.obtainTaskCleanupTask() similar to other methods like JobInProgress.obtainJobSetupTask() and not lock the JobInProgress if the job is not yet inited. The other suggestion was to move all the DFS operations in the JT that might result in locking the JT into a separate thread; at present the only operation that needs to be moved is the job cleanup in JobTracker.finalizeJob().
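
          A minimal sketch of the first suggestion, assuming an AtomicBoolean "tasksInited" flag that initTasks() sets once initialization completes (field and method names here are illustrative, not the exact patch):

          import java.util.concurrent.atomic.AtomicBoolean;

          class Task {}  // placeholder type

          class JobInProgress {
            private final AtomicBoolean tasksInited = new AtomicBoolean(false);

            public void initTasks() {
              synchronized (this) {
                // ... long-running DFS reads while holding the lock ...
              }
              tasksInited.set(true);
            }

            // Return immediately, without touching the JobInProgress lock,
            // while the job is still being initialized - mirroring how
            // obtainJobSetupTask() behaves.
            public Task obtainTaskCleanupTask() {
              if (!tasksInited.get()) {
                return null;
              }
              synchronized (this) {
                // ... hand out a pending task-cleanup task, if any ...
                return null;
              }
            }
          }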

          Vinod Kumar Vavilapalli added a comment -

          I just realized that we might also want to fix HADOOP-4375 as part of this jira; this is related to locking of JT when accessing CompletedJobStore.

          Vinod Kumar Vavilapalli added a comment -

          After recovering from the hang, the JT logged the following. It looks like initTasks() was blocked by a DFS issue:

          2009-02-19 09:52:53,132 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200902190419_0129_r_000178_0' has completed task_200902190419_0129_r_000178 successfully.
          2009-02-19 10:03:29,445 WARN org.apache.hadoop.hdfs.DFSClient: Exception while reading from blk_2044238107768440002_840946 of /mapredsystem/hadoopqa/mapredsystem/job_200902190419_0419/job.split from <host:port>: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<host:port> remote=/<host:port>]
                  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
                  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
                  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
                  at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                  at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
                  at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
                  at java.io.DataInputStream.read(DataInputStream.java:132)
                  at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
                  at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1222)
                  at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
                  at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
                  at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
                  at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
                  at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1075)
                  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1630)
                  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1680)
                  at java.io.DataInputStream.readFully(DataInputStream.java:178)
                  at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:154)
                  at org.apache.hadoop.mapred.JobClient$RawSplit.readFields(JobClient.java:983)
                  at org.apache.hadoop.mapred.JobClient.readSplitFile(JobClient.java:1060)
                  at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:402)
                  at org.apache.hadoop.mapred.EagerTaskInitializationListener$JobInitThread.run(EagerTaskInitializationListener.java:55)
                  at java.lang.Thread.run(Thread.java:619)
          
          2009-02-19 10:03:29,446 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2044238107768440002_840946 from any node:  java.io.IOException: No live nodes contain current block
          

          Also attaching the trace of JT when one such situation happened.

          Devaraj Das added a comment -

          Attaching a patch. The patch
          1) Fixes obtainTaskCleanupTask to check whether the job has been inited before it tries to take a lock on the JobInProgress
          2) Moves the CleanupQueue class outside the TaskTracker into a generic class; the JobTracker also uses it for queueing up deletes during job finalization (a rough sketch of the queueing idea is included below)
          3) Moves the references to completedJobStore outside the block where the JobTracker is locked.
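
          A rough sketch of the queueing idea in item 2), assuming a daemon thread that drains a queue of pending deletions (the real class deletes Hadoop Paths via FileSystem; a Runnable stands in for the delete here, and all names are illustrative):

          import java.util.concurrent.LinkedBlockingQueue;

          class CleanupQueue {
            private final LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>();

            public CleanupQueue() {
              Thread cleaner = new Thread(new Runnable() {
                public void run() {
                  while (true) {
                    try {
                      queue.take().run();   // perform one queued deletion
                    } catch (InterruptedException ie) {
                      return;               // shut down when interrupted
                    } catch (RuntimeException re) {
                      // log and keep the cleanup thread alive
                    }
                  }
                }
              }, "path-cleanup");
              cleaner.setDaemon(true);      // the constructor starts it as a daemon
              cleaner.start();
            }

            // Enqueue a deletion and return immediately, so a caller such as
            // JobTracker.finalizeJob() never blocks on DFS while holding a lock.
            public void addToQueue(Runnable deletion) {
              queue.add(deletion);
            }
          }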

          Devaraj Das added a comment -

          Both ant test and test-patch passed locally.

          Vinod Kumar Vavilapalli added a comment -

          Attaching patch with minor changes:

          • Documented CleanupQueue's constructor, saying that it is a singleton, that it does cleanup in a separate thread, and that the constructor itself starts the thread as a daemon.
          • Removed an unused variable 'events' in JobTracker.getTaskCompletionEvents()
          Devaraj Das added a comment -

          I just committed this.

          Vinod Kumar Vavilapalli added a comment -

          HADOOP-4375 is fixed as part of this issue.

          Hudson added a comment -

          Integrated in Hadoop-trunk #763 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/763/)
          HADOOP-5285. Adding a file that I missed in my earlier commit.
          HADOOP-5285. Fixes the issues - (1) obtainTaskCleanupTask checks whether the job is inited before trying to lock the JobInProgress (2) Moves the CleanupQueue class outside the TaskTracker and makes it a generic class that is used by the JobTracker also for deleting the paths on the job's output fs. (3) Moves the references to completedJobStore outside the block where the JobTracker is locked. Contributed by Devaraj Das.

          Jacob Leverich added a comment -

          Note that this affects 0.19.1 as well. I just spent a good deal of time diagnosing this and came to the same resolution (check for taskInited.get() in obtainTaskCleanupTask).

          Devaraj Das added a comment -

          I committed this to the 0.19 branch as well.

          Hudson added a comment -

          Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/)

          Tsz Wo Nicholas Sze added a comment -

          The CleanupQueue.PathCleanupThread causes some problems in TestMiniMRWithDFS. See HADOOP-5483.


            People

             • Assignee: Devaraj Das
             • Reporter: Vinod Kumar Vavilapalli
             • Votes: 0
             • Watchers: 1
