Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-82

JobTracker loses it: NoSuchElementException

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 0.1.0
    • None
    • None

    Description

      On a number of occasions, JobTracker goes into a loop that it never recovers from. Over and over it prints the below to the jobtracker log.

      060304 124522 Server handler 5 on 8010 call error: java.io.IOException: java.util.NoSuchElementException
      java.io.IOException: java.util.NoSuchElementException
      at java.util.TreeMap.key(TreeMap.java:433)
      at java.util.TreeMap.firstKey(TreeMap.java:287)
      at java.util.TreeSet.first(TreeSet.java:407)
      at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:428Timed out.org.apache.hadoop.fs.ChecksumException: Checksum error:/2/hadoop/nara/data/tmp/task_r_m80hob/all.1 at 1554810368 at
      org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
      at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at
      java.io.BufferedInputStream.read(BufferedInputStream.java:235) at
      org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:210)
      at java.io.DataInputStream.readInt(DataInputStream.java:353) at
      org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:367) at
      org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:557)
      at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:523)
      at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:511)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:254) at
      org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)

      I added debug to TIP#getTaskToRun so i could tell which TIP had empted its allotment of tasks. Below is extract from jobtracker log that shows sequence of events for TIP tip_fizr7m that lead up to JT losing it:

      060314 203637 Adding task 'task_m_4d6ht0' to tip tip_fizr7m, for tracker 'tracker_41791' on ia109314.archive.org
      060314 204758 Task 'task_m_4d6ht0' has been lost.
      060314 204811 Adding task 'task_m_fb0wf0' to tip tip_fizr7m, for tracker 'tracker_70065' on ia109314.archive.org
      060314 210118 Task 'task_m_fb0wf0' has been lost.
      060314 210119 Adding task 'task_m_irar47' to tip tip_fizr7m, for tracker 'tracker_82285' on ia109324.archive.org
      060314 211541 Taskid 'task_m_irar47' has finished successfully.
      060314 211541 Task 'task_m_irar47' has completed.
      060314 211543 Adding task 'task_m_qo1g69' to tip tip_fizr7m, for tracker 'tracker_97839' on ia109306.archive.org
      060314 213004 Taskid 'task_m_qo1g69' has finished successfully.
      060314 213004 Task 'task_m_qo1g69' has completed.
      060314 213005 Adding task 'task_m_t0lnzk' to tip tip_fizr7m, for tracker 'tracker_57273' on ia109314.archive.org
      060314 214118 Task 'task_m_t0lnzk' has been lost.

      So, we lose two, complete two, then lose a third.

      TIP should have been done on first completion.

      TIP accounting is off.

      Attachments

        1. completes.patch
          0.6 kB
          Michael Stack

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            stack Michael Stack
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment