Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-82

JobTracker loses it: NoSuchElementException

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 0.1.0
    • None
    • None

    Description

      On a number of occasions, JobTracker goes into a loop that it never recovers from. Over and over it prints the below to the jobtracker log.

      060304 124522 Server handler 5 on 8010 call error: java.io.IOException: java.util.NoSuchElementException
      java.io.IOException: java.util.NoSuchElementException
      at java.util.TreeMap.key(TreeMap.java:433)
      at java.util.TreeMap.firstKey(TreeMap.java:287)
      at java.util.TreeSet.first(TreeSet.java:407)
      at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:428Timed out.org.apache.hadoop.fs.ChecksumException: Checksum error:/2/hadoop/nara/data/tmp/task_r_m80hob/all.1 at 1554810368 at
      org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
      at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at
      java.io.BufferedInputStream.read(BufferedInputStream.java:235) at
      org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:210)
      at java.io.DataInputStream.readInt(DataInputStream.java:353) at
      org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:367) at
      org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:557)
      at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:523)
      at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:511)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:254) at
      org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)

      I added debug to TIP#getTaskToRun so i could tell which TIP had empted its allotment of tasks. Below is extract from jobtracker log that shows sequence of events for TIP tip_fizr7m that lead up to JT losing it:

      060314 203637 Adding task 'task_m_4d6ht0' to tip tip_fizr7m, for tracker 'tracker_41791' on ia109314.archive.org
      060314 204758 Task 'task_m_4d6ht0' has been lost.
      060314 204811 Adding task 'task_m_fb0wf0' to tip tip_fizr7m, for tracker 'tracker_70065' on ia109314.archive.org
      060314 210118 Task 'task_m_fb0wf0' has been lost.
      060314 210119 Adding task 'task_m_irar47' to tip tip_fizr7m, for tracker 'tracker_82285' on ia109324.archive.org
      060314 211541 Taskid 'task_m_irar47' has finished successfully.
      060314 211541 Task 'task_m_irar47' has completed.
      060314 211543 Adding task 'task_m_qo1g69' to tip tip_fizr7m, for tracker 'tracker_97839' on ia109306.archive.org
      060314 213004 Taskid 'task_m_qo1g69' has finished successfully.
      060314 213004 Task 'task_m_qo1g69' has completed.
      060314 213005 Adding task 'task_m_t0lnzk' to tip tip_fizr7m, for tracker 'tracker_57273' on ia109314.archive.org
      060314 214118 Task 'task_m_t0lnzk' has been lost.

      So, we lose two, complete two, then lose a third.

      TIP should have been done on first completion.

      TIP accounting is off.

      Attachments

        1. completes.patch
          0.6 kB
          Michael Stack

        Activity

          People

            Unassigned Unassigned
            stack Michael Stack
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: