Details
-
Bug
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
None
-
None
-
None
Description
On a number of occasions, JobTracker goes into a loop that it never recovers from. Over and over it prints the below to the jobtracker log.
060304 124522 Server handler 5 on 8010 call error: java.io.IOException: java.util.NoSuchElementException
java.io.IOException: java.util.NoSuchElementException
at java.util.TreeMap.key(TreeMap.java:433)
at java.util.TreeMap.firstKey(TreeMap.java:287)
at java.util.TreeSet.first(TreeSet.java:407)
at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:428Timed out.org.apache.hadoop.fs.ChecksumException: Checksum error:/2/hadoop/nara/data/tmp/task_r_m80hob/all.1 at 1554810368 at
org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at
java.io.BufferedInputStream.read(BufferedInputStream.java:235) at
org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:210)
at java.io.DataInputStream.readInt(DataInputStream.java:353) at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:367) at
org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:557)
at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:523)
at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:511)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:254) at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
I added debug to TIP#getTaskToRun so i could tell which TIP had empted its allotment of tasks. Below is extract from jobtracker log that shows sequence of events for TIP tip_fizr7m that lead up to JT losing it:
060314 203637 Adding task 'task_m_4d6ht0' to tip tip_fizr7m, for tracker 'tracker_41791' on ia109314.archive.org
060314 204758 Task 'task_m_4d6ht0' has been lost.
060314 204811 Adding task 'task_m_fb0wf0' to tip tip_fizr7m, for tracker 'tracker_70065' on ia109314.archive.org
060314 210118 Task 'task_m_fb0wf0' has been lost.
060314 210119 Adding task 'task_m_irar47' to tip tip_fizr7m, for tracker 'tracker_82285' on ia109324.archive.org
060314 211541 Taskid 'task_m_irar47' has finished successfully.
060314 211541 Task 'task_m_irar47' has completed.
060314 211543 Adding task 'task_m_qo1g69' to tip tip_fizr7m, for tracker 'tracker_97839' on ia109306.archive.org
060314 213004 Taskid 'task_m_qo1g69' has finished successfully.
060314 213004 Task 'task_m_qo1g69' has completed.
060314 213005 Adding task 'task_m_t0lnzk' to tip tip_fizr7m, for tracker 'tracker_57273' on ia109314.archive.org
060314 214118 Task 'task_m_t0lnzk' has been lost.
So, we lose two, complete two, then lose a third.
TIP should have been done on first completion.
TIP accounting is off.