Hadoop Common / HADOOP-1874

lost task trackers -- jobs hang


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

      Description

      This happens on a 1400-node cluster using a recent nightly build patched with HADOOP-1763 (which fixes a previous 'lost task tracker' issue), running a c++-pipes job with 4200 maps and 2800 reduces. Task trackers start to get lost in high numbers as the job nears completion.

      Similar non-pipes jobs do not show the same problem, but it is unclear whether the problem is related to c++-pipes. It could also be dfs overload when reduce tasks close and validate all newly created dfs files. I see dfs client rpc timeout exceptions, but these alone do not explain the escalation in lost task trackers.

      I also noticed that the job tracker becomes rather unresponsive with rpc timeout and call queue overflow exceptions. Job Tracker is running with 60 handlers.
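
      As a rough illustration of how a burst of calls can overflow a bounded call queue served by a fixed number of handlers, here is a toy model. It is only a sketch and is not taken from the Hadoop RPC server code: the function name, the tick-based timing, the arrival rates, and the queue capacities are all hypothetical; only the figure of 60 handlers comes from the report above.

```python
def simulate_call_queue(arrivals_per_tick, handlers, queue_capacity, ticks):
    """Toy model of a bounded RPC call queue (hypothetical, not Hadoop code).

    Each tick, `arrivals_per_tick` calls arrive; calls that do not fit in the
    queue are dropped (an "overflow"). Then each handler completes at most one
    queued call. Returns (served, dropped, still_queued).
    """
    queued = 0
    served = 0
    dropped = 0
    for _ in range(ticks):
        # Accept as many new calls as the remaining queue space allows.
        accepted = min(arrivals_per_tick, queue_capacity - queued)
        dropped += arrivals_per_tick - accepted
        queued += accepted
        # Each handler drains at most one call per tick.
        done = min(handlers, queued)
        queued -= done
        served += done
    return served, dropped, queued


if __name__ == "__main__":
    # Assumed load below the handlers' capacity: nothing is dropped.
    print(simulate_call_queue(50, 60, 200, 10))
    # Assumed burst above capacity: the queue fills and calls start to drop.
    print(simulate_call_queue(100, 60, 200, 10))
```

      In this model, once arrivals exceed what the handlers can drain per tick, the queue fills within a few ticks and every later tick drops the excess. If dropped or timed-out calls are retried by clients, the effective arrival rate rises further, which is one possible way such a condition could escalate.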

        Attachments

        1. 1874.new.patch
          29 kB
          Devaraj Das
        2. 1874.new.patch
          29 kB
          Devaraj Das
        3. 1874.patch
          25 kB
          Devaraj Das
        4. lazy-dfs-ops.1.patch
          18 kB
          Devaraj Das
        5. lazy-dfs-ops.2.patch
          21 kB
          Devaraj Das
        6. lazy-dfs-ops.4.patch
          18 kB
          Devaraj Das
        7. lazy-dfs-ops.patch
          15 kB
          Devaraj Das
        8. server-throttle-hack.patch
          1 kB
          Raghu Angadi

            People

            • Assignee: Devaraj Das (ddas)
            • Reporter: Christian Kunz (ckunz)