Hadoop Common / HADOOP-1874

lost task trackers -- jobs hang


    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels:


      This happens on a 1400-node cluster using a recent nightly build patched with HADOOP-1763 (which fixes a previous 'lost task tracker' issue) running a c++-pipes job with 4200 maps and 2800 reduces. The task trackers start to get lost in large numbers near the end of the job.

      Similar non-pipes jobs do not show the same problem, but it is unclear whether the problem is related to c++-pipes. It could also be DFS overload when reduce tasks close and validate all newly created DFS files. I see DFS client RPC timeout exceptions, but these alone do not explain the escalation in lost task trackers.

      I also noticed that the job tracker becomes rather unresponsive, with RPC timeout and call queue overflow exceptions. The job tracker is running with 60 handlers.


        1. 1874.new.patch
          29 kB
          Devaraj Das
        2. 1874.new.patch
          29 kB
          Devaraj Das
        3. 1874.patch
          25 kB
          Devaraj Das
        4. lazy-dfs-ops.1.patch
          18 kB
          Devaraj Das
        5. lazy-dfs-ops.2.patch
          21 kB
          Devaraj Das
        6. lazy-dfs-ops.4.patch
          18 kB
          Devaraj Das
        7. lazy-dfs-ops.patch
          15 kB
          Devaraj Das
        8. server-throttle-hack.patch
          1 kB
          Raghu Angadi

            • Assignee: Devaraj Das (ddas)
            • Reporter: Christian Kunz (ckunz)

