Hadoop Common / HADOOP-1874

lost task trackers -- jobs hang


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

    Description

      This happens on a 1400 node cluster using a recent nightly build patched with HADOOP-1763 (that fixes a previous 'lost task tracker' issue) running a c++-pipes job with 4200 maps and 2800 reduces. The task trackers start to get lost in high numbers at the end of job completion.

      Similar non-pipes jobs do not show the same problem, but it is unclear whether the problem is related to c++-pipes. It could also be dfs overload when reduce tasks close and validate all newly created dfs files. I see dfs client rpc timeout exceptions, but these alone do not explain the escalation in lost task trackers.

      I also noticed that the job tracker becomes rather unresponsive, with rpc timeout and call queue overflow exceptions. The job tracker is running with 60 handlers.

      Attachments

        1. 1874.new.patch (29 kB, Devaraj Das)
        2. 1874.new.patch (29 kB, Devaraj Das)
        3. 1874.patch (25 kB, Devaraj Das)
        4. lazy-dfs-ops.1.patch (18 kB, Devaraj Das)
        5. lazy-dfs-ops.2.patch (21 kB, Devaraj Das)
        6. lazy-dfs-ops.4.patch (18 kB, Devaraj Das)
        7. lazy-dfs-ops.patch (15 kB, Devaraj Das)
        8. server-throttle-hack.patch (1 kB, Raghu Angadi)


          People

            Assignee: Devaraj Das (ddas)
            Reporter: Christian Kunz (ckunz)
