Hadoop Common / HADOOP-1874

lost task trackers -- jobs hang

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

    Description

      This happens on a 1400-node cluster using a recent nightly build patched with HADOOP-1763 (which fixes a previous 'lost task tracker' issue), running a c++-pipes job with 4200 maps and 2800 reduces. Task trackers start to get lost in large numbers as the job nears completion.

      Similar non-pipes jobs do not show the same problem, but it is unclear whether the problem is related to c++-pipes. It could also be dfs overload when reduce tasks close and validate all newly created dfs files. I see dfs client rpc timeout exceptions, but these alone do not explain the escalation in lost task trackers.

      I also noticed that the job tracker becomes rather unresponsive, with rpc timeout and call queue overflow exceptions. The job tracker is running with 60 handlers.

      Attachments

        1. 1874.new.patch (29 kB, Devaraj Das)
        2. 1874.new.patch (29 kB, Devaraj Das)
        3. 1874.patch (25 kB, Devaraj Das)
        4. lazy-dfs-ops.1.patch (18 kB, Devaraj Das)
        5. lazy-dfs-ops.2.patch (21 kB, Devaraj Das)
        6. lazy-dfs-ops.4.patch (18 kB, Devaraj Das)
        7. lazy-dfs-ops.patch (15 kB, Devaraj Das)
        8. server-throttle-hack.patch (1 kB, Raghu Angadi)

            People

              Assignee: Devaraj Das (ddas)
              Reporter: Christian Kunz (ckunz)
              Votes: 0
              Watchers: 1
