  1. Hadoop Map/Reduce
  2. MAPREDUCE-6474

ShuffleHandler can possibly exhaust nodemanager file descriptors

Details

    • Reviewed

    Description

      The asynchronous nature of the ShuffleHandler can cause it to open a huge number of
      file descriptors; when the NodeManager runs out of them, it crashes.

      Scenario:
      Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
      Let's say all 6K reduces hit a node at about the same time asking for their
      outputs. Each reducer will ask for all 40 map outputs over a single socket in a
      single request (not necessarily all 40 at once, but with coalescing it is
      likely to be a large number).

      sendMapOutput() will open the file for random reading and then perform an async
      transfer of the requested portion of that file. In theory this can happen
      6000*40=240000 times, which will run the NM out of file descriptors and cause it
      to crash.

      The algorithm should be refactored a little to not open the fds until they're
      actually needed.
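
The idea of deferring the open can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patch attached to this issue: the class `LazyShuffleSketch`, its `Connection` type, and the counters are hypothetical stand-ins, and a simple counter takes the place of real file descriptors. It compares opening every requested map output up front against opening the next output only after the previous transfer on the same connection has completed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical simulation; a counter stands in for real file descriptors. */
final class LazyShuffleSketch {
    static int openNow = 0;   // descriptors currently "open"
    static int peakOpen = 0;  // highest value openNow has reached

    static void open()  { openNow++; peakOpen = Math.max(peakOpen, openNow); }
    static void close() { openNow--; }

    /** One reducer connection with a queue of map outputs still to send. */
    static final class Connection {
        private final Queue<String> pending = new ArrayDeque<>();
        private boolean inFlight = false;

        Connection(int mapOutputs) {
            for (int i = 0; i < mapOutputs; i++) pending.add("map_" + i);
        }

        /** Problem pattern: open every output before any transfer completes. */
        void openAllUpFront() {
            while (pending.poll() != null) open();
        }

        /**
         * Deferred pattern: meant to be called when the previous transfer
         * completes. Closes the finished file, then opens only the next one.
         * Returns false once nothing is left to send.
         */
        boolean sendNext() {
            if (inFlight) { close(); inFlight = false; }
            if (pending.poll() == null) return false;
            open();
            inFlight = true;
            return true;
        }
    }

    /** Peak number of simultaneously open descriptors for the scenario. */
    static int peakFor(int reducers, int mapsPerNode, boolean lazy) {
        openNow = 0;
        peakOpen = 0;
        List<Connection> conns = new ArrayList<>();
        for (int i = 0; i < reducers; i++) conns.add(new Connection(mapsPerNode));
        if (!lazy) {
            for (Connection c : conns) c.openAllUpFront();
            return peakOpen;
        }
        boolean progress = true;
        while (progress) {            // each pass models one completion per connection
            progress = false;
            for (Connection c : conns) progress |= c.sendNext();
        }
        return peakOpen;
    }

    public static void main(String[] args) {
        System.out.println("eager peak: " + peakFor(6000, 40, false));
        System.out.println("lazy peak:  " + peakFor(6000, 40, true));
    }
}
```

With the numbers from the scenario, the eager variant peaks at 6000*40=240000 simulated descriptors, while the deferred variant never holds more than one per connection, so it peaks at 6000. In a real handler the call to `sendNext()` would presumably be driven by the completion callback of the asynchronous transfer rather than by a loop.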

      Attachments

        1. YARN-2410-v1.patch
          18 kB
          Kuhu Shukla
        2. YARN-2410-v10.patch
          15 kB
          Kuhu Shukla
        3. YARN-2410-v11.patch
          16 kB
          Kuhu Shukla
        4. YARN-2410-v2.patch
          18 kB
          Kuhu Shukla
        5. YARN-2410-v3.patch
          18 kB
          Kuhu Shukla
        6. YARN-2410-v4.patch
          11 kB
          Kuhu Shukla
        7. YARN-2410-v5.patch
          16 kB
          Kuhu Shukla
        8. YARN-2410-v6.patch
          16 kB
          Kuhu Shukla
        9. YARN-2410-v7.patch
          15 kB
          Kuhu Shukla
        10. YARN-2410-v8.patch
          14 kB
          Kuhu Shukla
        11. YARN-2410-v9.patch
          15 kB
          Kuhu Shukla

            People

              kshukla Kuhu Shukla
              nroberts Nathan Roberts
              Votes: 0
              Watchers: 13