Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 2.5.0
- None
- Reviewed
Description
Because the ShuffleHandler serves requests asynchronously, it can open a huge
number of file descriptors; when the NodeManager (NM) runs out of them, it crashes.
Scenario:
A job with 6K reduces, slow start set to 0.95, and about 40 map outputs per node.
Suppose all 6K reducers hit a node at about the same time, asking for their
outputs. Each reducer asks for all 40 map outputs over a single socket in a
single request (not necessarily all 40 at once, but with coalescing it is
likely to be a large number).
sendMapOutput() opens the file for random-access reading and then performs an
async transfer of the relevant portion of that file. In theory this happens
6000 * 40 = 240,000 times, which runs the NM out of file descriptors and causes it to crash.
The algorithm should be refactored a little to not open the fds until they're
actually needed.
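To illustrate the suggested refactoring, here is a minimal Java sketch of deferring the open until a transfer actually starts. It is not the ShuffleHandler code: the class DeferredOpenSketch and its methods enqueue(), drain(), maxOpenSeen() and tempFileWith() are hypothetical names, the transfer is synchronous rather than async, and it assumes one segment is sent at a time per connection. Queued requests hold only a path, offset and length, so 40 pending map outputs cost no descriptors until each one is sent.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch; none of these names come from the ShuffleHandler.
final class DeferredOpenSketch {

    // A requested slice of a map-output file. Holds a path, not an open descriptor.
    private static final class Segment {
        final Path file;
        final long offset;
        final long length;

        Segment(Path file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }
    }

    private final Queue<Segment> pending = new ArrayDeque<>();
    private int openNow = 0;      // descriptors currently held by this connection
    private int maxOpenSeen = 0;  // high-water mark, for illustration only

    // Record the request; no file is opened here.
    void enqueue(Path file, long offset, long length) {
        pending.add(new Segment(file, offset, length));
    }

    int maxOpenSeen() {
        return maxOpenSeen;
    }

    // Send queued segments one at a time. Each file is opened just before its
    // transfer and closed right after, so at most one descriptor is held.
    long drain(OutputStream out) {
        WritableByteChannel sink = Channels.newChannel(out);
        long total = 0;
        Segment s;
        while ((s = pending.poll()) != null) {
            try (FileChannel ch = FileChannel.open(s.file, StandardOpenOption.READ)) {
                openNow++;
                maxOpenSeen = Math.max(maxOpenSeen, openNow);
                try {
                    total += copy(ch, s, sink);
                } finally {
                    openNow--;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }

    // transferTo may move fewer bytes than requested, so loop until done.
    private static long copy(FileChannel ch, Segment s, WritableByteChannel sink)
            throws IOException {
        long pos = s.offset;
        long remaining = s.length;
        while (remaining > 0) {
            long n = ch.transferTo(pos, remaining, sink);
            if (n <= 0) {
                break; // segment extends past the end of the file
            }
            pos += n;
            remaining -= n;
        }
        return s.length - remaining;
    }

    // Helper for trying the sketch: writes data to a temporary file.
    static Path tempFileWith(byte[] data) {
        try {
            Path p = Files.createTempFile("mapout", ".bin");
            Files.write(p, data);
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, the number of descriptors in use is bounded by the number of connections being served at once, not by connections multiplied by map outputs per request. A real fix would also have to keep the transfer asynchronous and might allow a small configurable number of open files per connection.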
Attachments
Issue Links
- is part of: BIGTOP-1403 Add YARN-2410 to Longevity tests. (Open)