REEF-42

Extra YARN container causing unexpected memory reservations



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: REEF-Runtime-YARN
    • Labels:


      Our cluster has 4 nodes, each with 13.67 GB of memory. When we launch Surf (a long-running job) with 4 evaluators (7 GB each), the available memory becomes 6.67 GB on three nodes and 5.67 GB on the fourth (the AM takes 1 GB). However, an extra container request (7 GB) hangs at the RM, for the following reasons.

      1. Because Surf is a long-running job, the evaluators that have been allocated do not exit, so no room is freed for the extra container. If there were room, REEF would have been notified of the allocation of the extra container and would have released it right away.
      2. To avoid YARN-314, we currently never send a 0-container request, which is what would in effect remove the hanging extra container request.

      As a result, the RM keeps trying to allocate the hanging request indefinitely, reserving 7 GB on each node. The Memory Reserved metric therefore increases and the Memory Available metric decreases.

      The same thing happens when we explicitly request more than the cluster's capacity, say 5 evaluators of 8 GB each. The difference is that the reservation caused by the extra container is unpredictable.
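The arithmetic behind both scenarios can be checked with a small sketch. This is a toy model only: the function names are hypothetical, and the "place on the node with the most free memory" rule is an assumption chosen because it reproduces the figures in this report, not a description of how a YARN scheduler actually picks nodes.

```python
# Toy placement model; all names here are hypothetical, not REEF or YARN APIs.
# Memory is tracked in hundredths of a GB to avoid floating-point noise.

NODE_CAPACITY = 1367  # 13.67 GB per node, from the report
AM_SIZE = 100         # 1 GB application master, from the report


def place(free, size):
    """Place one container on the node with the most free memory (assumed policy).

    Returns the node index, or None if no node can hold the container.
    """
    node = max(range(len(free)), key=lambda i: free[i])
    if free[node] < size:
        return None
    free[node] -= size
    return node


def simulate(container_size, count, nodes=4):
    """Place the AM, then `count` containers; return (free memory, unplaced count)."""
    free = [NODE_CAPACITY] * nodes
    place(free, AM_SIZE)
    unplaced = sum(place(free, container_size) is None for _ in range(count))
    return free, unplaced
```

Under these assumptions, `simulate(700, 4)` places everything and leaves 5.67 GB free on one node and 6.67 GB on the other three, while `simulate(700, 5)` and `simulate(800, 5)` each leave exactly one request that fits on no node, which is the request the RM would keep trying to satisfy.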

      Brian Cho and I discussed the tradeoff between the following options.

      1. Send 0-container requests and address YARN-314 differently by adding another indirection atop AMRMClient or replacing it altogether
      2. Wait until YARN-314 is resolved since our case is not common and can be discovered and fixed by the system administrator
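To illustrate why a 0-container request would clear the hanging request, here is a toy model of a pending-request table. It assumes that a request carries an absolute container count rather than a delta, and that entries are keyed by priority and location only, which is our reading of the limitation YARN-314 describes. The class and method names are hypothetical and are not part of AMRMClient or any YARN API.

```python
# Toy model of a scheduler's pending-ask table; hypothetical names, not the YARN API.

class PendingAsks:
    def __init__(self):
        # One entry per (priority, location). The container size is not part
        # of the key, which is the assumed YARN-314 limitation.
        self._asks = {}

    def update(self, priority, location, memory_gb, count):
        """Record an absolute container count; a count of 0 clears the ask."""
        key = (priority, location)
        if count == 0:
            self._asks.pop(key, None)
        else:
            self._asks[key] = (memory_gb, count)

    def outstanding(self):
        """Return a copy of the asks that are still pending."""
        return dict(self._asks)
```

In this model, `update(1, "*", 7, 0)` after `update(1, "*", 7, 1)` leaves nothing outstanding, so there is nothing left to reserve memory for. It also shows the cost of the shared key: a second request of a different size at the same priority and location overwrites the first instead of coexisting with it.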

      We think the second approach is better. Once YARN-314 is resolved, I'll create a patch that allows sending 0-container requests.

      Any suggestions are welcome.


          Issue Links



              • Assignee:
                johnyangk John Yang
                johnyangk John Yang
              • Votes:
                0
              • Watchers:
                2


                • Created: