
REEF-568: Work around the federated YARN node reports problem

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 0.13
    • Component/s: None
    • Labels:
      None

      Description

      When trying to use REEF with YARN Federation, there is a problem with the node reports YARN sends us.
      Just after initializing our YARN client library (hadoop-yarn-client-2.4.0), we ask for the RUNNING nodes in the cluster to populate our own Resource Catalog.
      YARN replies with the nodes that belong to a 'random' sub-cluster: sometimes with the nodes in the correct sub-cluster (where the containers will be placed), and sometimes with the nodes of another one. This causes the application to fail nondeterministically.
      For example, we populate our Resource Catalog with the nodes of sub-cluster 1, but the allocations are actually made on sub-cluster 2, so we fail.

      We need a workaround for this issue, as the YARN folks are not sure when they will have the right fix in place.
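
      For reference, the query that triggers the problem looks roughly like this when issued directly against the YARN client API (a minimal sketch using hadoop-yarn-client, not REEF's actual catalog code):

      {code:java}
      import java.util.List;

      import org.apache.hadoop.yarn.api.records.NodeReport;
      import org.apache.hadoop.yarn.api.records.NodeState;
      import org.apache.hadoop.yarn.client.api.YarnClient;
      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      public final class NodeReportProbe {
        public static void main(final String[] args) throws Exception {
          final YarnClient yarnClient = YarnClient.createYarnClient();
          yarnClient.init(new YarnConfiguration());
          yarnClient.start();
          try {
            // On a federated cluster this list may cover only one sub-cluster,
            // and which sub-cluster it covers can change between invocations.
            final List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (final NodeReport node : nodes) {
              System.out.println(node.getNodeId() + " -> " + node.getCapability());
            }
          } finally {
            yarnClient.stop();
          }
        }
      }
      {code}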

        Issue Links

          Activity

          Markus Weimer added a comment -

          We need to do a work around to the federated YARN container allocation issue.

          Can you please be more specific on the issue we face?

          Ignacio Cano added a comment - edited

          Outdated comment: should be discarded; please refer to the current description...

          When using a federated YARN cluster, every time we try to allocate a container, the underlying YARN implementation allocates one container in each sub-cluster and then returns them all to the AM. If there are two sub-clusters and we request one container, two will be returned to the AM, causing REEF to fail.
          We therefore need a workaround to silently discard those extra containers when using federation. A possible shape for it is sketched below.
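
          One possible shape for such a workaround (a hypothetical sketch with illustrative names, not code from any attached pull request): in the AM's allocation callback, hand over only as many containers as are still outstanding and release the surplus back to the ResourceManager.

          {code:java}
          import java.util.List;

          import org.apache.hadoop.yarn.api.records.Container;
          import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

          /**
           * Hypothetical sketch: silently discard the surplus containers that a
           * federated cluster hands back beyond what was actually requested.
           * The field name "outstandingRequests" is illustrative only.
           */
          final class SurplusContainerFilter {

            private final AMRMClientAsync<?> amRmClient;
            private int outstandingRequests; // incremented whenever a container request is submitted

            SurplusContainerFilter(final AMRMClientAsync<?> amRmClient, final int outstandingRequests) {
              this.amRmClient = amRmClient;
              this.outstandingRequests = outstandingRequests;
            }

            /** To be called from AMRMClientAsync.CallbackHandler#onContainersAllocated. */
            synchronized void onContainersAllocated(final List<Container> containers) {
              for (final Container container : containers) {
                if (this.outstandingRequests > 0) {
                  this.outstandingRequests--;
                  // hand the container to the normal REEF evaluator launch path
                } else {
                  // surplus container from another sub-cluster: give it back silently
                  this.amRmClient.releaseAssignedContainer(container.getId());
                }
              }
            }
          }
          {code}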

          Ignacio Cano added a comment -

          Should link to the YARN JIRA issue when they have one.

          Ignacio Cano added a comment -

          This is the only federated YARN JIRA available so far.

          Markus Weimer added a comment -

          Just after initializing our yarn client library (hadoop-yarn-client-2.4.0), we ask for the RUNNING nodes in the cluster to populate our own Resource Catalog.

          Does compiling with newer Hadoop change anything? You can compile REEF via mvn -Dhadoop.version=2.6 clean package and check.

          Ignacio Cano added a comment -

          It will be the same, but I can check it tomorrow.
          It's a YARN bug: they reply incorrectly to our request when federation is enabled.

          Markus Weimer added a comment -

          Wouldn't REEF's behavior also cause issues for dynamic clusters? That is: if an admin were to add a node to a cluster, future allocations might end up on that node, which in turn would crash the REEF Driver, correct? That strikes me as a more severe issue than assumed here, and we should have proper behavior for it.
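
          A sketch of the more tolerant behavior suggested here (purely illustrative; a plain set stands in for REEF's Resource Catalog, and the class and method names are not real REEF APIs):

          {code:java}
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          import org.apache.hadoop.yarn.api.records.Container;

          /**
           * Illustrative sketch: a node seen for the first time at allocation time
           * is registered on the fly instead of crashing the Driver. A concurrent
           * set stands in for the real Resource Catalog.
           */
          final class TolerantNodeTracker {

            private final Set<String> knownHosts = ConcurrentHashMap.newKeySet();

            void onContainerAllocated(final Container container) {
              final String host = container.getNodeId().getHost();
              if (this.knownHosts.add(host)) {
                // First time we see this host: it was added after startup (dynamic
                // cluster) or belongs to another sub-cluster (federation).
                System.out.println("Registering previously unknown node: " + host);
              }
              // continue with the normal container handling path
            }
          }
          {code}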

          Ignacio Cano added a comment -

          Yes, it will. I will close this JIRA as a duplicate of a new one that explains the problem, and update the code appropriately.
          I will also close the pull request and create a new one with these changes.

          Ignacio Cano added a comment -

          Closing this one, as the other one will be resolved instead.


            People

            • Assignee:
              Ignacio Cano
              Reporter:
              Ignacio Cano
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved:
