[SPARK-11638] Run Spark on Mesos with bridge networking - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Later
Affects Version/s: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
Fix Version/s: None
Component/s: Mesos, Spark Core
Labels:
None

Flags:

Patch

Description

Summary

Provides spark.driver.advertisedPort, spark.fileserver.advertisedPort, spark.broadcast.advertisedPort and spark.replClassServer.advertisedPort settings to enable running Spark in Mesos on Docker with Bridge networking. Provides patches for Akka Remote to enable Spark driver advertisement using alternative host and port.

With these settings, it is possible to run Spark Master in a Docker container and have the executors running on Mesos talk back correctly to such Master.

The problem is discussed on the Mesos mailing list here: https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bFOtj5LJZFRH5u7ix-ghppFqKnVg9mkKctjg@mail.gmail.com%3E

Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door

In order for the framework to receive orders in the bridged container, Mesos in the container has to register for offers using the IP address of the Agent. Offers are sent by Mesos Master to the Docker container running on a different host, an Agent. Normally, prior to Mesos 0.24.0, libprocess would advertise itself using the IP address of the container, something like 172.x.x.x. Obviously, Mesos Master can't reach that address, it's a different host, it's a different machine. Mesos 0.24.0 introduced two new properties for libprocess - LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT. This allows the container to use the Agent's address to register for offers. This was provided mainly for running Mesos in Docker on Mesos.

Spark - how does the above relate and what is being addressed here?

Similar to Mesos, out of the box, Spark does not allow to advertise its services on ports different than bind ports. Consider following scenario:

Spark is running inside a Docker container on Mesos, it's a bridge networking mode. Assuming a port 6666 for the spark.driver.port, 6677 for the spark.fileserver.port, 6688 for the spark.broadcast.port and 23456 for the spark.replClassServer.port. If such task is posted to Marathon, Mesos will give 4 ports in range 31000-32000 mapping to the container ports. Starting the executors from such container results in executors not being able to communicate back to the Spark Master.

This happens because of 2 things:

Spark driver is effectively an akka-remote system with akka.tcp transport. akka-remote prior to version 2.4 can't advertise a port different to what it bound to. The settings discussed are here: https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376. These do not exist in Akka 2.3.x. Spark driver will always advertise port 6666 as this is the one akka-remote is bound to.
Any URIs the executors contact the Spark Master on, are prepared by Spark Master and handed over to executors. These always contain the port number used by the Master to find the service on. The services are:

spark.broadcast.port
spark.fileserver.port
spark.replClassServer.port

all above ports are by default 0 (random assignment) but can be specified using Spark configuration ( -Dspark...port ). However, they are limited in the same way as the spark.driver.port; in the above example, an executor should not contact the file server on port 6677 but rather on the respective 31xxx assigned by Mesos.

Spark currently does not allow any of that.

Taking on the problem, step 1: Spark Driver

As mentioned above, Spark Driver is based on akka-remote. In order to take on the problem, the akka.remote.net.tcp.bind-hostname and akka.remote.net.tcp.bind-port settings are a must. Spark does not compile with Akka 2.4.x yet.

What we want is the back port of mentioned akka-remote settings to 2.3.x versions. These patches are attached to this ticket - 2.3.4.patch and 2.3.11.patch files provide patches for respective akka versions. These add mentioned settings and ensure they work as documented for Akka 2.4. In other words, these are future compatible.

A part of that patch also exists in the patch for Spark, in the org.apache.spark.util.AkkaUtils class. This is where Spark is creating the driver and compiling the Akka configuration. That part of the patch tells Akka to use bind-hostname instead of hostname, if spark.driver.advertisedHost is given and use bind-port instead of port, if spark.driver.advertisedPort is given. In such cases, hostname and port are set to the advertised values, respectively.

Worth mentioning: if spark.driver.advertisedHost or spark.driver.advertisedPort isn't given, patched Spark reverts to using the settings as they would be in case of non-patched akka-remote; exactly for that purpose: if there is no patched akka-remote in use. Even if it is in use, akka-remote will correctly handle undefined bind-hostname and bind-port, as specified by Akka 2.4.x.

Akka versions in Spark (attached patches only)

Akka 2.3.4
Spark 1.4.0
Spark 1.4.1
Akka 2.3.11
spark 1.5.0
spark 1.5.1
spark-1.6.0-SNAPSHOT

Taking on the problem, step 2: Spark services

The fortunate thing is that every other Spark service is running over HTTP, using an org.apache.spark.HttpServer class. This is where the second part of the Spark patch comes into play. All other changes in the patch files provide alternative advertised... ports for each of the following services:

spark.broadcast.port -> spark.broadcast.advertisedPort
spark.fileserver.port -> spark.fileserver.advertisedPort
spark.replClassServer.port -> spark.replClassServer.advertisedPort

What we are telling Spark here, is the following: if there is an alternative advertisedPort setting given to this server instance, use that setting for advertising the port.

Patches

These patches are cleared by the Technicolor IP&L Team to be contributed back under the Apache 2.0 License to Spark.
All patches for versions from 1.4.0 to 1.5.2 can be applied directly to the respective tag from Spark git repository. The 1.6.0-master.patch applies to git sha 18350a57004eb87cafa9504ff73affab4b818e06.

Building Akka

To build the required akka version:

AKKA_VERSION=2.3.4
git clone https://github.com/akka/akka.git .
git fetch origin
git checkout v${AKKA_VERSION}
git apply ...2.3.4.patch
sbt package -Dakka.scaladoc.diagrams=false

What is not supplied

At the moment of contribution, we do not supply any unit tests. We would like to contribute those but we may require some assistance.

=====

Happy to answer any questions and looking forward to any guidance which would lead to have these included in the master Spark version.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

1.4.0.patch
10/Nov/15 21:16
14 kB
Radoslaw Gruchalski
1.4.1.patch
10/Nov/15 21:16
14 kB
Radoslaw Gruchalski
1.5.0.patch
10/Nov/15 21:16
14 kB
Radoslaw Gruchalski
1.5.1.patch
10/Nov/15 21:16
14 kB
Radoslaw Gruchalski
1.5.2.patch
10/Nov/15 21:16
14 kB
Radoslaw Gruchalski
2.3.4.patch
10/Nov/15 21:16
7 kB
Radoslaw Gruchalski
2.3.11.patch
10/Nov/15 21:16
7 kB
Radoslaw Gruchalski
1.6.0.patch
26/Nov/15 22:44
15 kB
Radoslaw Gruchalski

Issue Links

is related to

SPARK-17419 Mesos virtual network support

Resolved

links to

[Github] Pull Request #9608 (radekg)

Run Spark on Mesos with bridge networking

Details

Description

Summary

Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door

Spark - how does the above relate and what is being addressed here?

Taking on the problem, step 1: Spark Driver

Akka versions in Spark (attached patches only)

Taking on the problem, step 2: Spark services

Patches

Building Akka

What is not supplied

Attachments

Attachments

Issue Links

Activity

People

Dates