This is the non-FLIP6 version of FLINK-4346, restricted to deployment messages:
Currently, messages larger than the maximum Akka frame size (akka.framesize) fail during transport. We should add a way to pass such oversized messages, which may occur for task deployments via the TaskDeploymentDescriptor.
We should use the BlobServer to offload big data items (if possible) and make use of any distributed file system behind it. This way, we not only avoid the akka.framesize restriction but may also be able to speed up deployment.
I suggest the following changes:
- the sender, i.e. the Execution class, tries to store the serialized job information and serialized task information from the TaskDeploymentDescriptor (tdd), if oversized, on the BlobServer as a single NAME_ADDRESSABLE blob under its job ID (if this does not work, we send the whole tdd via akka as usual)
- if stored in a blob, these data items are removed from the tdd
- the receiver, i.e. the TaskManager class, tries to retrieve any offloaded data after receiving the TaskDeploymentDescriptor from akka; it re-assembles the original tdd
- the stored blob may be deleted after re-assembly of the tdd
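The steps above could be sketched roughly as follows. This is only an illustration under assumptions: BlobStore, InMemoryBlobStore, Descriptor, the blob naming scheme and the size threshold are hypothetical stand-ins and not the actual BlobServer or TaskDeploymentDescriptor API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: every type and name here is hypothetical, not Flink's API. */
final class TddOffloadSketch {

    /** Stand-in for the BlobServer: blobs are addressed by job ID and name. */
    interface BlobStore {
        void put(String jobId, String name, byte[] data) throws IOException;
        byte[] get(String jobId, String name) throws IOException;
        void delete(String jobId, String name) throws IOException;
    }

    static final class InMemoryBlobStore implements BlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        public void put(String jobId, String name, byte[] data) {
            blobs.put(jobId + "/" + name, data.clone());
        }

        public byte[] get(String jobId, String name) throws IOException {
            byte[] data = blobs.get(jobId + "/" + name);
            if (data == null) {
                throw new IOException("no such blob: " + name);
            }
            return data.clone();
        }

        public void delete(String jobId, String name) {
            blobs.remove(jobId + "/" + name);
        }

        int size() {
            return blobs.size();
        }
    }

    /** Simplified tdd: the payload is either carried inline or referenced by blob name. */
    static final class Descriptor {
        final String jobId;
        final String attemptId;
        final byte[] payload;       // serialized job + task information; null once offloaded
        final String offloadedBlob; // blob name; null if the payload travels inline

        Descriptor(String jobId, String attemptId, byte[] payload, String offloadedBlob) {
            this.jobId = jobId;
            this.attemptId = attemptId;
            this.payload = payload;
            this.offloadedBlob = offloadedBlob;
        }
    }

    /** Sender side: offload an oversized payload; on any failure keep the tdd unchanged. */
    static Descriptor offloadIfOversized(Descriptor tdd, BlobStore store, int maxInlineBytes) {
        if (tdd.payload == null || tdd.payload.length <= maxInlineBytes) {
            return tdd;
        }
        String name = "tdd-" + tdd.attemptId; // assumed naming scheme
        try {
            store.put(tdd.jobId, name, tdd.payload);
        } catch (IOException e) {
            return tdd; // fall back to sending the whole tdd via akka
        }
        return new Descriptor(tdd.jobId, tdd.attemptId, null, name);
    }

    /** Receiver side: fetch offloaded data, rebuild the tdd, then delete the blob. */
    static Descriptor reassemble(Descriptor tdd, BlobStore store) {
        if (tdd.offloadedBlob == null) {
            return tdd;
        }
        try {
            byte[] payload = store.get(tdd.jobId, tdd.offloadedBlob);
            store.delete(tdd.jobId, tdd.offloadedBlob);
            return new Descriptor(tdd.jobId, tdd.attemptId, payload, null);
        } catch (IOException e) {
            throw new UncheckedIOException("could not load offloaded data", e);
        }
    }
}
```

In this sketch a failed upload silently falls back to the inline path, so an oversized tdd would still hit the akka.framesize limit exactly as it does today. Whether that fallback should log or fail fast is left open here.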
Further (future) changes may include:
- separating the serialized job information and serialized task information into two files and re-using the first one for all tasks
- not re-deploying these two during job recovery (if possible)
- then, like all other NAME_ADDRESSABLE blobs, these offloaded blobs may be removed when the job enters a final state instead of right after re-assembly
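A rough sketch of how the split into a shared job-information blob and per-task blobs might look. The class, method names and blob names are made up for illustration and do not reflect the BlobServer API; the point is only that N task deployments would store 1 + N blobs and that all of them go away together when the job finishes.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: a hypothetical name-addressable store, job ID -> (blob name -> bytes). */
final class SharedJobInfoSketch {

    private final Map<String, Map<String, byte[]>> blobsByJob = new HashMap<>();

    /** Stores the job information once per job; repeated calls for the same job are no-ops. */
    String putJobInfoOnce(String jobId, byte[] jobInfo) {
        blobsByJob.computeIfAbsent(jobId, k -> new HashMap<>())
                  .putIfAbsent("job-info", jobInfo.clone());
        return "job-info";
    }

    /** Stores the task information of a single execution attempt under its own name. */
    String putTaskInfo(String jobId, String attemptId, byte[] taskInfo) {
        String name = "task-info-" + attemptId; // assumed naming scheme
        blobsByJob.computeIfAbsent(jobId, k -> new HashMap<>())
                  .put(name, taskInfo.clone());
        return name;
    }

    /** Number of blobs currently stored for the given job. */
    int blobCount(String jobId) {
        Map<String, byte[]> blobs = blobsByJob.get(jobId);
        return blobs == null ? 0 : blobs.size();
    }

    /** Removes every blob of the job, e.g. once the job has reached a final state. */
    void removeJob(String jobId) {
        blobsByJob.remove(jobId);
    }
}
```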