Zhankun Tang, thanks for the patch and doc. It echoes many of my concerns.
I've given image localization and management quite a bit of thought, and so far I haven't come up with a great solution here. Some of the goals I had in mind that, IMO, should be carried forward are to minimize dependence on the internet, get the container started as fast as possible, ensure the same image is used for the duration of an application, and maintain the image's metadata.
Daniel Templeton, today, when container executor issues the docker run, a docker pull is run behind the scenes, similar to what you are suggesting. On unstable networks, the potential for timeouts is high. This also doesn't work for Docker Hub private repositories, but that is a separate issue that needs to be filed.
One comment on the approach here: IIRC, docker save/load retains the history and layers, whereas export/import flattens the image, so we should likely use save/load rather than export/import.
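For reference, here is a rough sketch of the two command pairs, written as argv lists so nothing is actually executed. Per the Docker CLI docs, docker save writes an image together with its parent layers and tags, while docker export writes only a container's filesystem. The function names, paths, and image names below are made up for illustration and are not taken from the patch.

```python
def save_load_commands(image, tarball):
    """argv lists for moving an IMAGE; layers, history and tags are kept."""
    return [
        ["docker", "save", "-o", tarball, image],  # run on the source host
        ["docker", "load", "-i", tarball],         # run on the target host
    ]


def export_import_commands(container, tarball, new_image):
    """argv lists for moving a CONTAINER filesystem; the result is one flat layer."""
    return [
        ["docker", "export", "-o", tarball, container],  # input is a container, not an image
        ["docker", "import", tarball, new_image],        # creates a new single-layer image
    ]


if __name__ == "__main__":
    # Hypothetical names, only to show the shape of the commands.
    for argv in save_load_commands("myapp:1.0", "/tmp/myapp.tar"):
        print(" ".join(argv))
    for argv in export_import_commands("myapp-container", "/tmp/myapp-fs.tar", "myapp:flat"):
        print(" ".join(argv))
```

Note that the two pairs take different inputs (an image versus a container), which matters if the tarball is meant to preserve the image's metadata.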
The approach outlined in the patch does have its merits. You are not dependent on being able to pull from docker hub or a private registry and could ensure that the same image is run by all of the tasks in the job. I believe it would be possible to keep the image metadata intact as well.
My concern with using Docker Hub or a private registry is what happens during a long-running job if someone pushes a new "latest" to the registry. Would the docker pull result in the last part of my application running a different image (perhaps that doesn't apply to what you have in mind)? However, I completely agree with Daniel's concerns on the current approach, plus it's extra work administrators now have to do to get the images packaged and into HDFS.
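On the moving "latest" question, one possible guard (not something in the patch, just a sketch) would be to resolve the tag to a content digest once at submission time and have every container use the name@sha256:... form, which docker pull accepts. The helper names strip_tag and pin_to_digest are hypothetical, and the sketch assumes the digest has already been looked up somewhere else.

```python
import re

# A sha256 content digest as it appears in image references.
_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def strip_tag(image):
    """Drop any tag or digest, keeping registry host:port and repository."""
    name = image.split("@", 1)[0]
    # A colon only marks a tag if it comes after the last slash;
    # otherwise it belongs to a registry port such as host:5000.
    if name.rfind(":") > name.rfind("/"):
        name = name[: name.rfind(":")]
    return name


def pin_to_digest(image, digest):
    """Return an immutable reference for the image resolved at submission time."""
    if not _DIGEST.fullmatch(digest):
        raise ValueError("expected a sha256 digest, got %r" % digest)
    return "%s@%s" % (strip_tag(image), digest)


if __name__ == "__main__":
    fake_digest = "sha256:" + "a" * 64  # placeholder, not a real image digest
    print(pin_to_digest("registry.example.com:5000/team/app:latest", fake_digest))
```

Every task in the job would then pull the pinned reference, so a later push to "latest" would not change what the remaining containers run.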
Somewhat OT, but I started on an HDFS storage plugin for the docker registry storage driver API. The API was changing daily, so I put it on the back burner to wait for a bit more stabilization - docker-registry-hdfs if you want to play with it. It allows for doing a docker pull from a private registry backed by HDFS. This would help satisfy the goals of not depending on the internet/Docker Hub and maintaining the image's metadata, but beyond that it doesn't buy us much. I'm hopeful there are interesting "hacks" where this might provide benefits to YARN in the future.
Perhaps image management and localization should be handled outside of the application lifecycle? Otherwise, image localization could introduce significant lag for starting containers (which may be OK?).
Interested in others' thoughts here.