Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.2, 2.0.0
Labels: None
Description
The behavior of SparkContext.addFile() changed slightly with the Netty-RPC-based file server, which was introduced in Spark 1.6 (disabled by default) and became the default and only file server in Spark 2.0.0.
Prior to 2.0, calling SparkContext.addFile() twice with the same path would succeed and would cause future tasks to receive an updated copy of the file. This behavior was never explicitly documented, but Spark has behaved this way since very early 1.x versions (some of the relevant lines in Executor.updateDependencies() have existed since 2012).
In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.
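For illustration, a minimal repro of the two behaviors might look like the sketch below (the file path, master URL, and task body are placeholders, and the exception text is paraphrased):

{code:scala}
import java.io.{File, PrintWriter}

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileTwiceRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("addFile-repro").setMaster("local[2]"))

    // Register a small file with the driver's file server.
    val file = new File("/tmp/config.txt")
    new PrintWriter(file) { write("v1"); close() }
    sc.addFile(file.getAbsolutePath)

    // Tasks fetch their local copy through SparkFiles.get().
    sc.parallelize(1 to 2).foreach { _ =>
      println(scala.io.Source.fromFile(SparkFiles.get("config.txt")).mkString)
    }

    // Re-register the same path after updating the file.
    new PrintWriter(file) { write("v2"); close() }
    sc.addFile(file.getAbsolutePath)
    // Pre-2.0 (HTTP file server): the call succeeds and new tasks pick up the updated copy.
    // 2.0, or 1.6 with the Netty file server enabled: the call fails with an
    // IllegalArgumentException ("requirement failed ...") from NettyStreamManager.

    sc.stop()
  }
}
{code}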
I believe that this change of behavior was unintentional and propose to remove the require check so that Spark 2.0 matches 1.x's default behavior.
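The guard and the proposed relaxation can be sketched as follows; FileRegistry, addFileStrict, and addFileOverwrite are hypothetical names that model NettyStreamManager's internal file map rather than quote its source:

{code:scala}
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Illustrative model (not Spark source): the file server keeps a
// name -> File map of registered files, and the require() on registration
// is what makes the second addFile() call fail.
class FileRegistry {
  private val files = new ConcurrentHashMap[String, File]()

  // Current 2.0 behavior: a second registration under the same name fails.
  def addFileStrict(file: File): Unit =
    require(files.putIfAbsent(file.getName, file) == null,
      s"File ${file.getName} was already registered.")

  // Proposed behavior: overwrite the entry so later fetches serve the new
  // copy, matching the pre-2.0 HTTP file server.
  def addFileOverwrite(file: File): Unit = {
    files.put(file.getName, file)
  }
}
{code}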
This problem also affects addJar() in a more subtle way: the fileServer.addJar() call will also fail with an exception, but that exception is logged and ignored because of code added in 2014 to ignore errors caused by missing Spark examples JARs when running in YARN cluster mode (AFAIK).
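That subtlety can be modeled roughly as below; addLocalJarFile and registerWithFileServer are hypothetical stand-ins for the corresponding code path in SparkContext.addJar() and the fileServer.addJar() call, not verbatim Spark code:

{code:scala}
import java.io.File

// Rough model (not Spark source) of why addJar() hides the same failure:
// the file-server registration is wrapped in a try/catch that logs the error
// and returns null, originally added so that a missing examples JAR in YARN
// cluster mode would not abort the application.
object AddJarSketch {
  // Stand-in for fileServer.addJar(); with the Netty file server it throws
  // on a duplicate registration just like addFile() does.
  private def registerWithFileServer(file: File): String =
    throw new IllegalArgumentException(s"File ${file.getName} was already registered.")

  def addLocalJarFile(file: File): String =
    try {
      registerWithFileServer(file)
    } catch {
      case e: Exception =>
        // The duplicate-registration error only shows up as a log line; the
        // caller gets null back and the second addJar() silently does nothing.
        println(s"Failed to add ${file.getPath} to Spark environment: $e")
        null
    }
}
{code}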