Spark / SPARK-16787

SparkContext.addFile() should not fail if called twice with the same file



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core


      The behavior of SparkContext.addFile() changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.

      Prior to 2.0, calling SparkContext.addFile() twice with the same path would succeed and would cause future tasks to receive an updated copy of the file. This behavior was never explicitly documented but Spark has behaved this way since very early 1.x versions (some of the relevant lines in Executor.updateDependencies() have existed since 2012).

      In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.

      I believe that this change of behavior was unintentional and propose to remove the require check so that Spark 2.0 matches 1.x's default behavior.
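The two behaviors can be sketched in miniature. This is plain Python, not Spark's actual Scala internals; the class and method names are illustrative only:

```python
class StrictRegistry:
    """Sketch of the 2.0 behavior: a guard rejects duplicate registration."""

    def __init__(self):
        self._files = {}  # registered name -> local path

    def add_file(self, name, path):
        # Analogue of the `require` check in the Netty-based file server:
        # a second registration of the same name raises instead of updating.
        if name in self._files:
            raise ValueError(f"File {name} was already registered.")
        self._files[name] = path


class OverwriteRegistry:
    """Sketch of the pre-2.0 (and proposed) behavior: re-adding replaces."""

    def __init__(self):
        self._files = {}

    def add_file(self, name, path):
        # No guard: a second call simply updates the registered path, so
        # future tasks fetch the updated copy of the file.
        self._files[name] = path


strict = StrictRegistry()
strict.add_file("data.txt", "/tmp/v1/data.txt")
try:
    strict.add_file("data.txt", "/tmp/v2/data.txt")  # second call fails
except ValueError as e:
    print("second addFile fails:", e)

relaxed = OverwriteRegistry()
relaxed.add_file("data.txt", "/tmp/v1/data.txt")
relaxed.add_file("data.txt", "/tmp/v2/data.txt")  # succeeds, updates the path
print(relaxed._files["data.txt"])  # /tmp/v2/data.txt
```

Removing the guard, as proposed, makes the second registry's semantics the ones callers see.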

This problem also affects addJar() in a subtler way: the fileServer.addJar() call will also fail with an exception, but that exception is logged and ignored due to code added in 2014 to suppress errors caused by missing Spark examples JARs when running in YARN cluster mode (AFAIK).
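The addJar() failure mode described above can be sketched as follows. Again this is an illustrative Python stand-in, not Spark's code: the duplicate-registration error is caught and logged rather than propagated, so the caller never learns the second registration did nothing:

```python
import logging

def register(files, name, path):
    # Stand-in for the file server's guarded registration (see the
    # duplicate-registration check described earlier).
    if name in files:
        raise ValueError(f"{name} was already registered")
    files[name] = path

def add_jar(files, name, path):
    # Sketch of the addJar() path: the exception is logged and swallowed,
    # originally so that a missing JAR would not abort startup.
    try:
        register(files, name, path)
    except ValueError as exc:
        logging.warning("Failed to add jar %s: %s", name, exc)

files = {}
add_jar(files, "app.jar", "/tmp/v1/app.jar")
add_jar(files, "app.jar", "/tmp/v2/app.jar")  # logs a warning, keeps v1
print(files["app.jar"])  # /tmp/v1/app.jar
```

The silent failure means the second call looks successful to the caller while the old JAR remains registered.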




            • Assignee: Josh Rosen
            • Votes: 0
            • Watchers: 4

