Spark / SPARK-16787

SparkContext.addFile() should not fail if called twice with the same file


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      The behavior of SparkContext.addFile() changed slightly with the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default and only file server in Spark 2.0.0.

      Prior to 2.0, calling SparkContext.addFile() twice with the same path would succeed and would cause future tasks to receive an updated copy of the file. This behavior was never explicitly documented, but Spark has worked this way since very early 1.x versions (some of the relevant lines in Executor.updateDependencies() have existed since 2012).
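
      For illustration, this is the call pattern that worked prior to 2.0; the path and file name are hypothetical, and local[2] is just a convenient test master:

        import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

        val sc = new SparkContext(
          new SparkConf().setAppName("addFile-twice").setMaster("local[2]"))

        // First registration: tasks resolve the file via SparkFiles.get().
        sc.addFile("/tmp/config.properties")
        sc.parallelize(1 to 2).foreach { _ =>
          println(SparkFiles.get("config.properties")) // local path on the executor
        }

        // Prior to 2.0 this second call succeeded and later tasks saw the
        // updated copy; with the Netty file server it fails a require() check.
        sc.addFile("/tmp/config.properties")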

      In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.
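
      The guard is of this general shape (a paraphrase rather than the exact Spark source; a ConcurrentHashMap stands in for NettyStreamManager's internal registry):

        import java.io.File
        import java.util.concurrent.ConcurrentHashMap

        class FileRegistry {
          private val files = new ConcurrentHashMap[String, File]()

          def addFile(file: File): String = {
            // putIfAbsent returns null only when no entry existed, so a
            // second addFile() with the same file name fails the require().
            require(files.putIfAbsent(file.getName, file) == null,
              s"File added already exists: ${file.getName}")
            s"spark://host:port/files/${file.getName}" // illustrative URI
          }
        }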

      I believe that this change of behavior was unintentional and propose to remove the require check so that Spark 2.0 matches 1.x's default behavior.
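
      Dropping the check would make re-registration succeed the way it did in 1.x; sketched against the hypothetical FileRegistry above (not the committed patch):

        def addFile(file: File): String = {
          // Overwrite any earlier entry so that repeated addFile() calls
          // succeed and later fetches serve the most recent copy.
          files.put(file.getName, file)
          s"spark://host:port/files/${file.getName}"
        }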

      This problem also affects addJar() in a more subtle way: the fileServer.addJar() call will also fail with an exception, but that exception is logged and ignored because of code added in 2014 to suppress errors caused by missing Spark examples JARs when running in YARN cluster mode (AFAIK).
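
      The swallow-and-log pattern is roughly the following sketch, where addToFileServer and logError stand in for the corresponding SparkContext/SparkEnv members:

        import java.io.File

        def registerJar(addToFileServer: File => String,
                        logError: (String, Throwable) => Unit,
                        path: String): String = {
          try {
            addToFileServer(new File(path)) // may throw on duplicate registration
          } catch {
            case e: Exception =>
              // The failure is logged and then dropped, so the addJar()
              // caller never sees the duplicate-registration error.
              logError(s"Failed to add $path to Spark environment", e)
              null
          }
        }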


          People

            Assignee: joshrosen (Josh Rosen)
            Reporter: joshrosen (Josh Rosen)
            Votes: 0
            Watchers: 4
