Spark / SPARK-28517

pyspark with --conf spark.jars.packages causes duplicate jars to be uploaded

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: PySpark, YARN
    • Labels:
    • Environment:

      spark 2.4.3_2.12 without hadoop

      yarn 2.6

      python 2.7.16

      centos 7

      Description

      Steps to reproduce:

      spark-submit --master yarn --conf "spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" ${SPARK_HOME}/examples/src/main/python/pi.py 100

      Undesirable behavior:

      Warnings are printed stating that the package jars have been added to the distributed cache multiple times:

      19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added multiple times to distributed cache.
      19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.

      This does not happen for Scala jobs, only for PySpark jobs.
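To check a captured spark-submit log for this symptom, the warnings can be extracted with a small script. This is only a rough sketch: the function name `find_duplicate_resources` and the variable `sample_log` are made up for illustration, and the pattern assumes the warning text is worded exactly as in the lines quoted in this report.

```python
import re
from collections import Counter

# Pattern for the YARN Client warning quoted in this report.
# Assumption: the message wording matches the 2.4.3 output shown above.
DUPLICATE_WARNING = re.compile(
    r"WARN Client: Same path resource (\S+) added multiple times to distributed cache\."
)


def find_duplicate_resources(log_text):
    """Return a Counter mapping each flagged resource path to its warning count."""
    return Counter(m.group(1) for m in DUPLICATE_WARNING.finditer(log_text))


# Hypothetical sample: three WARN lines copied from the output in this report.
sample_log = """\
19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added multiple times to distributed cache.
19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.
"""

duplicates = find_duplicate_resources(sample_log)
for path in sorted(duplicates):
    print(path)
```

Running the same function over the output of an equivalent Scala job should yield an empty result if the behavior described here holds.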

       

      Full output of an example run:

      [barryl@hostname ~]$ /opt/spark2/bin/spark-submit --master yarn --conf "spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" /opt/spark2/examples/src/main/python/pi.py 100
      Ivy Default Cache set to: /home/barryl/.ivy2/cache
      The jars for the packages stored in: /home/barryl/.ivy2/jars
      :: loading settings :: url = jar:file:/opt/spark-2.4.3-bin-without-hadoop-scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
      org.apache.spark#spark-avro_2.12 added as a dependency
      :: resolving dependencies :: org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c;1.0
          confs: [default]
          found org.apache.spark#spark-avro_2.12;2.4.3 in central
          found org.spark-project.spark#unused;1.0.0 in central
      :: resolution report :: resolve 457ms :: artifacts dl 5ms
          :: modules in use:
          org.apache.spark#spark-avro_2.12;2.4.3 from central in [default]
          org.spark-project.spark#unused;1.0.0 from central in [default]
          ---------------------------------------------------------------------
          |                  |            modules            ||   artifacts   |
          |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
          ---------------------------------------------------------------------
          |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
          ---------------------------------------------------------------------
      :: retrieving :: org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c
          confs: [default]
          0 artifacts copied, 2 already retrieved (0kB/7ms)
      19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
      19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added multiple times to distributed cache.
      19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.
      19/07/25 23:25:28 WARN TaskSetManager: Stage 0 contains a task of very large size (365 KB). The maximum recommended task size is 100 KB.
      Pi is roughly 3.142308

       

            People

            • Assignee: Unassigned
            • Reporter: barryl Barry