Zeppelin / ZEPPELIN-4934

Add jars to sys.path


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: pySpark, spark
    • Labels: None
    • Environment:
      Distributor ID: Ubuntu
      Description: Ubuntu 18.04.1 LTS
      Release: 18.04
      Codename: bionic

      spark 2.4.6

      hadoop 3.1.3

    Description

      Packages (jars) added through spark.jars.packages do not appear in the sys.path of the pyspark interpreter. This differs from the behavior of the pyspark 2.4 CLI shell (see the example below). It leads to import errors (module not found) for jars that bundle their Python code, such as com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1.
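
To illustrate why the missing sys.path entries matter, here is a minimal sketch showing that Python can import a module straight out of a jar (a zip archive) once the archive is on sys.path. The names demo.jar and demo_pkg_in_jar are hypothetical and stand in for a real package such as mmlspark; no Spark installation is needed to run it.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway "jar" (a plain zip archive) that contains one Python package.
# demo.jar and demo_pkg_in_jar are hypothetical names used only for this sketch.
tmp_dir = tempfile.mkdtemp()
jar_path = os.path.join(tmp_dir, "demo.jar")
with zipfile.ZipFile(jar_path, "w") as jar:
    jar.writestr("demo_pkg_in_jar/__init__.py", "GREETING = 'hello from jar'\n")

# Before the jar is on sys.path, the import fails.
try:
    importlib.import_module("demo_pkg_in_jar")
    found_before = True
except ImportError:
    found_before = False

# Once the jar is on sys.path, Python's zip import machinery can load the package.
sys.path.append(jar_path)
importlib.invalidate_caches()
module = importlib.import_module("demo_pkg_in_jar")
print(found_before, module.GREETING)
```

The same mechanism applies to any jar that ships Python sources at its top level: the import only succeeds if the jar path itself is a sys.path entry.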

      pyspark adds these jars to the path at context initialization. See the following files in the Spark repo:

      • core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
      • python/pyspark/context.py
      • python/pyspark/shell.py

      In pyspark, the jars passed with "--packages" are passed on to "spark.submit.pyFiles" (in the prepareSubmitEnvironment function) and then added to sys.path during context initialization.
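
The flow described above can be sketched roughly as follows. This is a simplified illustration, not Spark's actual code: the function name paths_after_init, the extension filter, and the insertion at index 1 are assumptions chosen to match the sys.path shown in the CLI output further down, where the jars appear right after the leading '' entry under the userFiles directory.

```python
import os

# Assumption: only archive types that Python can import from are put on the path.
PACKAGE_EXTENSIONS = (".zip", ".egg", ".jar")


def paths_after_init(py_files, user_files_dir, path):
    """Return a copy of `path` with each archive in `py_files` inserted at index 1.

    py_files       -- comma-separated value of spark.submit.pyFiles
    user_files_dir -- directory the files are copied to (hypothetical parameter)
    path           -- the sys.path list before context initialization
    """
    result = list(path)
    for entry in py_files.split(","):
        if not entry:
            continue  # an empty config value yields one empty entry
        filename = os.path.basename(entry)
        if filename.lower().endswith(PACKAGE_EXTENSIONS):
            result.insert(1, os.path.join(user_files_dir, filename))
    return result


before = ["", "/usr/lib/python3.6"]
after = paths_after_init("/ivy/a.jar,/ivy/b.jar,/ivy/notes.txt", "/tmp/userFiles", before)
print(after)
```

Because every entry is inserted at the same index, later jars end up ahead of earlier ones, and files that are not importable archives are skipped.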

      A simple fix is to do:

      import sys

      sys.path.extend(sc.getConf().get("spark.jars").split(","))

      at the top of every notebook. However, this is cumbersome and unintuitive for users who expect the same behavior as in Spark.
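
A slightly more defensive variant of this workaround might look like the sketch below. It tolerates an unset or empty spark.jars value, skips duplicates, and strips a file: URI scheme. The helper names jar_paths and add_jars_to_sys_path are hypothetical, and the file: handling is an assumption about how the value may be formatted.

```python
import sys
from urllib.parse import urlparse


def jar_paths(spark_jars):
    """Split a comma-separated spark.jars value into a list of paths."""
    paths = []
    for entry in (spark_jars or "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        parsed = urlparse(entry)
        # Assumption: file: URIs map to local paths; other entries are kept as-is.
        paths.append(parsed.path if parsed.scheme == "file" else entry)
    return paths


def add_jars_to_sys_path(spark_jars, path=None):
    """Append each jar to `path` (default: sys.path) once, preserving order."""
    path = sys.path if path is None else path
    for jar in jar_paths(spark_jars):
        if jar not in path:
            path.append(jar)
    return path


# Example with a plain list instead of the real sys.path.
example = add_jars_to_sys_path("file:///a/x.jar,/b/y.jar,,/b/y.jar", path=["/usr/lib"])
print(example)
```

In a notebook, this would be called as add_jars_to_sys_path(sc.getConf().get("spark.jars", "")), using the sc object that the interpreter provides.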

       

      Behavior in pyspark CLI:

      ```

      pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 --repositories https://mmlspark.azureedge.net/maven --master local[*]

      Python 3.6.9 (default, Oct 17 2019, 11:10:22)
      [GCC 8.3.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      https://mmlspark.azureedge.net/maven added as a remote repository with the name: repo-1
      Ivy Default Cache set to: /root/.ivy2/cache
      The jars for the packages stored in: /root/.ivy2/jars
      :: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
      com.microsoft.ml.spark#mmlspark_2.11 added as a dependency
      :: resolving dependencies :: org.apache.spark#spark-submit-parent-26440d6b-a15d-40e0-8225-c9eb1fd50ac9;1.0
      confs: [default]
      found com.microsoft.ml.spark#mmlspark_2.11;1.0.0-rc1 in repo-1
      found org.scalactic#scalactic_2.11;3.0.5 in central
      found org.scala-lang#scala-reflect;2.11.12 in central
      found org.scalatest#scalatest_2.11;3.0.5 in central
      found org.scala-lang.modules#scala-xml_2.11;1.0.6 in central
      found io.spray#spray-json_2.11;1.3.2 in central
      found com.microsoft.cntk#cntk;2.4 in central
      found org.openpnp#opencv;3.2.0-1 in central
      found com.jcraft#jsch;0.1.54 in central
      found org.apache.httpcomponents#httpclient;4.5.6 in central
      found org.apache.httpcomponents#httpcore;4.4.10 in central
      found commons-logging#commons-logging;1.2 in central
      found commons-codec#commons-codec;1.10 in central
      found com.microsoft.ml.lightgbm#lightgbmlib;2.3.100 in central
      found com.github.vowpalwabbit#vw-jni;8.7.0.3 in central
      :: resolution report :: resolve 462ms :: artifacts dl 9ms
      :: modules in use:
      com.github.vowpalwabbit#vw-jni;8.7.0.3 from central in [default]
      com.jcraft#jsch;0.1.54 from central in [default]
      com.microsoft.cntk#cntk;2.4 from central in [default]
      com.microsoft.ml.lightgbm#lightgbmlib;2.3.100 from central in [default]
      com.microsoft.ml.spark#mmlspark_2.11;1.0.0-rc1 from repo-1 in [default]
      commons-codec#commons-codec;1.10 from central in [default]
      commons-logging#commons-logging;1.2 from central in [default]
      io.spray#spray-json_2.11;1.3.2 from central in [default]
      org.apache.httpcomponents#httpclient;4.5.6 from central in [default]
      org.apache.httpcomponents#httpcore;4.4.10 from central in [default]
      org.openpnp#opencv;3.2.0-1 from central in [default]
      org.scala-lang#scala-reflect;2.11.12 from central in [default]
      org.scala-lang.modules#scala-xml_2.11;1.0.6 from central in [default]
      org.scalactic#scalactic_2.11;3.0.5 from central in [default]
      org.scalatest#scalatest_2.11;3.0.5 from central in [default]
      ---------------------------------------------------------------------
      |                  |            modules            ||   artifacts   |
      |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
      ---------------------------------------------------------------------
      |      default     |   15  |   0   |   0   |   0   ||   15  |   0   |
      ---------------------------------------------------------------------
      :: retrieving :: org.apache.spark#spark-submit-parent-26440d6b-a15d-40e0-8225-c9eb1fd50ac9
      confs: [default]
      0 artifacts copied, 15 already retrieved (0kB/13ms)
      20/07/01 23:36:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Setting default log level to "WARN".
      To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
            /_/

      Using Python version 3.6.9 (default, Oct 17 2019 11:10:22)
      SparkSession available as 'spark'.
      >>> import sys
      >>> sys.path
      ['', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/commons-codec_commons-codec-1.10.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/commons-logging_commons-logging-1.2.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.apache.httpcomponents_httpcore-4.4.10.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.scala-lang.modules_scala-xml_2.11-1.0.6.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.scala-lang_scala-reflect-2.11.12.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/com.github.vowpalwabbit_vw-jni-8.7.0.3.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/com.microsoft.ml.lightgbm_lightgbmlib-2.3.100.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.apache.httpcomponents_httpclient-4.5.6.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/com.jcraft_jsch-0.1.54.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.openpnp_opencv-3.2.0-1.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/com.microsoft.cntk_cntk-2.4.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/io.spray_spray-json_2.11-1.3.2.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.scalatest_scalatest_2.11-3.0.5.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/org.scalactic_scalactic_2.11-3.0.5.jar', 
'/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630/com.microsoft.ml.spark_mmlspark_2.11-1.0.0-rc1.jar', '/tmp/spark-f6381d0b-dff3-43e8-aecb-d54091034d43/userFiles-ffa119a5-9f73-4f83-b019-4578acfa1630', '/opt/spark/python/lib/py4j-0.10.7-src.zip', '/opt/spark/python', '/zeppelin', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/usr/lib/python3.6/site-packages']

      ```


          People

            Assignee: dbanda Dalitso Banda
            Reporter: dbanda Dalitso Banda


              Time Tracking

                Estimated: 1h
                Remaining: 0h
                Logged: 1h