  1. Spark
  2. SPARK-20001

Support PythonRunner executing inside a Conda env


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core

    Description

      Similar to SPARK-13587, I'm trying to allow the user to configure a Conda environment that PythonRunner will run in.
      This change remembers the conda environment found on the driver and installs the same packages on the executor side, only once per PythonWorkerFactory. The list of requested conda packages is added to the PythonWorkerFactory cache, so two collects using the same environment (including packages) can reuse the same running executors.
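
To illustrate the caching idea, here is a minimal, self-contained sketch in plain Python. It is not the code from the pull request: `WorkerKey`, `WorkerFactory` and `FactoryCache` are hypothetical stand-ins, and sorting the package list so that order does not matter is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class WorkerKey:
    """Hypothetical cache key: Python executable plus requested conda packages."""
    python_exec: str
    conda_packages: Tuple[str, ...]


class WorkerFactory:
    """Stand-in for a worker factory; the environment is set up once, here."""

    def __init__(self, key: WorkerKey) -> None:
        self.key = key


class FactoryCache:
    """Returns the same factory for the same (executable, packages) pair."""

    def __init__(self) -> None:
        self._factories: Dict[WorkerKey, WorkerFactory] = {}

    def get(self, python_exec: str, packages: Iterable[str]) -> WorkerFactory:
        # Assumption: package order is irrelevant, so normalize before keying.
        key = WorkerKey(python_exec, tuple(sorted(set(packages))))
        if key not in self._factories:
            self._factories[key] = WorkerFactory(key)
        return self._factories[key]


cache = FactoryCache()
first = cache.get("python", ["numpy", "pandas"])
second = cache.get("python", ["pandas", "numpy"])          # same environment
third = cache.get("python", ["numpy", "pandas", "scipy"])  # extra package
```

In this sketch `first` and `second` are the same object, while `third` is a new factory because its package list differs.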

      You have to specify up front which packages and channels to "bootstrap" the environment with.

      However, SparkContext (as well as JavaSparkContext and the PySpark version) is extended to support addCondaPackage and addCondaChannel.
      The rationale is:

      • you might want to add more packages once you're already running in the driver
      • you might want to add a channel which requires some token for authentication, which you don't yet have access to until the module is already running
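
The two cases above can be sketched with a small stand-in class. This is only a model of the described behaviour, not the SparkContext API: the class `CondaEnvSpec`, the helper `fetch_token`, the method signatures and the example URLs are all hypothetical; only the method names addCondaPackage and addCondaChannel come from this issue.

```python
from typing import List


class CondaEnvSpec:
    """Hypothetical model of the environment description held on the driver."""

    def __init__(self, bootstrap_packages: List[str],
                 bootstrap_channels: List[str]) -> None:
        # The bootstrap packages and channels have to be given up front.
        if not bootstrap_packages or not bootstrap_channels:
            raise ValueError("bootstrap packages and channels are required")
        self.packages = list(bootstrap_packages)
        self.channels = list(bootstrap_channels)

    def addCondaChannel(self, url: str) -> None:
        # Ignore duplicates so repeated calls do not change the environment.
        if url not in self.channels:
            self.channels.append(url)

    def addCondaPackage(self, package: str) -> None:
        if package not in self.packages:
            self.packages.append(package)


def fetch_token() -> str:
    """Hypothetical: a token that only becomes available once the job runs."""
    return "example-token"


spec = CondaEnvSpec(["python"], ["https://repo.example.com/main"])
# Add a channel that needs a token obtained at run time, then a package.
spec.addCondaChannel("https://repo.example.com/t/%s/private" % fetch_token())
spec.addCondaPackage("requests")
```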

      This issue requires that the conda binary is already available on the driver as well as the executors; you just have to specify where it can be found.
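
As a rough sketch of what "specifying where the binary can be found" might feed into, the helper below checks a configured path and builds (without running) a `conda create` command line. The function names and paths are hypothetical, and whether the pull request invokes conda with exactly these flags is an assumption; the flags themselves (`--yes`, `--prefix`, `--channel`, `--override-channels`) are standard `conda create` options.

```python
import os
from typing import List, Sequence


def check_conda_binary(conda_binary: str) -> bool:
    """Return True if the configured path points at an executable file."""
    return os.path.isfile(conda_binary) and os.access(conda_binary, os.X_OK)


def conda_create_command(conda_binary: str, env_dir: str,
                         channels: Sequence[str],
                         packages: Sequence[str]) -> List[str]:
    """Build, but do not execute, a `conda create` command line."""
    cmd = [conda_binary, "create", "--yes", "--prefix", env_dir,
           "--override-channels"]
    for url in channels:
        cmd += ["--channel", url]
    cmd += list(packages)
    return cmd


# Hypothetical paths and channel URL, for illustration only.
example_cmd = conda_create_command(
    "/opt/conda/bin/conda", "/tmp/conda-env",
    ["https://repo.example.com/main"], ["numpy"])
```

The same command would have to succeed on the driver and on every executor, which is why the binary must already be installed in both places.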

      Please see the attached pull request on palantir/spark for additional details: https://github.com/palantir/spark/pull/115

      As for tests, there is a local Python test, as well as YARN client-mode and cluster-mode tests, which ensure that a newly installed package is visible from both the driver and the executor.
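
The kind of check such a test performs could look roughly like the sketch below. It is not taken from the pull request: `is_importable` and `visible_everywhere` are hypothetical helpers, and the Spark calls are shown only in comments because they need a running SparkContext.

```python
import importlib.util
from typing import Iterable


def is_importable(module_name: str) -> bool:
    """True if this interpreter can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


def visible_everywhere(driver_ok: bool, executor_results: Iterable[bool]) -> bool:
    """True only if the driver and at least one executor, and all of them, agree."""
    results = list(executor_results)
    return driver_ok and bool(results) and all(results)


# With a live SparkContext `sc`, the executor side could be probed like this
# (assumption: "some_package" was installed after the context was started):
#
#   on_executors = (sc.parallelize(range(2), 2)
#                     .map(lambda _: is_importable("some_package"))
#                     .collect())
#   assert visible_everywhere(is_importable("some_package"), on_executors)
```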



          People

            Assignee: Unassigned
            Reporter: Dan Sanduleac (dsanduleac)
            Votes: 1
            Watchers: 7


              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified
