  1. Spark
  2. SPARK-6764

Add wheel package support for PySpark


Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Auto Closed
    • None
    • None
    • Deploy, PySpark

    Description

      spark-submit can ship one or more Python packages (.egg, .zip and .jar) to a job through the --py-files option.

      zip packaging

      Spark puts the zip file in its working directory and adds the absolute path to Python's sys.path. When the user program imports from it, zipimport is invoked automatically under the hood. As a result, data files and dynamic modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and .pyo files.
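
The zipimport limitation described above can be reproduced outside Spark. The following is a minimal sketch, not Spark's own code: it builds a throwaway zip, puts it on sys.path the way a --py-files archive would be, and shows that a module imports fine while a plain open() on a bundled data file fails. The names demo_pkg and greeting.txt are hypothetical.

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip holding a package with one module and one data file.
# demo_pkg and greeting.txt are hypothetical names used only for this sketch.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr(
        "demo_pkg/__init__.py",
        "import os\n"
        "DATA_PATH = os.path.join(os.path.dirname(__file__), 'greeting.txt')\n",
    )
    zf.writestr("demo_pkg/greeting.txt", "hello")

# Mimic what happens to a --py-files zip: its path is added to sys.path.
sys.path.insert(0, zip_path)
import demo_pkg  # resolved by zipimport

# The import works, but __file__ points inside the archive, so the
# filesystem cannot open a sibling data file by path.
try:
    with open(demo_pkg.DATA_PATH) as f:
        f.read()
    data_readable = True
except OSError:
    data_readable = False

print(demo_pkg.__file__)
print(data_readable)
```

Any package that locates its data files relative to __file__ hits the same failure when it is shipped as a zip.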

      egg packaging

      Spark puts the egg file in its working directory and adds the absolute path to Python's sys.path. Unlike a plain zip, an egg can handle data files and dynamic modules, as long as the package author uses the pkg_resources API properly. However, many Python modules do not use the pkg_resources API, which causes "ImportError" or "No such file" errors. Moreover, building eggs for dependencies and their transitive dependencies is a tedious job.
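
To illustrate why a resource API matters, here is a small sketch that reads a data file from inside an archive through the package loader instead of through a filesystem path. It uses the standard library's pkgutil.get_data as a stand-in for pkg_resources (an assumption made to keep the example dependency-free), and the names res_demo and greeting.txt are hypothetical.

```python
import os
import pkgutil
import sys
import tempfile
import zipfile

# Build a throwaway archive with a package and a bundled data file.
# res_demo and greeting.txt are hypothetical names for this sketch.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("res_demo/__init__.py", "")
    zf.writestr("res_demo/greeting.txt", "hello")

sys.path.insert(0, archive)

# pkgutil.get_data asks the package's loader for the bytes, so it works
# even though the file only exists inside the archive.
data = pkgutil.get_data("res_demo", "greeting.txt")
print(data)  # b'hello'
```

A package that reads its resources this way keeps working when imported from an archive; one that builds paths from __file__ and calls open() does not.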

      wheel packaging

      Supporting "wheel", the new standard Python package format, would be nice. With wheels, running spark-submit with complex dependencies becomes as simple as the following.

      1. Write a requirements.txt file.

      SQLAlchemy
      MySQL-python
      requests
      simplejson>=3.6.0,<=3.6.5
      pydoop
      

      2. Build the wheels with a single command. All dependencies are built as wheels.

      your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt
      

      3. Run spark-submit.

      your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
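
The comma-separated list that the find/sed pipeline above builds for --py-files can also be assembled in Python. This is only a sketch of that one step under the proposed workflow: the helper name py_files_arg is hypothetical, and the demo uses empty placeholder files rather than real wheels.

```python
import glob
import os
import tempfile


def py_files_arg(wheel_dir):
    """Return a comma-separated list of the .whl files under wheel_dir."""
    pattern = os.path.join(wheel_dir, "**", "*.whl")
    # recursive=True lets "**" match zero or more subdirectories.
    return ",".join(sorted(glob.glob(pattern, recursive=True)))


# Demo with empty placeholder files; the file names are made up.
demo_dir = tempfile.mkdtemp()
for name in ("pkg_a-1.0-py3-none-any.whl", "pkg_b-2.0-py3-none-any.whl"):
    open(os.path.join(demo_dir, name), "w").close()

arg = py_files_arg(demo_dir)
print(arg)
```

Joining the paths with "," avoids the trailing comma that the sed substitution leaves after the last file name.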
      

      If your PySpark driver is itself a package that consists of many modules:

      1. Write a setup.py for your PySpark driver package.

      from setuptools import (
          find_packages,
          setup,
      )
      
      setup(
          name='yourpkg',
          version='0.0.1',
          packages=find_packages(),
          install_requires=[
              'SQLAlchemy',
              'MySQL-python',
              'requests',
              'simplejson>=3.6.0,<=3.6.5',
              'pydoop',
          ],
      )
      

      2. Build the wheels with a single command. Your driver package and all of its dependencies are built as wheels.

      your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
      

      3. Run spark-submit.

      your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver_bootstrap.py
      


          People

            Assignee: Unassigned
            Reporter: Takao Magoori (takaomag)
            Votes: 11
            Watchers: 17
