Spark / SPARK-3866 Clean up python/run-tests problems / SPARK-3910

./python/pyspark/mllib/classification.py doctests fail with module name pollution


    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: PySpark


      In the ./python/run-tests script, we run the doctests in ./pyspark/mllib/classification.py.
      The output is as follows:

      $ ./python/run-tests
      Running test: pyspark/mllib/classification.py
      Traceback (most recent call last):
        File "pyspark/mllib/classification.py", line 20, in <module>
          import numpy
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
          from . import add_newdocs
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
          from numpy.lib import add_newdoc
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
          from .type_check import *
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
          import numpy.core.numeric as _nx
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
          from numpy.testing import Tester
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
          from .utils import *
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
          from tempfile import mkdtemp
        File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
          from random import Random as _Random
        File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
          from pyspark.rdd import RDD
        File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
          from pyspark.context import SparkContext
        File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
          from tempfile import NamedTemporaryFile
      ImportError: cannot import name NamedTemporaryFile
              0.07 real         0.04 user         0.02 sys
      Had test failures; see logs.

      The problem is a cyclic import of the tempfile module.
      The cause is that the pyspark.mllib.random module lives in the same directory as the pyspark.mllib.classification module.
      The classification module imports numpy, and numpy in turn imports the tempfile module internally.
      Since the first entry of sys.path is the directory "./python/pyspark/mllib" (where the executed file "classification.py" lives), tempfile imports the pyspark.mllib.random module instead of the standard library "random" module.
      The import chain eventually reaches tempfile again, and a cyclic import is formed.

      Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile → (cyclic import!!)

      Furthermore, stat is a standard library module and a pyspark.mllib.stat module also exists, so the same problem may occur there.

      commit: 0e8203f4fb721158fb27897680da476174d24c4b

      A fundamental solution is to avoid module names that collide with the standard library (currently "random" and "stat").
      The difficulty with this solution is that pyspark.mllib.random and pyspark.mllib.stat may already be in use, so renaming them would break existing code.




            • Assignee:
              cocoatomo Tomohiko K.
            • Votes: 0
            • Watchers: 5


              • Created: