Spark / SPARK-3866 Clean up python/run-tests problems / SPARK-3910

./python/pyspark/mllib/classification.py doctests fail with module name pollution


    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: PySpark


      In the ./python/run-tests script, we run the doctests in ./pyspark/mllib/classification.py.
      The output is as follows:

      $ ./python/run-tests
      Running test: pyspark/mllib/classification.py
      Traceback (most recent call last):
        File "pyspark/mllib/classification.py", line 20, in <module>
          import numpy
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
          from . import add_newdocs
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
          from numpy.lib import add_newdoc
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
          from .type_check import *
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
          import numpy.core.numeric as _nx
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
          from numpy.testing import Tester
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
          from .utils import *
        File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
          from tempfile import mkdtemp
        File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
          from random import Random as _Random
        File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
          from pyspark.rdd import RDD
        File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
          from pyspark.context import SparkContext
        File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
          from tempfile import NamedTemporaryFile
      ImportError: cannot import name NamedTemporaryFile
              0.07 real         0.04 user         0.02 sys
      Had test failures; see logs.

      The problem is a cyclic import of the tempfile module.
      The cause is that the pyspark.mllib.random module lives in the same directory as the pyspark.mllib.classification module.
      The classification module imports numpy, and numpy in turn imports the tempfile module internally.
      Since the first entry of sys.path is the directory "./python/pyspark/mllib" (where the executed file "classification.py" lives), tempfile imports the pyspark.mllib.random module instead of the standard library "random" module.
      The import chain eventually reaches tempfile again, and a cyclic import is formed.

      Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile → (cyclic import!!)

      Furthermore, stat is a standard library module and a pyspark.mllib.stat module also exists, so the same problem may occur there.

      commit: 0e8203f4fb721158fb27897680da476174d24c4b

      A fundamental solution is to avoid module names that collide with the standard library (currently "random" and "stat").
      The difficulty with this solution is that pyspark.mllib.random and pyspark.mllib.stat may already be in use, so renaming them would break existing code.




            • Assignee:
              cocoatomo Tomohiko K.
            • Votes: 0
            • Watchers: 5


              • Created: