Description
PySpark can load numpy functions, but not scipy.special functions. For example, take this snippet:

from numpy import exp
from scipy.special import gammaln

a = range(1, 11)
b = sc.parallelize(a)
c = b.map(exp)
d = b.map(gammaln)
Calling c.collect() will return the expected result. However, calling d.collect() will fail with
KeyError: (('gammaln',), <function _getobject at 0x10c0879b0>, ('scipy.special', 'gammaln'))
raised from _getobject in the cloudpickle.py module.
The reason is that _getobject executes __import__(modname), which only loads the top-level package X when modname has the form X.Y. The lookup then fails because gammaln is not an attribute of the top-level scipy package. The fix (for which I will shortly submit a PR) is to pass fromlist=[attribute] to the __import__ call, which makes it return the innermost module instead.
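A minimal sketch of the helper with the fix applied; this mirrors the behavior described above rather than the exact code in Spark's bundled cloudpickle.py:

def _getobject(modname, attribute):
    # Buggy: mod = __import__(modname) loads only the top-level package,
    # so for ('scipy.special', 'gammaln') mod is 'scipy' and the lookup
    # below fails.
    # Fixed: a non-empty fromlist makes __import__ return the innermost
    # module, scipy.special, where 'gammaln' does exist.
    mod = __import__(modname, fromlist=[attribute])
    return mod.__dict__[attribute]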
See https://docs.python.org/2/library/functions.html#__import__ and http://stackoverflow.com/questions/9544331/from-a-b-import-x-using-import
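Concretely, assuming scipy is installed, the difference is easy to verify in a plain Python session:

# Without fromlist, a dotted name returns the top-level package:
mod = __import__('scipy.special')
print(mod.__name__)               # 'scipy'
print('gammaln' in mod.__dict__)  # False: gammaln lives in scipy.special

# With a non-empty fromlist, the innermost module is returned:
mod = __import__('scipy.special', fromlist=['gammaln'])
print(mod.__name__)               # 'scipy.special'
print('gammaln' in mod.__dict__)  # True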