PySpark's sample() method crashes with an ImportError on the workers if numpy is installed on the driver machine but not on the workers. I'm not sure of the best way to fix this. A general mechanism for automatically shipping libraries from the driver to the workers would address it, but that could be complicated to implement.
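One low-cost mitigation, and roughly the direction the linked SPARK-4477 issue takes, is to treat numpy as optional: use numpy's RNG when it imports, and fall back to the standard library's random otherwise, so sampling never hard-requires numpy on the workers. The sketch below is illustrative; the function names are hypothetical, not Spark's actual internals.

```python
# Sketch: prefer numpy's RNG when available, otherwise fall back to the
# stdlib, so sampling code does not crash on workers without numpy.
# `new_rng` is a hypothetical helper name, not part of PySpark.
try:
    from numpy.random import RandomState

    def new_rng(seed):
        rng = RandomState(seed)
        return rng.random_sample  # () -> float in [0.0, 1.0)
except ImportError:
    import random

    def new_rng(seed):
        rng = random.Random(seed)
        return rng.random  # same contract, pure stdlib

draw = new_rng(42)
# Bernoulli sampling as sample(withReplacement=False) does conceptually:
# keep each element independently with probability ~0.1.
sample = [x for x in range(100) if draw() < 0.1]
```

The trade-off is that the two branches produce different random streams for the same seed, so results are only reproducible within one environment, not across workers with and without numpy.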
- is related to: SPARK-4477 "remove numpy from RDDSampler of PySpark" (Resolved)