Spark does not support nested RDDs or performing Spark actions inside of transformations; attempting to do so usually leads to NullPointerExceptions (see SPARK-718 for one example). These confusing NPEs are one of the most common sources of Spark questions on StackOverflow:
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. by turning sc into a getter method); this would let us throw a descriptive error message instead of an NPE.
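A minimal sketch of that getter idea, written in plain Python for brevity (class and message wording are illustrative, not Spark's actual implementation): the context is stored in a private field, and every access goes through a property that fails fast with an actionable message when the field is null.

```python
class RDD:
    """Illustrative stand-in for Spark's RDD (names are hypothetical)."""

    def __init__(self, sc):
        # Keep the context in a private field so all access goes
        # through the guarded getter below.
        self._sc = sc

    @property
    def sc(self):
        # Instead of letting a null context surface later as an NPE deep
        # inside an action, fail fast with a clear explanation.
        if self._sc is None:
            raise Exception(
                "This RDD lacks a SparkContext. This can happen when "
                "(1) transformations and actions are invoked inside of "
                "other transformations rather than by the driver, or "
                "(2) the SparkContext has been shut down.")
        return self._sc

    def count(self):
        # Any action touches self.sc first, which triggers the check.
        return self.sc  # placeholder for the real action logic
```

Because every action routes through the property, nested-RDD misuse is caught at the first access rather than somewhere inside the scheduler.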
In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors:
We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the `__getnewargs__` method to raise a more useful exception (e.g. an UnsupportedOperation) with a helpful error message.
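A sketch of that `__getnewargs__` approach (the class and message text here are illustrative, not Spark's actual code): for pickle protocol 2 and above, pickling an instance of a class that defines `__getnewargs__` invokes that method, so raising from it converts a cryptic serialization failure into a clear error at the moment the RDD is captured.

```python
import pickle

class RDD:
    """Illustrative stand-in for PySpark's RDD (names are hypothetical)."""

    def __getnewargs__(self):
        # pickle calls __getnewargs__ while serializing this instance,
        # so raising here surfaces a clear, actionable error instead of
        # an obscure pickling failure.
        raise Exception(
            "It appears that you are attempting to broadcast an RDD or "
            "reference an RDD from within a transformation. RDD "
            "transformations and actions can only be invoked by the "
            "driver.")

rdd = RDD()
try:
    # Simulates shipping a closure that captured an RDD to the workers.
    pickle.dumps(rdd)
except Exception as e:
    print(e)
```

This is attractive because it needs no changes to the serialization machinery itself; the check piggybacks on pickle's existing protocol.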
Users may also see confusing NPEs when calling methods on stopped SparkContexts, so I've added checks for that as well.
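The stopped-context check can follow the same fail-fast pattern; a minimal sketch (method names and message are hypothetical, not Spark's actual API):

```python
class SparkContext:
    """Illustrative stand-in for a context with a stopped-state guard."""

    def __init__(self):
        self._stopped = False

    def stop(self):
        # Mark the context unusable rather than nulling fields that
        # later methods would dereference and NPE on.
        self._stopped = True

    def _assert_not_stopped(self):
        # Called at the top of user-facing methods so a stopped context
        # produces a clear error instead of a confusing NPE.
        if self._stopped:
            raise Exception("Cannot call methods on a stopped SparkContext.")

    def parallelize(self, data):
        self._assert_not_stopped()
        return list(data)  # placeholder for real RDD creation
```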