Description
In Spark, SparkContext and SparkSession can only be used on the driver, not on executors. For example, this means that you cannot call someDataset.collect() inside of a Dataset or RDD transformation.
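For illustration, here is a minimal sketch of code that violates this rule; the object name NestedCollectExample and the datasets lookup and ds are made up for the example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NestedCollectExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-collect-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val lookup = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val ds = Seq("a", "b").toDS()

    // INVALID: the function passed to map runs on executors, but collect()
    // is a driver-side action. The serialized copy of `lookup` that reaches
    // the executors has its SparkSession reference nulled out, so this
    // fails at runtime.
    val broken = ds.map(key => lookup.filter(col("key") === key).collect().length)
    broken.show()
  }
}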
When Spark serializes RDDs and Datasets, references to SparkContext and SparkSession are nulled out (either by being marked @transient or via the Closure Cleaner). As a result, RDD and Dataset methods that use these driver-side-only objects (e.g. actions or transformations) will see null references and may fail with a NullPointerException. For example, here is the error from code which (via a chain of calls) tried to collect() a Dataset inside of a Dataset.map operation:
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
  at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
  at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
  [...]
The resulting NPE can be very confusing to users.
In SPARK-5063 I added logic to throw clearer error messages when performing these kinds of invalid operations on RDDs. The scope of this ticket is to implement similar logic for Datasets.
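As one possible shape for such a check (a sketch, not the actual patch; DriverSideGuard and assertOnDriver are hypothetical names, and the real logic would live inside Dataset itself), the nulled-out driver-side reference can be detected directly and turned into an actionable error instead of a NullPointerException:

import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

object DriverSideGuard {
  // On executors, a deserialized Dataset's SparkSession reference is null,
  // so a null session is a reliable signal that we are not on the driver.
  def assertOnDriver(session: SparkSession): Unit = {
    if (session == null) {
      throw new SparkException(
        "Dataset transformations and actions can only be invoked by the " +
          "driver, not inside of other Dataset transformations; for example, " +
          "ds1.map(x => ds2.count() * x) is invalid because the count action " +
          "cannot be performed inside of the ds1.map transformation.")
    }
  }
}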
Issue Links
- relates to: SPARK-5063 Display more helpful error messages for several invalid operations (Resolved)
- links to