Description
As a platform, Spark should enable framework developers to accomplish outside the Spark codebase much of what can be accomplished inside it. One obstacle to this is a historical pattern of excessive data hiding in Spark, e.g., `expr` in `Column` not being accessible. This issue is an example of that pattern as it applies to `Dataset`.
Consider a transformation with the signature `def foo[A](ds: Dataset[A]): Dataset[A]` whose implementation requires the use of `toDF()`. Getting back to `Dataset[A]` requires calling `.as[A]`, which in turn requires an implicit `Encoder[A]`. A naive approach would change the function signature to `foo[A : Encoder]`, but this is poor API design that unnecessarily forces implicits to be carried from user code into framework code. We know an `Encoder[A]` exists because we hold an instance of `Dataset[A]`... but its `encoder` is not accessible.
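To make the friction concrete, here is a minimal sketch (illustrative only; `foo` and its body are hypothetical) of what framework code is forced to look like today, with the `Encoder[A]` context bound leaking into the framework-level signature:

```scala
import org.apache.spark.sql.{Dataset, Encoder}

// Hypothetical framework transformation illustrating the problem: once the
// Dataset is converted to a DataFrame, recovering Dataset[A] via .as[A]
// forces an implicit Encoder[A] into the framework-level signature.
def foo[A: Encoder](ds: Dataset[A]): Dataset[A] = {
  val df = ds.toDF()   // drop to untyped DataFrame operations
  // ... untyped transformations on df here ...
  df.as[A]             // needs the implicit Encoder[A] from the context bound
}
```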
The solution is simple: expose `encoder` as a `@transient val`, just as is already done for `queryExecution`.
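For illustration, a sketch of how the same framework code could look if the encoder were accessible on `Dataset` (the public `encoder` member below is the proposed change, not the current API):

```scala
import org.apache.spark.sql.{Dataset, Encoder}

// Sketch of the same transformation with the proposed change.
// Note: `ds.encoder` assumes the proposed public member; it is not
// currently accessible.
def foo[A](ds: Dataset[A]): Dataset[A] = {
  implicit val enc: Encoder[A] = ds.encoder  // reuse the encoder the Dataset already carries
  val df = ds.toDF()
  // ... untyped transformations on df here ...
  df.as[A]
}
```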