Description
As a platform, Spark should enable framework developers to accomplish outside the Spark codebase much of what can be accomplished inside it. One obstacle to this is a historical pattern of excessive data hiding in Spark, e.g., `expr` in `Column` not being accessible. This issue is an example of that pattern as it applies to `Dataset`.
Consider a transformation with the signature `def foo[A](ds: Dataset[A]): Dataset[A]` whose implementation requires the use of `toDF()`. Getting back to `Dataset[A]` requires calling `.as[A]`, which in turn requires an implicit `Encoder[A]`. A naive approach would change the function signature to `foo[A : Encoder]`, but this is poor API design that unnecessarily forces implicits to be carried from user code into framework code. We know an `Encoder[A]` exists because we hold an instance of `Dataset[A]`... but its `encoder` is not accessible.
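To make the friction concrete, here is a minimal sketch (illustrative only; `foo` and its body are hypothetical) of what framework code is forced to look like today, with the `Encoder[A]` context bound leaking into the framework-level signature:

```scala
import org.apache.spark.sql.{Dataset, Encoder}

// Hypothetical framework transformation illustrating the problem: once the
// Dataset is converted to a DataFrame, recovering Dataset[A] via .as[A]
// forces an implicit Encoder[A] into the framework-level signature.
def foo[A: Encoder](ds: Dataset[A]): Dataset[A] = {
  val df = ds.toDF()   // drop to untyped DataFrame operations
  // ... untyped transformations on df here ...
  df.as[A]             // needs the implicit Encoder[A] from the context bound
}
```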
The solution is simple: expose `encoder` as a `@transient val`, just as is already done for `queryExecution`.
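For illustration, a sketch of how the same framework code could look if the encoder were accessible on `Dataset` (the public `encoder` member below is the proposed change, not the current API):

```scala
import org.apache.spark.sql.{Dataset, Encoder}

// Sketch of the same transformation with the proposed change.
// Note: `ds.encoder` assumes the proposed public member; it is not
// currently accessible.
def foo[A](ds: Dataset[A]): Dataset[A] = {
  implicit val enc: Encoder[A] = ds.encoder  // reuse the encoder the Dataset already carries
  val df = ds.toDF()
  // ... untyped transformations on df here ...
  df.as[A]
}
```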