  1. Spark
  2. SPARK-26696

Dataset encoder should be publicly accessible

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels:

      Description

      As a platform, Spark should enable framework developers to accomplish outside of the Spark codebase much of what can be accomplished inside the Spark codebase. One of the obstacles to this is a historical pattern of excessive data hiding in Spark, e.g., expr in Column not being accessible. This issue is an example of this pattern when it comes to Dataset.

      Consider a transformation with the signature `def foo[A](ds: Dataset[A]): Dataset[A]` whose implementation requires the use of `toDF()`. Getting back to `Dataset[A]` requires calling `.as[A]`, which in turn requires an implicit `Encoder[A]`. A naive approach would change the function signature to `foo[A : Encoder]`, but this is poor API design because it needlessly carries implicits from user code into framework code. We know `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... but its `encoder` is not accessible.
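      To make the problem concrete, here is a minimal sketch of the naive workaround described above. The `Record` case class, the `id` column filter, and the `EncoderWorkaround` object are hypothetical names chosen for illustration; only `toDF()`, `as[A]`, and the `Encoder` context bound come from the description.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type used only for this illustration.
case class Record(id: Long, name: String)

object EncoderWorkaround {
  // Hypothetical framework transformation. The context bound `A: Encoder`
  // forces every caller to have an implicit Encoder[A] in scope.
  def foo[A: Encoder](ds: Dataset[A]): Dataset[A] =
    ds.toDF()                 // drop to the untyped DataFrame API
      .filter(col("id") > 1)  // some untyped step (assumes an `id` column)
      .as[A]                  // needs the implicit Encoder[A] to get back

  def demo(): Seq[Record] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("encoder-workaround")
      .getOrCreate()
    import spark.implicits._ // supplies Encoder[Record] at the call site
    try foo(Seq(Record(1L, "a"), Record(2L, "b")).toDS()).collect().toSeq
    finally spark.stop()
  }

  def main(args: Array[String]): Unit = demo().foreach(println)
}
```

      The encoder requirement leaks into the signature of `foo` and of every framework function that calls it, even though the `Dataset[A]` argument already holds an encoder internally.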

      The solution is simple: make `encoder` a `@transient val`, just as is the case with `queryExecution`.
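      With `encoder` exposed, the transformation can keep its original signature. The following is a sketch under the assumption of Spark 3.0.0 or later (the fix version listed above); `Record`, the `id` filter, and `EncoderFromDataset` are again hypothetical names, and the encoder is passed explicitly to `as[A]` in place of the implicit.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type used only for this illustration.
case class Record(id: Long, name: String)

object EncoderFromDataset {
  // No Encoder context bound: the encoder is read from the Dataset itself.
  // Assumes a Spark version in which `Dataset.encoder` is publicly accessible.
  def foo[A](ds: Dataset[A]): Dataset[A] =
    ds.toDF()
      .filter(col("id") > 1) // some untyped step (assumes an `id` column)
      .as[A](ds.encoder)     // pass the Dataset's own encoder explicitly

  def demo(): Seq[Record] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("encoder-from-dataset")
      .getOrCreate()
    import spark.implicits._ // only needed here to build the input Dataset
    try foo(Seq(Record(1L, "a"), Record(2L, "b")).toDS()).collect().toSeq
    finally spark.stop()
  }

  def main(args: Array[String]): Unit = demo().foreach(println)
}
```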

              People

              • Assignee: Simeon Simeonov (simeons)
              • Reporter: Simeon Simeonov (simeons)
              • Votes: 0
              • Watchers: 2
