Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17668

Support representing structs with case classes and tuples in spark sql udf inputs

    Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      after having gotten used to have case classes represent complex structures in Datasets, i am surprised to find out that when i work in DataFrames with udfs no such magic exists, and i have to fall back to manipulating Row objects, which is error prone and somewhat ugly.

      for example:

      case class Person(name: String, age: Int)
      
      val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", "id")
      val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 1) }).apply(col("person")))
      df1.printSchema
      df1.show
      

      leads to:

      java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Person
      

        Issue Links

          Activity

          Hide
          koert koert kuipers added a comment -
          Show
          koert koert kuipers added a comment - original conversation is here: https://www.mail-archive.com/user@spark.apache.org/msg57338.html
          Hide
          koert koert kuipers added a comment -

          similar issues with tuples:

          val df = Seq((1, (2, 3)), (4, (5, 6))).toDF("x", "y")
          val df1 = df.withColumn("z", udf({ y: (Int, Int) => y._1 + y._2 }).apply(col("y")))
          df1.printSchema
          df1.show
          

          gives:

          java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
          
          Show
          koert koert kuipers added a comment - similar issues with tuples: val df = Seq((1, (2, 3)), (4, (5, 6))).toDF("x", "y") val df1 = df.withColumn("z", udf({ y: (Int, Int) => y._1 + y._2 }).apply(col("y"))) df1.printSchema df1.show gives: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
          Hide
          apachespark Apache Spark added a comment -

          User 'koertkuipers' has created a pull request for this issue:
          https://github.com/apache/spark/pull/16889

          Show
          apachespark Apache Spark added a comment - User 'koertkuipers' has created a pull request for this issue: https://github.com/apache/spark/pull/16889

            People

            • Assignee:
              Unassigned
              Reporter:
              koert koert kuipers
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development