Spark / SPARK-12916

Support Row.fromSeq and Row.toSeq methods in pyspark


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark, SQL

    Description

      PySpark should also have access to Row functions such as fromSeq and toSeq, which are exposed in the Scala API:
      https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row

      This would be useful when constructing custom columns from functions applied to DataFrame rows. A good example is in the following SO thread:

      http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe

      import org.apache.spark.sql.types._
      import org.apache.spark.sql.Row
      
      // assumes an existing DataFrame `df` with columns
      // x: Long, y: Double, z: String, and a SQLContext `sqlContext`
      
      def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = 
        Seq(x * y, z.head.toInt * y)
      
      val schema = StructType(df.schema.fields ++
        Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
      
      val rows = df.rdd.map(r => Row.fromSeq(
        r.toSeq ++
        foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))
      
      val df2 = sqlContext.createDataFrame(rows, schema)
      
      df2.show
      // +---+----+---+----+-----+
      // |  x|   y|  z| foo|  bar|
      // +---+----+---+----+-----+
      // |  1| 3.0|  a| 3.0|291.0|
      // |  2|-1.0|  b|-2.0|-98.0|
      // |  3| 0.0|  c| 0.0|  0.0|
      // +---+----+---+----+-----+
      

      I am ready to work on this feature.

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Shubhanshu Mishra (shubhanshumishra@gmail.com)
            Shivram Mani
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: