Spark / SPARK-12916

Support Row.fromSeq and Row.toSeq methods in pyspark


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark, SQL

    Description

      PySpark should also have access to Row functions such as fromSeq and toSeq, which are exposed in the Scala API:
      https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row

      This would be useful when constructing custom columns from functions applied to DataFrame rows. A good example is in the following SO thread:

      http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe

      import org.apache.spark.sql.types._
      import org.apache.spark.sql.Row
      
      // assumes an existing DataFrame `df` with columns
      // x: Long, y: Double, z: String, and a SQLContext `sqlContext`
      
      def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = 
        Seq(x * y, z.head.toInt * y)
      
      val schema = StructType(df.schema.fields ++
        Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
      
      val rows = df.rdd.map(r => Row.fromSeq(
        r.toSeq ++
        foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))
      
      val df2 = sqlContext.createDataFrame(rows, schema)
      
      df2.show
      // +---+----+---+----+-----+
      // |  x|   y|  z| foo|  bar|
      // +---+----+---+----+-----+
      // |  1| 3.0|  a| 3.0|291.0|
      // |  2|-1.0|  b|-2.0|-98.0|
      // |  3| 0.0|  c| 0.0|  0.0|
      // +---+----+---+----+-----+
      

      I am ready to work on this feature.

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Shubhanshu Mishra (shubhanshumishra@gmail.com)
            Shivram Mani
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: