Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 3.4.0
- None
- None
Description
In SPARK-38864, the melt function was added to Dataset.
It would be nice if fields of struct columns could be used as id and value columns. This would allow for the following:
Given a Dataset with the following schema:
root
 |-- an: struct (nullable = false)
 |    |-- id: integer (nullable = false)
 |-- str: struct (nullable = false)
 |    |-- one: string (nullable = true)
 |    |-- two: string (nullable = true)
For example:
+---+-------------+
| an|          str|
+---+-------------+
|{1}|   {one, One}|
|{2}|  {two, null}|
|{3}|{null, three}|
|{4}| {null, null}|
+---+-------------+
Melting with value columns Seq("str.one", "str.two") on id columns Seq("an.id") would result in
+--+--------+-----+
|an|variable|value|
+--+--------+-----+
| 1| str.one|  one|
| 1| str.two|  One|
| 2| str.one|  two|
| 2| str.two| null|
| 3| str.one| null|
| 3| str.two|three|
| 4| str.one| null|
| 4| str.two| null|
+--+--------+-----+
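The requested semantics can be modelled without Spark. The following is a minimal sketch in plain Scala, not the Spark implementation: every name in it (An, Str, Record, MeltedRow, MeltModel) is hypothetical, and SQL null is modelled as None. It only shows how each input row yields one output row per value column, labelled with the dotted field path.

```scala
// Plain-Scala model of melting on struct fields; no Spark dependency.
// All types and names below are hypothetical and exist only for illustration.
final case class An(id: Int)
final case class Str(one: Option[String], two: Option[String])
final case class Record(an: An, str: Str)
final case class MeltedRow(id: Int, variable: String, value: Option[String])

object MeltModel {
  // Each value column is a dotted path paired with an accessor into the nested struct.
  val valueColumns: Seq[(String, Record => Option[String])] = Seq(
    "str.one" -> ((r: Record) => r.str.one),
    "str.two" -> ((r: Record) => r.str.two)
  )

  // Emit one output row per (input row, value column), carrying the id field along.
  def melt(rows: Seq[Record]): Seq[MeltedRow] =
    for {
      row <- rows
      (name, extract) <- valueColumns
    } yield MeltedRow(row.an.id, name, extract(row))

  // The example data from the description; None stands in for null.
  val input: Seq[Record] = Seq(
    Record(An(1), Str(Some("one"), Some("One"))),
    Record(An(2), Str(Some("two"), None)),
    Record(An(3), Str(None, Some("three"))),
    Record(An(4), Str(None, None))
  )

  def main(args: Array[String]): Unit =
    melt(input).foreach(println)
}
```

Four input rows and two value columns give eight melted rows, in the same order as the table above.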
See test in org.apache.spark.sql.MeltSuite:
test("SPARK-39292: melt with struct fields") {
  val df = meltWideDataDs.select(
    struct($"id").as("an"),
    struct(
      $"str1".as("one"),
      $"str2".as("two")
    ).as("str")
  )

  checkAnswer(
    Melt.of(df, Seq("an.id"), Seq("str.one", "str.two"), false, "variable", "value"),
    meltedWideDataRows.map(row => Row(
      row.getInt(0),
      row.getString(1) match {
        case "str1" => "str.one"
        case "str2" => "str.two"
      },
      row.getString(2)
    ))
  )
}
Issue Links
- is fixed by SPARK-38864 Unpivot / melt function for Dataset API (Resolved)
- links to