SPARK-34794: Nested higher-order functions broken in DSL

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.0.3, 3.1.2, 3.2.0
    • Component/s: SQL
    • Environment: 3.1.1

      Description

      In Spark 3, if I have:

      val df = Seq(
          (Seq(1,2,3), Seq("a", "b", "c"))
      ).toDF("numbers", "letters")
      

      and I want to take the cross product of these two arrays, I can do the following in SQL:

      df.selectExpr("""
          FLATTEN(
              TRANSFORM(
                  numbers,
                  number -> TRANSFORM(
                      letters,
                      letter -> (number AS number, letter AS letter)
                  )
              )
          ) AS zipped
      """).show(false)
      +------------------------------------------------------------------------+
      |zipped                                                                  |
      +------------------------------------------------------------------------+
      |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
      +------------------------------------------------------------------------+
      

      This works fine. But when I try the equivalent using the Scala DSL, the result is wrong:

      df.select(
          f.flatten(
              f.transform(
                  $"numbers",
                  (number: Column) => { f.transform(
                      $"letters",
                      (letter: Column) => { f.struct(
                          number.as("number"),
                          letter.as("letter")
                      ) }
                  ) }
              )
          ).as("zipped")
      ).show(10, false)
      +------------------------------------------------------------------------+
      |zipped                                                                  |
      +------------------------------------------------------------------------+
      |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
      +------------------------------------------------------------------------+
      

      Note that the numbers are not included in the output. The explain for this second version is:

      == Parsed Logical Plan ==
      'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444]
      +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
         +- LocalRelation [_1#303, _2#304]
      
      == Analyzed Logical Plan ==
      zipped: array<struct<number:string,letter:string>>
      Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444]
      +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
         +- LocalRelation [_1#303, _2#304]
      
      == Optimized Logical Plan ==
      LocalRelation [zipped#444]
      
      == Physical Plan ==
      LocalTableScan [zipped#444]
      

      It appears the lambda variable name x is hardcoded. And sure enough: https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
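      The wrong output is consistent with the inner lambda shadowing the outer one: since both lambdas in the analyzed plan bind the same name (`lambda x#446`), references intended for the outer variable resolve to the inner binding instead. A minimal, Spark-free sketch of that resolution behavior (the `Expr`/`Ref`/`Pair` mini-AST below is purely illustrative, not Spark's actual classes):

```scala
// Illustrative mini expression language showing why two nested lambdas
// that both bind the name "x" lose access to the outer binding.
object ShadowingDemo {
  sealed trait Expr
  case class Ref(name: String) extends Expr        // a variable reference
  case class Pair(a: Expr, b: Expr) extends Expr   // like struct(number, letter)

  // Evaluate against a name -> value environment; an inner binding of the
  // same name hides (shadows) the outer one, just as in the analyzed plan
  // where both lambdas use `lambda x`.
  def eval(e: Expr, env: Map[String, String]): Any = e match {
    case Ref(n)     => env(n)
    case Pair(a, b) => (eval(a, env), eval(b, env))
  }

  def main(args: Array[String]): Unit = {
    // Distinct names: the pair sees both the outer and inner values.
    val ok = eval(Pair(Ref("number"), Ref("letter")),
                  Map("number" -> "1", "letter" -> "a"))
    println(ok) // prints (1,a)

    // Hardcoded "x" for both lambdas: binding the inner value under "x"
    // overwrites the outer one, so the pair contains the letter twice,
    // matching the {a, a} entries in the buggy output above.
    val outerEnv  = Map("x" -> "1")          // outer lambda bound number to x
    val innerEnv  = outerEnv + ("x" -> "a")  // inner lambda rebinds x to letter
    val broken = eval(Pair(Ref("x"), Ref("x")), innerEnv)
    println(broken) // prints (a,a)
  }
}
```

      This is why the SQL path works: the parser keeps the user-chosen names `number` and `letter` distinct, while the DSL path at the linked line reuses one hardcoded variable name for every lambda level.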


              People

              • Assignee: dsolow Daniel Solow
              • Reporter: dsolow1 Daniel Solow
