SPARK-34794: Nested higher-order functions broken in DSL

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.0.3, 3.1.2, 3.2.0
    • Component/s: SQL
    • Environment: 3.1.1

      Description

      In Spark 3, if I have:

      val df = Seq(
          (Seq(1,2,3), Seq("a", "b", "c"))
      ).toDF("numbers", "letters")
      

      and I want to take the cross product of these two arrays, I can do the following in SQL:

      df.selectExpr("""
          FLATTEN(
              TRANSFORM(
                  numbers,
                  number -> TRANSFORM(
                      letters,
                      letter -> (number AS number, letter AS letter)
                  )
              )
          ) AS zipped
      """).show(false)
      +------------------------------------------------------------------------+
      |zipped                                                                  |
      +------------------------------------------------------------------------+
      |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
      +------------------------------------------------------------------------+
      

      This works fine. But when I try the equivalent using the Scala DSL, the result is wrong:

      df.select(
          f.flatten(
              f.transform(
                  $"numbers",
                  (number: Column) => { f.transform(
                      $"letters",
                      (letter: Column) => { f.struct(
                          number.as("number"),
                          letter.as("letter")
                      ) }
                  ) }
              )
          ).as("zipped")
      ).show(10, false)
      +------------------------------------------------------------------------+
      |zipped                                                                  |
      +------------------------------------------------------------------------+
      |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
      +------------------------------------------------------------------------+
      

      Note that the numbers are not included in the output. The explain for this second version is:

      == Parsed Logical Plan ==
      'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444]
      +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
         +- LocalRelation [_1#303, _2#304]
      
      == Analyzed Logical Plan ==
      zipped: array<struct<number:string,letter:string>>
      Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444]
      +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
         +- LocalRelation [_1#303, _2#304]
      
      == Optimized Logical Plan ==
      LocalRelation [zipped#444]
      
      == Physical Plan ==
      LocalTableScan [zipped#444]
      

      It appears the lambda variable name x is hardcoded. And sure enough: https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
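      The wrong output is consistent with the inner lambda shadowing the outer one: since both lambdas in the analyzed plan bind the same name (`lambda x#446`), references intended for the outer variable resolve to the inner binding instead. A minimal, Spark-free sketch of that resolution behavior (the `Expr`/`Ref`/`Pair` mini-AST below is purely illustrative, not Spark's actual classes):

```scala
// Illustrative mini expression language showing why two nested lambdas
// that both bind the name "x" lose access to the outer binding.
object ShadowingDemo {
  sealed trait Expr
  case class Ref(name: String) extends Expr        // a variable reference
  case class Pair(a: Expr, b: Expr) extends Expr   // like struct(number, letter)

  // Evaluate against a name -> value environment; an inner binding of the
  // same name hides (shadows) the outer one, just as in the analyzed plan
  // where both lambdas use `lambda x`.
  def eval(e: Expr, env: Map[String, String]): Any = e match {
    case Ref(n)     => env(n)
    case Pair(a, b) => (eval(a, env), eval(b, env))
  }

  def main(args: Array[String]): Unit = {
    // Distinct names: the pair sees both the outer and inner values.
    val ok = eval(Pair(Ref("number"), Ref("letter")),
                  Map("number" -> "1", "letter" -> "a"))
    println(ok) // prints (1,a)

    // Hardcoded "x" for both lambdas: binding the inner value under "x"
    // overwrites the outer one, so the pair contains the letter twice,
    // matching the {a, a} entries in the buggy output above.
    val outerEnv  = Map("x" -> "1")          // outer lambda bound number to x
    val innerEnv  = outerEnv + ("x" -> "a")  // inner lambda rebinds x to letter
    val broken = eval(Pair(Ref("x"), Ref("x")), innerEnv)
    println(broken) // prints (a,a)
  }
}
```

      This is why the SQL path works: the parser keeps the user-chosen names `number` and `letter` distinct, while the DSL path at the linked line reuses one hardcoded variable name for every lambda level.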


              People

              • Assignee: dsolow Daniel Solow
              • Reporter: dsolow1 Daniel Solow
