Spark / SPARK-22725

df.select on a Stream is broken, vs a List

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      See failing test at https://github.com/apache/spark/pull/19917

      Failing:

        test("SPARK-ABC123: support select with a splatted stream") {
          val df = spark.createDataFrame(sparkContext.emptyRDD[Row], StructType(List("bar", "foo").map {
            StructField(_, StringType, false)
          }))
          // Stream is lazy: only its head is evaluated eagerly, including under .map
          val allColumns = Stream(df.col("bar"), col("foo"))
          val result = df.select(allColumns: _*)
        }
      

      Succeeds:

        test("SPARK-ABC123: support select with a splatted stream") {
          val df = spark.createDataFrame(sparkContext.emptyRDD[Row], StructType(List("bar", "foo").map {
            StructField(_, StringType, false)
          }))
          // Seq(...) builds a strict List, so .map visits every element immediately
          val allColumns = Seq(df.col("bar"), col("foo"))
          val result = df.select(allColumns: _*)
        }
      

      After stepping through in a debugger, the difference manifests at https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120

      Changing seq.map to seq.toList.map causes the test to pass.

      I think there's a very subtle bug here: the Seq of columns passed into select is expected to be evaluated eagerly when .map is called on it, even though eager evaluation is not part of the Seq contract.
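
      The laziness described above can be reproduced without Spark. The following is a minimal sketch, not the code at the linked QueryPlan line: the object name StreamMapLaziness and the helper visitedByMap are hypothetical, and it only assumes that the mapping function has a side effect that surrounding code might depend on. It records which elements the function had visited at the moment map returned, and again after the result is forced. Note that Stream is deprecated in Scala 2.13 in favor of LazyList, so this may compile with a warning there.

```scala
import scala.collection.mutable.ArrayBuffer

object StreamMapLaziness {
  // Hypothetical helper (not part of Spark): maps over `xs` with a
  // side-effecting function and returns two snapshots of the elements the
  // function has visited: one right after `map` returns, one after the
  // mapped result has been fully traversed.
  def visitedByMap(xs: Seq[String]): (List[String], List[String]) = {
    val visited = ArrayBuffer.empty[String]
    val mapped = xs.map { x => visited += x; x.toUpperCase }
    val afterMap = visited.toList // snapshot before anything forces `mapped`
    mapped.foreach(_ => ())       // traverse the result, forcing any lazy tail
    (afterMap, visited.toList)
  }

  def main(args: Array[String]): Unit = {
    println(visitedByMap(List("bar", "foo")))   // strict collection
    println(visitedByMap(Stream("bar", "foo"))) // lazily evaluated tail
  }
}
```

      With a List, both snapshots should contain "bar" and "foo". With a Stream, the first snapshot should contain only "bar", because the tail is not mapped until something traverses the result. Any logic that inspects side effects of the mapping function right after map returns would therefore see different state for the two inputs, which is consistent with seq.toList.map making the test pass.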

          People

            Assignee: Unassigned
            Reporter: Andrew Ash (aash)