Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32136

Spark producing incorrect groupBy results when key is a struct with nullable properties

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 3.0.0
    • 3.0.1, 3.1.0
    • SQL

    Description

      I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm noticing a behaviour change in a particular aggregation we're doing, and I think I've tracked it down to how Spark is now treating nullable properties within the column being grouped by.
       
      Here's a simple test I've been able to set up to repro it:
       

      case class B(c: Option[Double])
      case class A(b: Option[B])
      val df = Seq(
      A(None),
      A(Some(B(None))),
      A(Some(B(Some(1.0))))
      ).toDF
      val res = df.groupBy("b").agg(count("*"))
      

      Spark 2.4.6 has the expected result:

      > res.show
      +-----+--------+
      |    b|count(1)|
      +-----+--------+
      |   []|       1|
      | null|       1|
      |[1.0]|       1|
      +-----+--------+
      
      > res.collect.foreach(println)
      [[null],1]
      [null,1]
      [[1.0],1]
      

      But Spark 3.0.0 has an unexpected result:

      > res.show
      +-----+--------+
      |    b|count(1)|
      +-----+--------+
      |   []|       2|
      |[1.0]|       1|
      +-----+--------+
      
      > res.collect.foreach(println)
      [[null],2]
      [[1.0],1]
      

      Notice how it has keyed one of the values in be as `[null]`; that is, an instance of B with a null value for the `c` property instead of a null for the overall value itself.

      Is this an intended change?

      Attachments

        Issue Links

          Activity

            People

              viirya L. C. Hsieh
              jasonmoore2k Jason Moore
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: