Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32136

Spark producing incorrect groupBy results when key is a struct with nullable properties

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 3.0.0
    • 3.0.1, 3.1.0
    • SQL

    Description

      I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0 and I'm noticing a behaviour change in a particular aggregation we're doing, and I think I've tracked it down to how Spark is now treating nullable properties within the column being grouped by.
       
      Here's a simple test I've been able to set up to repro it:
       

      case class B(c: Option[Double])
      case class A(b: Option[B])
      val df = Seq(
      A(None),
      A(Some(B(None))),
      A(Some(B(Some(1.0))))
      ).toDF
      val res = df.groupBy("b").agg(count("*"))
      

      Spark 2.4.6 has the expected result:

      > res.show
      +-----+--------+
      |    b|count(1)|
      +-----+--------+
      |   []|       1|
      | null|       1|
      |[1.0]|       1|
      +-----+--------+
      
      > res.collect.foreach(println)
      [[null],1]
      [null,1]
      [[1.0],1]
      

      But Spark 3.0.0 has an unexpected result:

      > res.show
      +-----+--------+
      |    b|count(1)|
      +-----+--------+
      |   []|       2|
      |[1.0]|       1|
      +-----+--------+
      
      > res.collect.foreach(println)
      [[null],2]
      [[1.0],1]
      

      Notice how it has keyed one of the values in be as `[null]`; that is, an instance of B with a null value for the `c` property instead of a null for the overall value itself.

      Is this an intended change?

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            viirya L. C. Hsieh
            jasonmoore2k Jason Moore
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment