Spark / SPARK-23614

Union produces incorrect results when caching is used

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL

      Description

      We just upgraded from 2.2 to 2.3 and our test suite caught this error:

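      import org.apache.spark.sql.functions.{col, min}
      import session.implicits._ // encoder for TestData; `session` is assumed to be the SparkSession

      case class TestData(x: Int, y: Int, z: Int)

      val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
      val group1 = frame.groupBy("x").agg(min(col("y")) as "value") // min(y) per x
      val group2 = frame.groupBy("x").agg(min(col("z")) as "value") // min(z) per x
      group1.union(group2).show()
      // Incorrect: the last two rows should hold min(z), i.e. 3 and 6.
      // +---+-----+
      // |  x|value|
      // +---+-----+
      // |  1|    2|
      // |  4|    5|
      // |  1|    2|
      // |  4|    5|
      // +---+-----+
      group2.union(group1).show()
      // Incorrect: the last two rows should hold min(y), i.e. 2 and 5.
      // +---+-----+
      // |  x|value|
      // +---+-----+
      // |  1|    3|
      // |  4|    6|
      // |  1|    3|
      // |  4|    6|
      // +---+-----+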
      

      The error disappears if the first data frame is not cached, or if the two group-bys use separate copies of it. I'm not sure exactly what happens inside Spark, but errors that produce incorrect results rather than exceptions always concern me.
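
      For reference, here is a minimal sketch of the first workaround mentioned above (no cache() on the source Dataset). The object name UnionCacheWorkaround and the method unionOfMins are hypothetical names chosen for illustration, and the sketch assumes a SparkSession is passed in by the caller. It collects the union into sorted (x, value) pairs so the result can be compared regardless of row order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min}

case class TestData(x: Int, y: Int, z: Int)

// Hypothetical helper, not part of Spark: illustrates the "do not cache" workaround.
object UnionCacheWorkaround {
  // Returns the union of min(y) per x and min(z) per x as sorted (x, value) pairs.
  def unionOfMins(session: SparkSession): Seq[(Int, Int)] = {
    import session.implicits._

    // Same data as in the report, but without .cache() on the source Dataset.
    val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6)))

    val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
    val group2 = frame.groupBy("x").agg(min(col("z")) as "value")

    group1.union(group2)
      .collect()
      .map(row => (row.getInt(0), row.getInt(1)))
      .toSeq
      .sorted
  }
}
```

      With the data above, the expected rows are (1, 2), (1, 3), (4, 5) and (4, 6): one row per x from each aggregation. The other workaround from the report, building the two aggregations from separate copies of the data frame, is not shown here.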

            People

            • Assignee: L. C. Hsieh (viirya)
            • Reporter: Morten Hornbech (mhornbech)
            • Votes: 0
            • Watchers: 7
