Spark / SPARK-33726

Duplicate field names cause wrong answers during aggregation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.4, 3.0.1
    • Fix Version/s: 2.4.8, 3.0.2, 3.1.1
    • Component/s: SQL

    Description

      We saw this bug at Workday.

      When different fields share the same name, org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate can return a fixed-length batch when it should have returned a variable-length batch, which leads to wrong results.

      This example produces wrong results in the Spark shell:

      scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
       

      x    x    ma    mb
      -2   2    0     null
      -1   1    null  1
      0    0    0     0

      instead of the correct output:

      x    x    ma    mb
      0    0    0     0
      -2   2    2     2
      -1   1    1     1

      The issue can be solved by iterating over the fields themselves instead of field names. 
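
      The difference between the two iteration styles can be illustrated with a minimal, self-contained sketch. It does not use Spark: Field, Schema, LongType, StringType and DuplicateNameSketch are hypothetical stand-ins, and the assumption that a lookup by name returns the first matching field is made only for illustration. The order of the two x fields in the key schema is also an assumption.

```scala
// Hypothetical stand-ins for schema types; this is NOT Spark's real API.
sealed trait DataType { def isFixedLength: Boolean }
case object LongType extends DataType { val isFixedLength = true }
case object StringType extends DataType { val isFixedLength = false }

final case class Field(name: String, dataType: DataType)

final case class Schema(fields: Seq[Field]) {
  def fieldNames: Seq[String] = fields.map(_.name)

  // Assumption: a lookup by name returns the FIRST field with that name,
  // so a later field with the same name is never reached.
  def apply(name: String): Field = fields.find(_.name == name).get
}

object DuplicateNameSketch {
  // Iterating over names: both "x" entries resolve to the first field.
  def allFixedByName(s: Schema): Boolean =
    s.fieldNames.forall(n => s(n).dataType.isFixedLength)

  // Iterating over the fields themselves: each field's own type is checked.
  def allFixedByField(s: Schema): Boolean =
    s.fields.forall(_.dataType.isFixedLength)

  // Two grouping keys named "x": one long, one string (order assumed).
  val key: Schema = Schema(Seq(Field("x", LongType), Field("x", StringType)))

  def main(args: Array[String]): Unit = {
    println(s"by name:  all fixed-length = ${allFixedByName(key)}")
    println(s"by field: all fixed-length = ${allFixedByField(key)}")
  }
}
```

      In this sketch the name-based check never sees the string field and reports the key as fixed-length, while the field-based check reports it as variable-length.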

          People

            Assignee: Yian Liou (yliou)
            Reporter: Yian Liou (yliou)