Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-40149

Star expansion after outer join asymmetrically includes joining key

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 3.2.0, 3.2.1, 3.3.0, 3.2.2
    • 3.3.1, 3.2.3, 3.4.0
    • SQL
    • None

    Description

      When star expansion is used on left side of a join, the result will include joining key, while on the right side of join it doesn't. I would expect the behaviour to be symmetric (either include on both sides or on neither). 

      Example:

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as f
      
      spark = SparkSession.builder.getOrCreate()
      
      df_left = spark.range(5).withColumn('val', f.lit('left'))
      df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
      
      df_merged = (
          df_left
          .alias('left')
          .join(df_right.alias('right'), on='id', how='full_outer')
          .withColumn('left_all', f.struct('left.*'))
          .withColumn('right_all', f.struct('right.*'))
      )
      
      df_merged.show()
      

      result:

      +---+----+-----+------------+---------+
      | id| val|  val|    left_all|right_all|
      +---+----+-----+------------+---------+
      |  0|left| null|   {0, left}|   {null}|
      |  1|left| null|   {1, left}|   {null}|
      |  2|left| null|   {2, left}|   {null}|
      |  3|left|right|   {3, left}|  {right}|
      |  4|left|right|   {4, left}|  {right}|
      |  5|null|right|{null, null}|  {right}|
      |  6|null|right|{null, null}|  {right}|
      +---+----+-----+------------+---------+
      

      This behaviour started with release 3.2.0. Previously the key was not included on either side.
      Result from Spark 3.1.3

      +---+----+-----+--------+---------+
      | id| val|  val|left_all|right_all|
      +---+----+-----+--------+---------+
      |  0|left| null|  {left}|   {null}|
      |  6|null|right|  {null}|  {right}|
      |  5|null|right|  {null}|  {right}|
      |  1|left| null|  {left}|   {null}|
      |  3|left|right|  {left}|  {right}|
      |  2|left| null|  {left}|   {null}|
      |  4|left|right|  {left}|  {right}|
      +---+----+-----+--------+---------+ 

      I have a gut feeling this is related to these issues:
      https://issues.apache.org/jira/browse/SPARK-39376
      https://issues.apache.org/jira/browse/SPARK-34527
      https://issues.apache.org/jira/browse/SPARK-38603

       

      Attachments

        Activity

          People

            cloud_fan Wenchen Fan
            otakar Otakar Truněček
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: