Details
Description
When star expansion is used on left side of a join, the result will include joining key, while on the right side of join it doesn't. I would expect the behaviour to be symmetric (either include on both sides or on neither).
Example:
from pyspark.sql import SparkSession import pyspark.sql.functions as f spark = SparkSession.builder.getOrCreate() df_left = spark.range(5).withColumn('val', f.lit('left')) df_right = spark.range(3, 7).withColumn('val', f.lit('right')) df_merged = ( df_left .alias('left') .join(df_right.alias('right'), on='id', how='full_outer') .withColumn('left_all', f.struct('left.*')) .withColumn('right_all', f.struct('right.*')) ) df_merged.show()
result:
+---+----+-----+------------+---------+ | id| val| val| left_all|right_all| +---+----+-----+------------+---------+ | 0|left| null| {0, left}| {null}| | 1|left| null| {1, left}| {null}| | 2|left| null| {2, left}| {null}| | 3|left|right| {3, left}| {right}| | 4|left|right| {4, left}| {right}| | 5|null|right|{null, null}| {right}| | 6|null|right|{null, null}| {right}| +---+----+-----+------------+---------+
This behaviour started with release 3.2.0. Previously the key was not included on either side.
Result from Spark 3.1.3
+---+----+-----+--------+---------+ | id| val| val|left_all|right_all| +---+----+-----+--------+---------+ | 0|left| null| {left}| {null}| | 6|null|right| {null}| {right}| | 5|null|right| {null}| {right}| | 1|left| null| {left}| {null}| | 3|left|right| {left}| {right}| | 2|left| null| {left}| {null}| | 4|left|right| {left}| {right}| +---+----+-----+--------+---------+
I have a gut feeling this is related to these issues:
https://issues.apache.org/jira/browse/SPARK-39376
https://issues.apache.org/jira/browse/SPARK-34527
https://issues.apache.org/jira/browse/SPARK-38603