Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0, 4.0.0
Fix Version/s: None
Description
SPARK-45220 uncovered a behavior difference between classic Spark and Spark Connect in self-join and similar ambiguous-column scenarios.
For instance, the following query works in classic Spark (without Spark Connect):
from pyspark.sql import Row, functions as sf

df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)])
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
joined.show()
But in Spark Connect, it throws this exception:
pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
Conversely, this self-join query fails in classic Spark:
df.join(df, df.name == df.name, "outer").select(df.name).show()
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...
but the same query works with Spark Connect.
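To reproduce both behaviors side by side, the snippets above can be run against each session type; here is a minimal setup sketch, assuming a local Spark Connect server on the default port:

from pyspark.sql import SparkSession

# Classic Spark: the query is planned and executed in-process.
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Spark Connect: the query is sent to a Connect server. Run this in a
# separate Python process; "sc://localhost:15002" assumes a server
# running locally on the default port.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()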
We need to investigate the behavior difference and fix it.
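As a possible workaround until the resolution logic is aligned, the first query can reference columns through explicit DataFrame aliases instead of attribute access. This is only an illustrative sketch, not the proposed fix:

from pyspark.sql import functions as sf

a = df.alias("a")
b = df2.alias("b")

# Qualified col() references avoid the attribute-based column
# resolution that behaves differently between the two backends.
joined = a.join(b, sf.col("a.name") == sf.col("b.name"), "outer")
joined.sort(sf.desc(sf.col("a.name"))).show()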