[SPARK-45509] Investigate the behavior difference in self-join - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0, 4.0.0
Fix Version/s: None
Component/s: Connect, PySpark
Labels:
- pull-request-available

Description

SPARK-45220 uncovered a behavior difference between classic Spark and Spark Connect in a self-join scenario.

For instance, the following query works in classic Spark (without Spark Connect):

from pyspark.sql import Row
from pyspark.sql import functions as sf

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)])

joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
joined.show()

But in Spark Connect, it throws this exception:

pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]

On the other hand, this query fails in classic Spark:

df.join(df, df.name == df.name, "outer").select(df.name).show()

pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...

However, the same query works in Spark Connect.

We need to investigate the behavior difference and fix it.

Issue Links

links to

GitHub Pull Request #43465

GitHub Pull Request #43699

People

Assignee: Unassigned

Reporter: Allison Wang

Votes: 0

Watchers: 1

Dates

Created: 12/Oct/23 01:05

Updated: 08/Nov/23 17:19