[SPARK-24780] DataFrame.column_name should resolve to a distinct ref - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None

Description

If we join a DataFrame with another DataFrame that has the same column names as those used in the join condition (e.g. because the two share lineage on one of those columns), then even when the condition is written against each DataFrame explicitly, the column references that come back don't carry the DataFrame alias, and the join ends up being treated as a cross join.

For example, this currently works even though posts_by_sampled_authors and mailing_list_posts_in_reply_to both contain in_reply_to and message_id fields:

posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [F.col("mailing_list_posts_in_reply_to.in_reply_to") == F.col("posts_by_sampled_authors.message_id")],
 "inner")

But a similarly written expression:

posts_with_replies = posts_by_sampled_authors.join(
 mailing_list_posts_in_reply_to,
 [mailing_list_posts_in_reply_to.in_reply_to == posts_by_sampled_authors.message_id],
 "inner")

will fail.

I'm not sure what's going on inside column resolution that causes it to get confused.
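
One way to picture the difference between the two forms is the toy model below. It is only a guess at the mechanism, not Spark's actual resolver: it assumes that `df.column_name` yields a reference bound to an internal attribute ID, that DataFrames derived from the same parent keep those IDs, and that a string like `"alias.column"` is instead matched by alias and name. `Attr`, `Frame`, `sides_by_id` and `sides_by_qualified_name` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # source of unique attribute IDs for the toy model


@dataclass(frozen=True)
class Attr:
    """A column as the toy resolver sees it: a name plus a unique ID."""
    name: str
    attr_id: int


class Frame:
    """Hypothetical stand-in for a DataFrame: an alias plus its attributes."""

    def __init__(self, alias, attrs):
        self.alias = alias
        self.attrs = list(attrs)

    @classmethod
    def from_columns(cls, alias, names):
        return cls(alias, [Attr(n, next(_next_id)) for n in names])

    def derive(self, alias):
        # Models a filter or sample: same attributes (same IDs), new alias.
        return Frame(alias, self.attrs)

    def __getattr__(self, name):
        # Models `df.column_name`: an ID-bound attribute that carries no alias.
        for attr in self.__dict__.get("attrs", []):
            if attr.name == name:
                return attr
        raise AttributeError(name)


def sides_by_id(attr, left, right):
    """Join inputs that an ID-bound reference could belong to."""
    return [f.alias for f in (left, right) if attr in f.attrs]


def sides_by_qualified_name(qualified, left, right):
    """Join inputs that an 'alias.column' string could belong to."""
    alias, name = qualified.split(".", 1)
    return [
        f.alias
        for f in (left, right)
        if f.alias == alias and any(a.name == name for a in f.attrs)
    ]


# Both inputs come from the same parent, so they share attribute IDs.
posts = Frame.from_columns("mailing_list_posts", ["message_id", "in_reply_to"])
posts_by_sampled_authors = posts.derive("posts_by_sampled_authors")
mailing_list_posts_in_reply_to = posts.derive("mailing_list_posts_in_reply_to")

print(sides_by_id(
    mailing_list_posts_in_reply_to.in_reply_to,
    posts_by_sampled_authors,
    mailing_list_posts_in_reply_to,
))  # ['posts_by_sampled_authors', 'mailing_list_posts_in_reply_to']

print(sides_by_qualified_name(
    "mailing_list_posts_in_reply_to.in_reply_to",
    posts_by_sampled_authors,
    mailing_list_posts_in_reply_to,
))  # ['mailing_list_posts_in_reply_to']
```

Under these assumptions the ID-bound reference matches both join inputs, so a condition built from such references would not tie one side to the other, whereas the alias-qualified string matches only one input.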

Issue Links

is related to

SPARK-30218 Columns used in inequality conditions for joins not resolved correctly in case of common lineage

Resolved

People

Assignee: Unassigned

Reporter: Holden Karau

Votes: 0

Watchers: 4

Dates

Created: 11/Jul/18 01:36

Updated: 16/Mar/20 22:55