Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.3.1, 2.4.3
Description
I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
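(For reference, the flag named in the error message can be set on a running session; the snippet below assumes a SparkSession named spark is in scope, e.g. in spark-shell.)

// Enable implicit cartesian products, as suggested by the error message.
spark.conf.set("spark.sql.crossJoin.enabled", "true")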
I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
I realize the join I've written is not "correct" in the sense that it should be a left outer join rather than an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.
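For context, here is a minimal sketch of the shape of the query described above: two aggregates derived from the same DataFrame B, joined back to A. The column names and data are invented for illustration (this is not the attached reproduction), and whether it actually triggers the error may depend on the Spark version. It is written to be pasteable into spark-shell.

// Hypothetical schemas: A has one row per state; B has raw per-state measurements.
import org.apache.spark.sql.functions.{avg, max}
import spark.implicits._

val a = Seq(("CA", 1L), ("NY", 2L), ("TX", 3L)).toDF("state", "id")
val b = Seq(("CA", 10.0), ("CA", 20.0), ("NY", 5.0)).toDF("state", "value")

// B1 and B2 are both derived from B, so their "state" columns share lineage with B.
val b1 = b.groupBy("state").agg(avg("value").as("avg_value"))
val b2 = b.groupBy("state").agg(max("value").as("max_value"))

// Inner-joining A to both derived frames. On affected versions the analyzer can end up
// resolving the second condition to columns that are all on the same side of the join,
// making it look trivially true and raising "Join condition is missing or trivial".
val joined = a
  .join(b1, a("state") === b1("state"))
  .join(b2, a("state") === b2("state"))

joined.show()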
Attachments
Issue Links
- is duplicated by
  - SPARK-26231 Dataframes inner join on double datatype columns resulting in Cartesian product (Resolved)
- relates to
  - SPARK-20804 Join with null safe equality fails with AnalysisException (Closed)
  - SPARK-6459 Warn when Column API is constructing trivially true equality (Resolved)