  1. Spark
  2. SPARK-25150

Joining DataFrames derived from the same source yields confusing/incorrect results

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:

      Join condition is missing or trivial.
      Either: use the CROSS JOIN syntax to allow cartesian products between these
      relations, or: enable implicit cartesian products by setting the configuration
      variable spark.sql.crossJoin.enabled=true;
      

      Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.

      I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.

      I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.

      I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.
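
      For readers without the attachments, the shape of the join described above can be sketched roughly as follows. This is not the attached zombie-analysis.py: the function name build_joined, the column names (state, kind) and the sample rows are all hypothetical, and whether this exact snippet triggers the same error may depend on the Spark version. It only illustrates two aggregates derived from one source being joined back to a third DataFrame.

```python
def build_joined(spark, how="inner"):
    """Join A to two DataFrames (B1, B2) that are both derived from B.

    `spark` is an existing SparkSession. All data and column names
    below are hypothetical and are NOT taken from the attached CSVs.
    """
    # A: one row per state
    states = spark.createDataFrame([("NY",), ("CA",)], ["state"])
    # B: one row per person
    persons = spark.createDataFrame(
        [("NY", "human"), ("NY", "zombie"), ("CA", "human")],
        ["state", "kind"],
    )

    # B1 and B2: two per-state aggregates derived from the same source B
    humans = (
        persons.where("kind = 'human'")
        .groupBy("state")
        .count()
        .withColumnRenamed("count", "humans")
    )
    zombies = (
        persons.where("kind = 'zombie'")
        .groupBy("state")
        .count()
        .withColumnRenamed("count", "zombies")
    )

    # Join A to B1 and B2 on the state column. how="inner" mirrors the
    # report; how="left_outer" keeps states that lack one of the aggregates.
    return (
        states
        .join(humans, on=states["state"] == humans["state"], how=how)
        .join(zombies, on=states["state"] == zombies["state"], how=how)
    )


# Usage (requires a local Spark installation):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[1]").getOrCreate()
#   # Optional: the setting the error message suggests
#   # spark.conf.set("spark.sql.crossJoin.enabled", "true")
#   build_joined(spark).show()
```

      Passing how="left_outer" corresponds to the join the description says would be more appropriate, since it keeps states for which one of the aggregates has no rows.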

        Attachments

        1. zombie-analysis.py
          2 kB
          Nicholas Chammas
        2. states.csv
          0.1 kB
          Nicholas Chammas
        3. persons.csv
          0.1 kB
          Nicholas Chammas
        4. output-without-implicit-cross-join.txt
          18 kB
          Nicholas Chammas
        5. output-with-implicit-cross-join.txt
          13 kB
          Nicholas Chammas
        6. expected-output.txt
          0.3 kB
          Nicholas Chammas

              People

              • Assignee:
                Unassigned
                Reporter:
                nchammas Nicholas Chammas
              • Votes:
                0
                Watchers:
                7

                Dates

                • Created:
                  Updated: