Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35652

Different Behaviour join vs joinWith in self joining

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 3.1.2
    • 3.2.0, 3.1.3, 3.0.4
    • SQL
    • None
    • Spark 3.1.2

      Scala 2.12

    Description

      It seems like spark inner join is performing a cartesian join in self joining using `joinWith` and an inner join using `join` 

      Snippet:

       

      scala> val df = spark.range(0,5) 
      df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
      
      scala> df.show 
      +---+ 
      | id| 
      +---+ 
      | 0|
      | 1| 
      | 2| 
      | 3| 
      | 4| 
      +---+ 
      
      scala> df.join(df, df("id") === df("id")).count 
      21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases. 
      res21: Long = 5
      
      scala> df.joinWith(df, df("id") === df("id")).count
      21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases. 
      res22: Long = 25
      
      

      According to the comment in code source, joinWith is expected to manage this case, right?

      def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
          // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
          // etc.
      

      I find it weird that join and joinWith haven't the same behaviour.

      Attachments

        Activity

          People

            dc-heros dgd_contributor
            walmaaoui Wassim Almaaoui
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: