Spark / SPARK-29682

Failure when resolving conflicting references in Join:


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.3, 2.2.3, 2.3.4, 2.4.3
    • Fix Version/s: 2.4.5, 3.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When I self-join a parent DataFrame (parentDf) with multiple child DataFrames (childDf1, childDf2, ...), where the child DataFrames are derived from a cube or rollup and then filtered on the grouping id, I get the error:

      Failure when resolving conflicting references in Join:

      followed by a long plan dump that is quite unreadable. On the other hand, if I replace cube or rollup with a plain groupBy, the same joins work without issue.

       

      Sample code: 

      // Runs as-is in spark-shell; in a compiled application you also need:
      // import org.apache.spark.sql.functions.{col, grouping_id, lit, max}
      // import spark.implicits._
      val numsDF = sc.parallelize(Seq(1, 2, 3, 4, 5, 6)).toDF("nums")
      
      val cubeDF = numsDF
          .cube("nums")
          .agg(
              max(lit(0)).as("agcol"),
              grouping_id().as("gid")
          )
          
      val group0 = cubeDF.filter(col("gid") <=> lit(0))
      val group1 = cubeDF.filter(col("gid") <=> lit(1))
      
      cubeDF.printSchema
      group0.printSchema
      group1.printSchema
      
      // Recreating cubeDF from its filtered children: this join fails to analyze
      cubeDF.select("nums").distinct
          .join(group0, Seq("nums"), "inner")
          .join(group1, Seq("nums"), "inner")
          .show

      Sample output:

      numsDF: org.apache.spark.sql.DataFrame = [nums: int]
      cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more field]
      group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
      group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
      root
       |-- nums: integer (nullable = true)
       |-- agcol: integer (nullable = true)
       |-- gid: integer (nullable = false)
      root
       |-- nums: integer (nullable = true)
       |-- agcol: integer (nullable = true)
       |-- gid: integer (nullable = false)
      root
       |-- nums: integer (nullable = true)
       |-- agcol: integer (nullable = true)
       |-- gid: integer (nullable = false)
      org.apache.spark.sql.AnalysisException:
      Failure when resolving conflicting references in Join:
      'Join Inner
      :- Deduplicate [nums#220]
      :  +- Project [nums#220]
      :     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      :        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
      :           +- Project [nums#212, nums#212 AS nums#219]
      :              +- Project [value#210 AS nums#212]
      :                 +- SerializeFromObject [input[0, int, false] AS value#210]
      :                    +- ExternalRDD [obj#209]
      +- Filter (gid#217 <=> 0)
         +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
            +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
               +- Project [nums#212, nums#212 AS nums#219]
                  +- Project [value#210 AS nums#212]
                     +- SerializeFromObject [input[0, int, false] AS value#210]
                        +- ExternalRDD [obj#209]
      Conflicting attributes: nums#220
      ;;
      'Join Inner
      :- Deduplicate [nums#220]
      :  +- Project [nums#220]
      :     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      :        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
      :           +- Project [nums#212, nums#212 AS nums#219]
      :              +- Project [value#210 AS nums#212]
      :                 +- SerializeFromObject [input[0, int, false] AS value#210]
      :                    +- ExternalRDD [obj#209]
      +- Filter (gid#217 <=> 0)
         +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
            +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
               +- Project [nums#212, nums#212 AS nums#219]
                  +- Project [value#210 AS nums#212]
                     +- SerializeFromObject [input[0, int, false] AS value#210]
                        +- ExternalRDD [obj#209]
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:96)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:109)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:202)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:106)
        at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
        at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
        at org.apache.spark.sql.Dataset.join(Dataset.scala:939)
        ... 46 elided
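
      The error report above shows the analyzer finding the same attribute (nums#220) on both sides of the join. Two sketches follow, reusing the names from the sample code: the first is the working groupBy variant described above; the second is an assumed workaround (not the upstream fix, and not verified against the patch) that re-aliases the columns of each child so the analyzer assigns fresh expression ids.

      // Variant 1: plain groupBy instead of cube -- works as described above.
      val groupedDF = numsDF
          .groupBy("nums")
          .agg(max(lit(0)).as("agcol"))

      groupedDF.select("nums").distinct
          .join(groupedDF, Seq("nums"), "inner")
          .show

      // Variant 2 (assumed workaround): select every column through a fresh
      // alias; each Alias gets a new expression id, so the cube-derived
      // children no longer share nums#220 with the left side of the join.
      val group0Fresh = group0.select(group0.columns.map(c => col(c).as(c)): _*)
      val group1Fresh = group1.select(group1.columns.map(c => col(c).as(c)): _*)

      cubeDF.select("nums").distinct
          .join(group0Fresh, Seq("nums"), "inner")
          .join(group1Fresh, Seq("nums"), "inner")
          .show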
      


            People

            • Assignee: Terry Kim (imback82)
            • Reporter: sandeshyapuram
