[SPARK-21380] Join with Columns thinks inner join is cross join even when aliased


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.1.0, 2.1.1
    • Fix Version/s: None
    • Component/s: Optimizer, SQL

    Description

      While this seemed to work in Spark 2.0.2, it fails in 2.1.0 and 2.1.1.

      Even after aliasing both the table names and all the columns, joining Datasets with a criterion assembled from Columns (rather than with the join(..., usingColumns) method variants) fails, complaining that the join is a cross join / Cartesian product even when it isn't.

      Example:

          Dataset<Row> left = spark.sql("select 'bob' as name, 23 as age");
          // Alias the Dataset and every column on the left side.
          left = left
              .alias("l")
              .select(
                  left.col("name").as("l_name"),
                  left.col("age").as("l_age"));
      
          Dataset<Row> right = spark.sql("select 'bob' as name, 'bobco' as company");
          // Same on the right side. (Note: "company" is aliased to "r_age", not
          // "r_company"; the optimized plan in the stack trace below reflects this.)
          right = right
              .alias("r")
              .select(
                  right.col("name").as("r_name"),
                  right.col("company").as("r_age"));
      
          // Inner join on a Column-based equality criterion.
          Dataset<Row> result = left.join(
              right,
              left.col("l_name").equalTo(right.col("r_name")),
              "inner");
      
          result.show();
      

      Results in

      org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
      Project [bob AS l_name#22, 23 AS l_age#23]
      +- OneRowRelation$
      and
      Project [bob AS r_name#33, bobco AS r_age#34]
      +- OneRowRelation$
      Join condition is missing or trivial.
      Use the CROSS JOIN syntax to allow cartesian products between these relations.;
      
      	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1067)
      	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1064)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
      	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1064)
      	at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1049)
      	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
      	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
      	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
      	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
      	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
      	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
      	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
      	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
      	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
      	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2814)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
      	at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
      	at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
      	at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
      	at com.nuna.platform.common.spark.util.JoinBuilderIntegrationTest.testSimpleJoin(JoinBuilderIntegrationTest.java:129)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      	at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
      	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
      	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
      	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
      	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
      	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
      	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
      	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
      	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
      	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
      	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
      
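      If the Cartesian product is actually acceptable (each side here is a single row), the error message's own suggestion can be followed. A minimal sketch of the two escape hatches, both of which exist in Spark 2.1 (assumes a static import of org.apache.spark.sql.functions.col):

          // Option 1: allow Catalyst to plan Cartesian products for the session.
          spark.conf().set("spark.sql.crossJoin.enabled", "true");
      
          // Option 2: request the product explicitly (Dataset.crossJoin was added
          // in Spark 2.1) and apply the equality as a post-join filter instead.
          Dataset<Row> result = left.crossJoin(right)
              .filter(col("l_name").equalTo(col("r_name")));
          result.show();
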

      This is related to other issues like SPARK-14854.

      I feel like in many of these cases, Spark shouldn't be considering these joins Cartesian products. Usually I run across this when one table is derived from another, but in this case it happens even when the two tables have fully distinct lineages.
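
      The "Join condition is missing or trivial" line hints at what is probably happening: both inputs are literal one-row projections, so constant folding can reduce l_name and r_name to the literal 'bob', turning the equality into a constant true and leaving CheckCartesianProducts looking at an inner join with no effective condition. A sketch that would test this theory, assuming non-foldable columns avoid the check (and that org.apache.spark.sql.Encoders is imported), is to build the same rows from parallelized data instead of literal SELECTs:

          // Dataset<String> columns are named "value" by default; rename them to
          // match the aliases used above.
          Dataset<Row> left = spark
              .createDataset(java.util.Arrays.asList("bob"), Encoders.STRING())
              .withColumnRenamed("value", "l_name");
          Dataset<Row> right = spark
              .createDataset(java.util.Arrays.asList("bob"), Encoders.STRING())
              .withColumnRenamed("value", "r_name");
      
          // The same Column-based inner join as above; the condition now references
          // real (non-constant) attributes and cannot be folded away.
          left.join(
                  right,
                  left.col("l_name").equalTo(right.col("r_name")),
                  "inner")
              .show();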

People

    • Assignee: Unassigned
    • Reporter: Everett Toews (everett)
    • Votes: 0
    • Watchers: 3
