Spark / SPARK-10925

Exception when joining DataFrames

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: Tested with Spark 1.5.0 and Spark 1.5.1

    Description

      I get an exception when joining a DataFrame with another DataFrame. The second DataFrame was created by performing an aggregation on the first DataFrame.

      My complete workflow is:

      1. read the DataFrame
      2. apply a UDF to column "name"
      3. apply a UDF to column "surname"
      4. apply a UDF to column "birthDate"
      5. aggregate on "name" and re-join with the DF
      6. aggregate on "surname" and re-join with the DF

      If I remove one step, the process completes normally.
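
      For readers without the attachments, the six steps above can be sketched roughly as follows. This is a minimal illustration written against the Spark 1.5-era SQLContext API, not the attached TestCase2.scala: the Person case class, the sample row, the trimming UDF, and the aggregate column names are all assumptions, and it has not been verified that this exact code triggers the exception.

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.functions.{col, count, udf}

      // Hypothetical record type; field names follow the columns named in this report.
      case class Person(id: String, name: String, surname: String, birthDate: String)

      object JoinWorkflowSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "join-workflow-sketch")
          val sqlContext = new SQLContext(sc)

          // Step 1: read the DataFrame (built here from one made-up row).
          val df = sqlContext.createDataFrame(
            Seq(Person("1", " Jane ", " Doe ", "01/01/1970")))

          // Hypothetical cleaning UDF standing in for the real ones.
          val clean = udf((s: String) => s.trim)

          // Steps 2-4: apply a UDF to "name", "surname" and "birthDate".
          val cleaned = df
            .withColumn("name", clean(col("name")))
            .withColumn("surname", clean(col("surname")))
            .withColumn("birthDate_cleaned", clean(col("birthDate")))

          // Step 5: aggregate on "name" and join the result back.
          val byName = cleaned.groupBy("name").agg(count("name").as("name_count"))
          val withNameCount = cleaned.join(byName, cleaned("name") === byName("name"))

          // Step 6: aggregate on "surname" and join the result back.
          val bySurname =
            withNameCount.groupBy("surname").agg(count("surname").as("surname_count"))
          val result = withNameCount.join(
            bySurname, withNameCount("surname") === bySurname("surname"))

          result.show()
          sc.stop()
        }
      }
      ```

      Both joins here are self-joins, because each aggregated DataFrame is derived from the one it is joined back to. That matches the situation described in this report.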

      Here is the exception:

      Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS birthDate_cleaned#8];
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
      	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
      	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
      	at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
      	at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
      	at TestCase2$.main(TestCase2.scala:51)
      	at TestCase2.main(TestCase2.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:497)
      	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
      

      I'm attaching a test case that I ran against Spark 1.5.0 and 1.5.1. Note that the same code worked with version 1.4.1.

      Attachments

        1. TestCase2.scala
          2 kB
          Alexis Seigneurin
        2. TestCase.scala
          2 kB
          Haifeng Li
        3. Photo 05-10-2015 14 31 16.jpg
          171 kB
          Alexis Seigneurin

            People

              Assignee: Unassigned
              Reporter: Alexis Seigneurin (aseigneurin)
              Votes: 14
              Watchers: 32
