Details
- Bug
- Status: Resolved
- Minor
- Resolution: Incomplete
- 2.0.0
- None
Description
We found a change of behavior in Spark 2.0.0 that breaks a query in our code base.
The following works on previous Spark versions, from 1.6.1 up to 2.0.0-preview:
val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", "dfb.id"))
but fails on Spark 2.0.0 with the following exception:
Cannot resolve column name "dfa.id" among (id, a, id, b);
org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" among (id, a, id, b);
  at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
  at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
  at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
  at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
  at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
  at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
  at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
  ...
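A possible workaround, sketched here as an assumption rather than a fix confirmed in this report: give the two join keys distinct plain column names before joining, so that dropDuplicates does not need alias-qualified names. The column name "b_id" and the local SparkSession setup are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dropDuplicates-sketch")
  .getOrCreate()
import spark.implicits._

val dfa = Seq((1, 2), (2, 3)).toDF("id", "a")

// Rename the right-hand key to a hypothetical unique name ("b_id") so the
// joined result has no two columns called "id".
val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").withColumnRenamed("id", "b_id")

// dropDuplicates is now given plain, unambiguous column names.
val deduped = dfa
  .join(dfb, dfa("id") === dfb("b_id"))
  .dropDuplicates(Seq("id", "b_id"))

deduped.show()
```

With the sample data above, the join yields two rows that share the same (id, b_id) pair, so one row remains after deduplication; which of the two is kept is not guaranteed.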
Issue Links
- is duplicated by
  - SPARK-17037 distinct() operator fails on Dataframe with column names containing periods (Resolved)