Description
The following two queries should be equivalent, but the second fails with an AnalysisException:
sqlContext.read.json(sqlContext.sparkContext.makeRDD(
  """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
  .registerTempTable("nestedOrder")

checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
Here's the stacktrace:
Cannot resolve column name "a.b" among (b);
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
  at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
  at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
  at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
  at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
  at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
  at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
  at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
  at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
Per marmbrus, the problem may be that DataFrame.resolve calls resolveQuoted, causing the nested field to be treated as a single field named a.b.
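To make the suspected distinction concrete, here is a toy sketch in plain Scala. It is not Spark's implementation: the Field, Leaf and Struct types and both resolver functions are hypothetical names invented for illustration, and it only models the difference between looking up "a.b" as one literal name and walking it as a dotted path into a nested struct.

```scala
// Toy model only; none of these names exist in Spark.
object ResolveSketch {
  sealed trait Field
  case class Leaf(value: Any) extends Field
  case class Struct(fields: Map[String, Field]) extends Field

  // Treats the whole string as a single column name, dots included.
  def resolveAsSingleName(schema: Struct, name: String): Option[Field] =
    schema.fields.get(name)

  // Splits on dots and walks into nested structs one part at a time.
  def resolveAsPath(schema: Struct, name: String): Option[Field] =
    name.split('.').foldLeft(Option[Field](schema)) {
      case (Some(Struct(fields)), part) => fields.get(part)
      case _                            => None
    }

  def main(args: Array[String]): Unit = {
    // Mirrors the shape of {"a": {"b": 1}} from the report.
    val schema = Struct(Map("a" -> Struct(Map("b" -> Leaf(1)))))
    println(resolveAsSingleName(schema, "a.b")) // None
    println(resolveAsPath(schema, "a.b"))       // Some(Leaf(1))
  }
}
```

In this model the single-name lookup finds nothing because no top-level field is literally called a.b, which is the kind of failure the hypothesis above describes. Whether this matches the actual code path in DataFrame.resolve would need to be confirmed against the Spark source.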
UPDATE: here's a shorter one-liner reproduction:
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))