  1. Spark
  2. SPARK-9323

DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      The following two queries should be equivalent, but the second fails with an AnalysisException:

      sqlContext.read.json(sqlContext.sparkContext.makeRDD(
          """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
        .registerTempTable("nestedOrder")
      checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
      checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
      

      Here's the stacktrace:

      Cannot resolve column name "a.b" among (b);
      org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
      	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
      	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
      	at scala.Option.getOrElse(Option.scala:120)
      	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
      	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
      	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
      	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
      	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      	at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
      	at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
      	at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
      

      Per marmbrus, the problem may be that DataFrame.resolve calls resolveQuoted, causing the nested field to be treated as a single field named a.b.
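
To illustrate the suspected cause, here is a minimal, Spark-free sketch of the two lookup styles. It is a toy model and not Spark's implementation: the `Field`, `Leaf`, `Struct`, and `ResolveSketch` names are hypothetical, and `resolveQuoted` only borrows its name from the description above. The `projected` schema assumes that `select("a.b")` leaves a single column named `b`, as the error message suggests.

```scala
// Toy model only: these types are hypothetical and are not Spark's classes.
sealed trait Field
case object Leaf extends Field
final case class Struct(children: Map[String, Field]) extends Field

object ResolveSketch {
  type Schema = Map[String, Field]

  // Schema of {"a": {"b": 1}} before any projection.
  val original: Schema = Map("a" -> Struct(Map("b" -> Leaf)))

  // Assumed schema after select("a.b"): the error message lists the columns as (b).
  val projected: Schema = Map("b" -> Leaf)

  // Quoted-style lookup: the whole string is one top-level column name.
  def resolveQuoted(schema: Schema, name: String): Option[Field] =
    schema.get(name)

  // Path-style lookup: split on dots and descend into nested structs.
  def resolvePath(schema: Schema, name: String): Option[Field] =
    name.split('.').toList match {
      case Nil => None
      case head :: tail =>
        tail.foldLeft(schema.get(head)) {
          case (Some(Struct(children)), part) => children.get(part)
          case _                              => None
        }
    }
}
```

In this model, `resolvePath(original, "a.b")` finds the nested field, while `resolveQuoted` finds nothing for "a.b" in either schema. Against `projected`, only the plain name "b" resolves, which mirrors the `Cannot resolve column name "a.b" among (b)` message.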

      UPDATE: here's a shorter one-liner reproduction:

          val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
          checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
      

          People

            Assignee: Michael Armbrust (marmbrus)
            Reporter: Josh Rosen (joshrosen)
            Votes: 0
            Watchers: 4
