Description
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")      // works
spark.range(1).withColumnRenamed("id", "x").sort('id)        // works
spark.range(1).withColumnRenamed("id", "x").sort("id")       // fails with:
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x); ...
It looks like the Dataset API functions that take a String use the basic resolver, which only looks at the columns of the top-level plan, whereas all the other ways of expressing an attribute are resolved lazily by the analyzer.
The reason why the first three calls work is explained in the Scaladoc for the analyzer rule ResolveMissingReferences:
/**
 * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
 * clause. This rule detects such queries and adds the required attributes to the original
 * projection, so that they will be available during sorting. Another projection is added to
 * remove these attributes after sorting.
 *
 * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
 */
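To make the two resolution paths concrete, here is a minimal self-contained sketch (not Spark's actual implementation; the names `Plan`, `sortByName`, and `sortByExpr` are invented for illustration). It models a plan that exposes its own output columns plus its child's, standing in for the attributes that ResolveMissingReferences can recover: the String overload resolves eagerly against the visible output only, while the expression overload defers to an analyzer-like step that may also consult the child and wrap the sort in projections.

```scala
// Illustrative model only -- not Spark internals.
object ResolutionSketch {
  // A "plan" with its visible output columns and its child's output columns.
  case class Plan(output: Set[String], childOutput: Set[String])

  // String overload: eager resolution against the visible output only,
  // mirroring the basic resolver used by the String-taking Dataset methods.
  def sortByName(plan: Plan, name: String): Either[String, String] =
    if (plan.output.contains(name)) Right(s"Sort($name)")
    else Left(s"""Cannot resolve column name "$name" among (${plan.output.mkString(", ")})""")

  // Column overload: resolution is deferred; an analyzer-like step may pull a
  // missing attribute from the child, sort on it, then project it away again.
  def sortByExpr(plan: Plan, name: String): Either[String, String] =
    if (plan.output.contains(name)) Right(s"Sort($name)")
    else if (plan.childOutput.contains(name))
      Right(s"Project(${plan.output.mkString(", ")}, Sort($name, Project(+$name)))")
    else Left(s"""Cannot resolve "$name"""")

  def main(args: Array[String]): Unit = {
    // range(1).withColumnRenamed("id", "x"): output is (x), child output is (id).
    val plan = Plan(output = Set("x"), childOutput = Set("id"))
    println(sortByName(plan, "id")) // Left: mirrors the AnalysisException
    println(sortByExpr(plan, "id")) // Right: the deferred path recovers "id"
  }
}
```

Under this model the String overload fails exactly where Spark does, while the deferred path succeeds by reaching into the child's attributes, which is the behavior the quoted rule documents.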
For consistency, it would be good to use the same attribute resolution mechanism everywhere.