[SPARK-21538] Attribute resolution inconsistency in Dataset API - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.2.1, 2.3.0
Component/s: SQL
Labels: None

Description

spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works
spark.range(1).withColumnRenamed("id", "x").sort($"id") // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
...

It looks like the Dataset API functions that take a String use the basic resolver, which only looks at the columns available at that level, whereas all other ways of expressing an attribute are resolved lazily by the analyzer.

The reason the first three calls work is explained in the docs for object ResolveMissingReferences:

  /**
   * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
   * clause.  This rule detects such queries and adds the required attributes to the original
   * projection, so that they will be available during sorting. Another projection is added to
   * remove these attributes after sorting.
   *
   * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
   */
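
To see this rule at work, the following sketch prints the plans for the sort(col("id")) case from the description. It assumes a local SparkSession; the object name and app name are made up for illustration. The analyzed plan is expected to carry id through the projection for the sort and then drop it again in an outer projection, though the exact plan text varies between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Illustrative object name; not part of Spark.
object MissingReferenceDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local, single-threaded session is enough for this demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-21538-demo")
      .getOrCreate()

    // After the rename, the projection exposes only "x".
    val renamed = spark.range(1).withColumnRenamed("id", "x")

    // Sorting by a Column lets the analyzer look below the projection for "id".
    val sorted = renamed.sort(col("id"))

    // Prints the parsed, analyzed, optimized and physical plans.
    sorted.explain(true)

    // The helper attribute should not leak into the result schema.
    println(sorted.columns.mkString(", "))

    spark.stop()
  }
}
```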

For consistency, it would be good to use the same attribute resolution mechanism everywhere.
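
On an affected version, a caller could route column names through functions.col, which builds an unresolved attribute and leaves resolution to the analyzer. The sketch below is a user-side workaround under that assumption, not the change made for this issue; sortByNames and the object name are hypothetical and not part of the Dataset API.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import scala.util.Try

// Illustrative object name; not part of Spark.
object SortByNameWorkaround {

  // Hypothetical helper: turn each name into a Column and use the
  // Column-based sort overload instead of the String-based one.
  def sortByNames[T](ds: Dataset[T], first: String, rest: String*): Dataset[T] =
    ds.sort((first +: rest).map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-21538-workaround")
      .getOrCreate()

    val renamed = spark.range(1).withColumnRenamed("id", "x")

    // The String overload: whether this succeeds depends on the Spark version.
    val viaString = Try(renamed.sort("id"))
    println("String overload succeeded: " + viaString.isSuccess)

    // The helper goes through col(...), like the first call in the description.
    val viaHelper = sortByNames(renamed, "id")
    println(viaHelper.columns.mkString(", "))

    spark.stop()
  }
}
```

Wrapping the String call in Try keeps the demo running on versions where it throws AnalysisException, so the two paths can be compared in one run.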

People

Assignee: Anton Okolnychyi

Reporter: Adrian Ionescu

Votes: 0

Watchers: 3

Dates

Created: 26/Jul/17 11:17

Updated: 27/Jul/17 23:51

Resolved: 27/Jul/17 23:51