Spark / SPARK-21538

Attribute resolution inconsistency in Dataset API

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      import org.apache.spark.sql.functions.col
      import spark.implicits._  // enables the $"id" and 'id column syntax

      spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
      spark.range(1).withColumnRenamed("id", "x").sort($"id")      // works
      spark.range(1).withColumnRenamed("id", "x").sort('id)        // works
      spark.range(1).withColumnRenamed("id", "x").sort("id")       // fails with:
      // org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
      // ...
      

      It looks like the Dataset API functions that take a String use a basic resolver that looks only at the columns available at that level, whereas all other ways of expressing an attribute are resolved lazily by the analyzer.
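
The difference between the two lookup strategies can be sketched with a toy model in plain Scala. This is only an illustration under stated assumptions: the Plan, Source and Project types and both resolve functions are hypothetical and are not Spark's classes or its actual resolution code.

```scala
// Toy model only: these types are hypothetical, not Spark's logical plan classes.
sealed trait Plan { def output: Seq[String] }
case class Source(output: Seq[String]) extends Plan
// Each pair is (output name, child column it is computed from).
case class Project(columns: Seq[(String, String)], child: Plan) extends Plan {
  def output: Seq[String] = columns.map(_._1)
}

object ResolutionSketch {
  // String-style lookup: only the plan's own output is consulted.
  def resolveEager(name: String, plan: Plan): Either[String, String] = {
    val available = plan.output.mkString(", ")
    if (plan.output.contains(name)) Right(name)
    else Left(s"Cannot resolve column name '$name' among ($available)")
  }

  // Deferred lookup: if the name is missing from the output, fall back to the child.
  def resolveDeferred(name: String, plan: Plan): Either[String, String] =
    plan match {
      case _ if plan.output.contains(name) => Right(name)
      case Project(_, child)               => resolveDeferred(name, child)
      case _                               => Left(s"Cannot resolve column name '$name'")
    }

  def main(args: Array[String]): Unit = {
    // Models spark.range(1).withColumnRenamed("id", "x")
    val renamed = Project(Seq("x" -> "id"), Source(Seq("id")))
    println(resolveEager("id", renamed))    // a Left: "id" is not among (x)
    println(resolveDeferred("id", renamed)) // a Right: "id" is found in the child
  }
}
```

In this model the eager lookup fails for the same reason as sort("id") above, while the deferred lookup succeeds because it is allowed to look below the renaming projection.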

      The reason why the first 3 calls work is explained in the docs for object ResolveMissingReferences:

        /**
         * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
         * clause.  This rule detects such queries and adds the required attributes to the original
         * projection, so that they will be available during sorting. Another projection is added to
         * remove these attributes after sorting.
         *
         * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
         */
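
To make the rewrite described in that comment concrete, here is a hedged sketch that spells out by hand roughly what the rule arranges for the working calls: keep the missing attribute in the projection, sort, then project it away. The object name, method names, local master setting and app name are assumptions chosen for illustration; the plan Spark actually produces may differ in detail.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MissingReferenceSketch {
  // What the user writes: sort by a column that the rename has hidden.
  def implicitForm(spark: SparkSession): DataFrame =
    spark.range(3)
      .withColumnRenamed("id", "x")
      .sort(col("id"))

  // Hand-written approximation of the rewrite: carry "id" through the
  // projection so the sort can see it, then drop it again afterwards.
  def explicitForm(spark: SparkSession): DataFrame =
    spark.range(3)
      .select(col("id").as("x"), col("id"))
      .sort(col("id"))
      .select(col("x"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-21538-sketch")
      .getOrCreate()

    // Compare the analyzed plans of the two forms.
    implicitForm(spark).explain(true)
    explicitForm(spark).explain(true)

    spark.stop()
  }
}
```

Comparing the two explain outputs is one way to check how closely the hand-written form matches what the analyzer does on a given Spark version.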
      

      For consistency, it would be good to use the same attribute resolution mechanism everywhere.

          People

            Assignee: Anton Okolnychyi (aokolnychyi)
            Reporter: Adrian Ionescu (a.ionescu)
            Votes: 0
            Watchers: 3
