Details
Description
Suppose we have a table:
CREATE TABLE DEMO_DATA (
  ID    VARCHAR(10),
  NAME  VARCHAR(10),
  BATCH VARCHAR(10),
  TEAM  VARCHAR(1)
) STORED AS PARQUET;
and some data in it:
0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_DATA T;
+-------+---------+-------------+---------+
| t.id  | t.name  | t.batch     | t.team  |
+-------+---------+-------------+---------+
| 1     | mike    | 2020-07-08  | A       |
| 2     | john    | 2020-07-07  | B       |
| 3     | rose    | 2020-07-06  | B       |
| ....                                    |
+-------+---------+-------------+---------+
If I put a query hint in VA or VB and run the query in spark-shell:
sql("""
  WITH VA AS (
    SELECT T.ID, T.NAME, T.BATCH, T.TEAM FROM DEMO_DATA T WHERE T.TEAM = 'A'
  ),
  VB AS (
    SELECT /*+ REPARTITION(3) */ T.ID, T.NAME, T.BATCH, T.TEAM FROM VA T
  )
  SELECT T.ID, T.NAME, T.BATCH, T.TEAM FROM VB T
""").show
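As a possible workaround (a sketch only, not verified against Spark 3.0.0), the same repartitioning can be expressed through the DataFrame API, which bypasses SQL hint resolution entirely. This assumes a spark-shell session with a `spark` SparkSession and the DEMO_DATA table from above:

```scala
// Sketch of a workaround: build VA/VB with the DataFrame API so no SQL
// hint has to be parsed. Assumes a running spark-shell (`spark`, `sql`)
// and the DEMO_DATA table defined earlier.
import spark.implicits._

val va = spark.table("DEMO_DATA").where($"TEAM" === "A")
val vb = va.repartition(3)  // DataFrame equivalent of /*+ REPARTITION(3) */
vb.select("ID", "NAME", "BATCH", "TEAM").show()
```

`Dataset.repartition(n)` inserts the same round-robin repartition node that the REPARTITION hint is meant to produce, so the physical plan should be comparable.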
In Spark 2.4.4 this works fine.
But in Spark 3.0.0 it throws an AnalysisException, preceded by an "Unrecognized hint" warning:
20/07/09 13:51:14 WARN analysis.HintErrorLogger: Unrecognized hint: REPARTITION(3)
org.apache.spark.sql.AnalysisException: cannot resolve '`T.ID`' given input columns: [T.BATCH, T.ID, T.NAME, T.TEAM]; line 8 pos 7;
'Project ['T.ID, 'T.NAME, 'T.BATCH, 'T.TEAM]
+- SubqueryAlias T
   +- SubqueryAlias VB
      +- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
         +- SubqueryAlias T
            +- SubqueryAlias VA
               +- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
                  +- Filter (TEAM#3 = A)
                     +- SubqueryAlias T
                        +- SubqueryAlias spark_catalog.default.demo_data
                           +- Relation[ID#0,NAME#1,BATCH#2,TEAM#3] parquet
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:143)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:106)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:118)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:118)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:129)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:134)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:134)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:139)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:139)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:106)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:140)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
  ... 56 elided
I think the analysis procedure should not be disturbed even if a hint cannot be recognized: an unrecognized hint should be dropped with a warning, not cause column resolution to fail.
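Since the failure appears specific to hints placed inside CTE definitions, rewriting the WITH clause as an inline subquery and attaching the hint to the outer SELECT may avoid it. This is a sketch under that assumption, not verified on Spark 3.0.0:

```scala
// Sketch: same query without a WITH clause, so the hint sits on the
// outer SELECT rather than inside a CTE definition. Assumes a running
// spark-shell session and the DEMO_DATA table defined earlier.
sql("""
  SELECT /*+ REPARTITION(3) */ T.ID, T.NAME, T.BATCH, T.TEAM
  FROM (SELECT ID, NAME, BATCH, TEAM FROM DEMO_DATA WHERE TEAM = 'A') T
""").show
```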
Attachments
Issue Links
- is duplicated by
  - SPARK-32535 Query with broadcast hints fail when query has a WITH clause (Resolved)
  - SPARK-32347 BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4) (Closed)
- links to