Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.3.7, 3.1.2, 4.0.0
Description
With complex multi-stage CTEs, Hive can start a latter stage before its previous stage finishes.
That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve dependency between multistage materialized CTEs when a non-materialized CTE cuts in.
https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414
For example, when submitting this query,
SET hive.optimize.cte.materialize.threshold=2; SET hive.optimize.cte.materialize.full.aggregate.only=false; WITH x AS ( SELECT 'x' AS id ), -- not materialized a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root SELECT * FROM a1 UNION ALL SELECT * FROM x UNION ALL SELECT * FROM a2 UNION ALL SELECT * FROM a2;
`toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It means the dependency between `a1` and `a2` will be ignored and `a2` can start without waiting for `a1`. As a result, the above query returns the following result.
+-----+ | id | +-----+ | a1 | | x | +-----+
For your information, I ran this test with revision = 425e1ff7c054f87c4db87e77d004282d529599ae.