Details
Improvement
Status: Resolved
Major
Resolution: Fixed
3.4.0
None
Description
With SPARK-39748, Spark includes the origin logical plan in LogicalRDD when it is created from a DataFrame, in order to carry over stats and to provide information that could connect two otherwise disconnected logical plans into one.
After introducing the change, we found several issues:
1. One of the major use cases for DataFrame.checkpoint is ML, especially iterative algorithms, where the purpose of the checkpoint is to "prune" the logical plan. Including the origin logical plan works against that purpose, and it risks nesting LogicalRDDs so that the logical plan grows without bound.
2. We use the logical plan to carry over stats, but the correct stats information is in the optimized plan.
3. (Not an issue, but a missed opportunity) constraints are also something we can carry over.
To address the issues above, it would be better to include stats and constraints in LogicalRDD rather than the logical plan.
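The difference between the two approaches can be sketched with a toy model in plain Python. This is only an illustration under stated assumptions: `Plan`, `checkpoint_keep_plan`, `checkpoint_keep_stats` and `iterate` are hypothetical names, not Spark classes or APIs, and `row_count` stands in for the stats that would come from the optimized plan.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node (hypothetical, not Spark's LogicalPlan)."""
    name: str
    child: Optional["Plan"] = None
    row_count: int = 0                       # stand-in for optimized-plan stats
    constraints: FrozenSet[str] = frozenset()

    def size(self) -> int:
        """Number of nodes reachable from this node, including itself."""
        return 1 + (self.child.size() if self.child else 0)


def checkpoint_keep_plan(plan: Plan) -> Plan:
    # Models keeping the origin plan: the new node still references it.
    return Plan("LogicalRDD", child=plan,
                row_count=plan.row_count, constraints=plan.constraints)


def checkpoint_keep_stats(plan: Plan) -> Plan:
    # Models the proposal: copy only stats and constraints, drop the plan.
    return Plan("LogicalRDD",
                row_count=plan.row_count, constraints=plan.constraints)


def iterate(checkpoint: Callable[[Plan], Plan], rounds: int) -> Plan:
    """Simulate an iterative algorithm: one transformation, then a checkpoint."""
    plan = Plan("Scan", row_count=100,
                constraints=frozenset({"x IS NOT NULL"}))
    for _ in range(rounds):
        plan = Plan("Filter", child=plan,
                    row_count=plan.row_count, constraints=plan.constraints)
        plan = checkpoint(plan)
    return plan


if __name__ == "__main__":
    print(iterate(checkpoint_keep_plan, 5).size())   # grows with each round
    print(iterate(checkpoint_keep_stats, 5).size())  # stays at a single node
```

In this toy model, keeping the origin plan adds two nodes per round (the transformation and the checkpoint node), so the plan reaches 11 nodes after 5 rounds, while the stats-and-constraints variant stays at one node and still carries the row count and constraints forward.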