  1. Spark
  2. SPARK-39834

Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: SQL, Structured Streaming
    • Labels: None

    Description

      With SPARK-39748, Spark includes the origin logical plan in LogicalRDD when it is created from a DataFrame. The goal was to carry over stats and to provide information that could later connect two disconnected logical plans into one.

      After introducing that change, we found several issues:

      1. One major use case for DataFrame.checkpoint is ML, especially iterative algorithms, where the purpose of checkpointing is to "prune" the logical plan. Including the origin logical plan works against that purpose, and it risks producing nested LogicalRDDs that make the logical plan grow without bound.

      2. We rely on the logical plan to carry over stats, but the correct stats are in the optimized plan.

      3. (Not an issue, but a missed opportunity) Constraints are also something we can carry over.
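
The growth described in issue 1 can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark code: `Node`, `depth`, `checkpoint_keeping_origin`, `checkpoint_pruning`, and `iterate` are hypothetical names invented for the illustration, and a single `ref` field stands in for both a child plan and a retained origin plan.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """Toy plan node (hypothetical, not Spark's LogicalPlan)."""
    name: str
    # Either a child plan or, for the LogicalRDD stand-in, the retained origin plan.
    ref: Optional["Node"] = None


def depth(plan: Optional[Node]) -> int:
    """Count how many nodes stay reachable from the given plan."""
    count = 0
    while plan is not None:
        count += 1
        plan = plan.ref
    return count


def checkpoint_keeping_origin(plan: Node) -> Node:
    # Assumption: models a leaf that keeps a reference to the plan it came from.
    return Node("LogicalRDD", ref=plan)


def checkpoint_pruning(plan: Node) -> Node:
    # Models a leaf that drops the plan it came from, which is what "pruning" means here.
    return Node("LogicalRDD")


def iterate(checkpoint: Callable[[Node], Node], rounds: int) -> Node:
    """Simulate an iterative algorithm that checkpoints after every step."""
    plan = Node("Source")
    for _ in range(rounds):
        plan = Node("Project", ref=plan)  # one transformation per iteration
        plan = checkpoint(plan)
    return plan


kept = depth(iterate(checkpoint_keeping_origin, 5))
pruned = depth(iterate(checkpoint_pruning, 5))
print(kept, pruned)
```

In this model the retained chain gains two nodes per iteration when the origin plan is kept, while the pruned variant stays at a single node regardless of the number of iterations.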

      To address the issues above, it would be better to include stats and constraints in LogicalRDD rather than the logical plan.
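
The proposal can be sketched in the same toy style. This is an assumption-laden illustration in plain Python, not the actual Spark change: `Stats`, `OptimizedPlan`, `LogicalRDDSketch`, `from_dataframe_plan`, and the field names `origin_stats` and `origin_constraints` are hypothetical. The point is only that the leaf copies the two values out of the optimized plan and keeps no reference to the plan itself.

```python
from dataclasses import dataclass, fields
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Stats:
    """Toy statistics record (hypothetical)."""
    size_in_bytes: int
    row_count: Optional[int] = None


@dataclass(frozen=True)
class OptimizedPlan:
    """Stand-in for an optimized plan, where the accurate stats and constraints live."""
    description: str
    stats: Stats
    constraints: FrozenSet[str]


@dataclass(frozen=True)
class LogicalRDDSketch:
    """Stand-in leaf node: holds copied values only, no reference to the source plan."""
    origin_stats: Optional[Stats] = None
    origin_constraints: Optional[FrozenSet[str]] = None


def from_dataframe_plan(optimized: OptimizedPlan) -> LogicalRDDSketch:
    # Copy the two small values; the plan tree itself is not retained.
    return LogicalRDDSketch(optimized.stats, optimized.constraints)


optimized = OptimizedPlan(
    description="Filter(id > 0) over Source",
    stats=Stats(size_in_bytes=1024, row_count=10),
    constraints=frozenset({"id > 0", "isnotnull(id)"}),
)
leaf = from_dataframe_plan(optimized)
leaf_field_names = [f.name for f in fields(leaf)]
print(leaf.origin_stats, sorted(leaf.origin_constraints), leaf_field_names)
```

Because the leaf stores only the two copied values, repeated checkpoints in this model do not accumulate plan nodes, while the stats and constraints remain available to whatever consumes the leaf.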

          People

            Assignee: Jungtaek Lim (kabhwan)
            Reporter: Jungtaek Lim (kabhwan)
