Spark / SPARK-48481

OptimizeOneRowPlan should not be effective for streaming DataFrame

Details

    Description

      This bug was introduced in Apache Spark 3.3.0.

      OptimizeOneRowPlan aggressively rewrites or removes operators when the plan statistics indicate that the operator's input will contain at most one row.

      This is problematic for streaming because the optimizer sees only the part of the data in the current microbatch, not the whole stream, and it is not aware of this.
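
To make the failure mode concrete, here is a minimal toy model in plain Python. It is not Spark code: `run_stream` and `dedup_removed` are hypothetical names, and the sketch assumes a stateful streaming `distinct` whose micro-batches each happen to contain at most one row, so a per-batch "max 1 row" statistic would look true even though the stream as a whole has many rows.

```python
def run_stream(batches, dedup_removed):
    """Toy model of a streaming `distinct` evaluated one micro-batch at a time.

    dedup_removed=True imitates an optimizer that dropped the operator
    because each micro-batch was reported to hold at most one row.
    """
    seen = set()   # state that a streaming operator keeps across micro-batches
    output = []
    for batch in batches:
        if dedup_removed:
            # "Optimized" plan: rows pass straight through, state is never consulted.
            output.extend(batch)
        else:
            # Correct plan: emit a row only the first time it appears in the stream.
            for row in batch:
                if row not in seen:
                    seen.add(row)
                    output.append(row)
    return output


# Each micro-batch has exactly one row, but the stream as a whole has three.
batches = [["a"], ["a"], ["b"]]
correct = run_stream(batches, dedup_removed=False)
broken = run_stream(batches, dedup_removed=True)
```

In this model the correct plan emits `a` and `b` once each, while the plan with the operator removed emits the duplicate `a` again, because the decision was made from per-batch statistics only.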

      There are several viable approaches to deal with this, but fixing the rule so that it is effectively disabled for streaming DataFrames would probably have the smallest impact.
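
The shape of such a guard can be sketched with a toy plan tree in plain Python. This is an illustration under stated assumptions, not the actual patch: `Plan`, `optimize_one_row`, `max_rows`, and `is_streaming` are hypothetical stand-ins for the corresponding plan properties, and removing a `Sort` over a one-row child stands in for whatever rewrites the real rule performs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    """Hypothetical, simplified logical plan node (not Spark's class)."""
    op: str
    child: Optional["Plan"] = None
    max_rows: Optional[int] = None   # None means "unknown"
    is_streaming: bool = False


def optimize_one_row(plan: Plan) -> Plan:
    """Toy one-row rule that leaves streaming plans untouched."""
    if plan.child is None:
        return plan
    child = optimize_one_row(plan.child)
    rebuilt = Plan(plan.op, child, plan.max_rows, plan.is_streaming)
    # Guard: for a streaming plan, max_rows only describes the current
    # micro-batch, so the one-row rewrite is skipped entirely.
    if plan.is_streaming:
        return rebuilt
    # Batch case: sorting at most one row is a no-op, so drop the Sort.
    if plan.op == "Sort" and child.max_rows is not None and child.max_rows <= 1:
        return child
    return rebuilt


batch_plan = Plan("Sort", Plan("Scan", max_rows=1))
stream_plan = Plan("Sort", Plan("Scan", max_rows=1, is_streaming=True),
                   is_streaming=True)

optimized_batch = optimize_one_row(batch_plan)
optimized_stream = optimize_one_row(stream_plan)
```

With the guard in place, the batch plan collapses to its `Scan` child, while the streaming plan keeps its `Sort` even though the child reports at most one row.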

      (There is a separate, wider effort to achieve better stability between the query optimizer and streaming. Since that will take considerable time, we still need point fixes in the meantime.)

            People

              kabhwan Jungtaek Lim
