Spark / SPARK-49699

Disable PruneFilters for streaming workloads

Details

    Description

      PruneFilters replaces a filter whose condition always evaluates to null / false with an empty relation, which means the whole subtree under the filter is lost as well. The rule does not check which operators are in that subtree, so important operators such as stateful operators, watermark nodes, and observe nodes can be dropped.
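
      To make the effect concrete, here is a minimal toy model of the behavior described above. It is not Spark's implementation: the `Plan` class, the `prune_filters` function, and the operator names are hypothetical stand-ins, and a plan is simplified to a single chain of operators.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Toy single-child plan node (hypothetical, not a Spark class)."""
    name: str
    child: Optional["Plan"] = None
    # Only meaningful for Filter nodes; None models a SQL NULL condition.
    condition: Optional[bool] = True


def operators(plan):
    """List operator names from the root down to the leaf."""
    names = []
    while plan is not None:
        names.append(plan.name)
        plan = plan.child
    return names


def prune_filters(plan):
    """Toy version of the rule: a null / false filter becomes an empty relation."""
    if plan is None:
        return None
    if plan.name == "Filter" and plan.condition in (False, None):
        # The replacement has no child, so everything under the filter is gone.
        return Plan("EmptyRelation")
    return Plan(plan.name, prune_filters(plan.child), plan.condition)


query = Plan(
    "Project",
    Plan(
        "Filter",
        Plan("StatefulAggregate", Plan("Watermark", Plan("StreamingSource"))),
        condition=False,
    ),
)
pruned = prune_filters(query)
print(operators(query))
print(operators(pruned))
```

      In this sketch the pruned plan no longer contains the stateful operator or the watermark node, which is the loss the description refers to.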

      The filter can evaluate to null / false in some microbatches and not in others for various reasons (one simple example is modifying the query across a restart). This means a stateful operator might be absent from the plan for batch N and present for batch N + 1. In that case the streaming query fails because batch N + 1 cannot load the state from batch N, and in most cases the failure is not recoverable.

      We have to disable the rule for streaming workloads while keeping backward compatibility in mind: the change should avoid breaking existing queries.
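
      One way to picture such a guard is the toy sketch below. It is an assumption for illustration only, not the actual patch: the `Node` class, the `is_streaming` field, and the `allow_streaming_prune` opt-in (standing in for some backward-compatibility switch) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Toy single-child plan node (hypothetical, not a Spark class)."""
    name: str
    child: Optional["Node"] = None
    condition: Optional[bool] = True  # None models a SQL NULL condition
    is_streaming: bool = False


def names(node):
    """List operator names from the root down to the leaf."""
    out = []
    while node is not None:
        out.append(node.name)
        node = node.child
    return out


def guarded_prune(node, allow_streaming_prune=False):
    """Prune null / false filters, but leave streaming subtrees intact.

    allow_streaming_prune is a hypothetical opt-in that restores the old
    behavior for queries that depend on it.
    """
    if node is None:
        return None
    always_empty = node.name == "Filter" and node.condition in (False, None)
    if always_empty and (allow_streaming_prune or not node.is_streaming):
        return Node("EmptyRelation", is_streaming=node.is_streaming)
    return Node(
        node.name,
        guarded_prune(node.child, allow_streaming_prune),
        node.condition,
        node.is_streaming,
    )


streaming = Node(
    "Filter",
    Node("StatefulAggregate", Node("StreamingSource", is_streaming=True), is_streaming=True),
    condition=False,
    is_streaming=True,
)
batch = Node("Filter", Node("Scan"), condition=False)

print(names(guarded_prune(streaming)))
print(names(guarded_prune(batch)))
print(names(guarded_prune(streaming, allow_streaming_prune=True)))
```

      In this sketch the batch plan is still pruned, the streaming plan keeps its stateful operator by default, and the old behavior is only reachable through the explicit opt-in.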

          People

            n-young-db Nick Young