Spark / SPARK-37447

Cache LogicalPlan.isStreaming() in a lazy val


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.0
    • Component/s: Optimizer
    • Labels: None

    Description

      The default implementation of `LogicalPlan.isStreaming()` calls `children.exists(_.isStreaming)`. This can be expensive for large trees, so as a performance optimization I think we should cache the result in a private lazy val.

      This is especially important for programs that programmatically construct huge query plans, because doing so results in multiple analysis passes (and therefore multiple invocations of rules which call `isStreaming`). For example, the `isStreaming` check accounts for a significant portion of the time in `DeduplicateRelations` (> 20% in my local tests).
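
To illustrate the proposal, here is a minimal sketch on a toy tree. It is not Spark's `LogicalPlan` and not the merged patch: `Node`, `Leaf`, `Unary`, `chain`, `isStreamingUncached` and the `evaluations` counter are hypothetical names introduced only for this example. The counter is there purely to show how much work repeated calls do with and without the cached lazy val.

```scala
// Toy stand-in for a plan tree. All names here are hypothetical;
// this is NOT Spark's LogicalPlan.
object IsStreamingSketch {
  // Counts how many times the traversal logic runs on inner nodes.
  var evaluations: Long = 0L

  abstract class Node {
    def children: Seq[Node]

    // Uncached variant: every call re-walks the subtree below this node.
    def isStreamingUncached: Boolean = {
      evaluations += 1
      children.exists(_.isStreamingUncached)
    }

    // Cached variant: the traversal runs at most once per node and the
    // result is stored in a private lazy val.
    private lazy val _isStreaming: Boolean = {
      evaluations += 1
      children.exists(_.isStreaming)
    }

    def isStreaming: Boolean = _isStreaming
  }

  // Leaves know their own answer and do no traversal.
  final case class Leaf(streaming: Boolean) extends Node {
    override def children: Seq[Node] = Nil
    override def isStreamingUncached: Boolean = streaming
    override def isStreaming: Boolean = streaming
  }

  final case class Unary(child: Node) extends Node {
    override def children: Seq[Node] = Seq(child)
  }

  // Builds a chain of `depth` Unary nodes on top of a single leaf.
  def chain(depth: Int, streaming: Boolean): Node =
    (1 to depth).foldLeft[Node](Leaf(streaming))((child, _) => Unary(child))
}
```

In this sketch, calling the uncached method k times on the root of a non-streaming chain of depth n evaluates n inner nodes on every call, while the cached method evaluates each inner node once in total. Caching like this is only safe if a node's children never change after construction, which the toy classes guarantee by being immutable.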

          People

            Assignee: Josh Rosen (joshrosen)
            Reporter: Josh Rosen (joshrosen)
            Votes: 0
            Watchers: 3
