Spark / SPARK-47818

Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests


Details

    Description

      While building a DataFrame step by step, each transformation produces a new DataFrame whose schema is empty and lazily computed on access. If user code frequently accesses the schema of these new DataFrames through methods such as `df.columns`, it results in a large number of Analyze requests to the server. Each request reanalyzes the entire plan, leading to poor performance, especially when constructing highly complex plans.

      Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the overhead of repeated analysis during this process. Caching the resolved logical plan of a subtree saves significant computation whenever that subtree is analyzed again.

      A minimal example of the problem:

      import pyspark.sql.functions as F
      df = spark.range(10)
      for i in range(200):
        if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
          df = df.withColumn(str(i), F.col("id") + i)
      df.show()

      With this patch, the performance of the above code improved from ~110s to ~5s.
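      The caching idea can be illustrated with a minimal sketch. This is not Spark's actual implementation; the `PlanCache` class, the `analyze` function, and the cache size are illustrative assumptions. The sketch shows a bounded LRU cache that maps a plan subtree (here a hashable key) to its resolved result, so repeated analysis of the same subtree is served from the cache instead of being recomputed:

      ```python
      from collections import OrderedDict

      class PlanCache:
          """Bounded LRU cache from a plan subtree to its resolved plan.
          Illustrative sketch only, not Spark's implementation."""

          def __init__(self, max_size=16):
              self.max_size = max_size
              self._cache = OrderedDict()

          def get_or_resolve(self, key, resolve):
              if key in self._cache:
                  self._cache.move_to_end(key)  # mark as most recently used
                  return self._cache[key]
              result = resolve(key)             # expensive analysis on miss
              self._cache[key] = result
              if len(self._cache) > self.max_size:
                  self._cache.popitem(last=False)  # evict least recently used
              return result

      # Usage: repeated "analysis" of the same subtree hits the cache.
      calls = []

      def analyze(plan):
          calls.append(plan)  # stands in for expensive server-side analysis
          return f"resolved({plan})"

      cache = PlanCache(max_size=4)
      for _ in range(3):
          cache.get_or_resolve("range(10)", analyze)
      print(len(calls))  # the expensive analysis ran only once
      ```

      The cache is bounded so that long-running sessions building many distinct plans do not grow server-side memory without limit; eviction order follows least-recent use, which matches the step-by-step construction pattern where recent subtrees are the ones reused.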


          People

            Assignee: Xi Lyu
            Reporter: Xi Lyu
