[SPARK-40281] Memory Profiler on Executors - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: PySpark
Labels: None

Description

This ticket proposes implementing PySpark memory profiling on executors. See the design document for more details.

Many factors affect a PySpark program's performance. Memory is one of the key factors, yet it had been missing from PySpark profiling. A PySpark program on the Spark driver can be profiled with Memory Profiler like any normal Python process, but there was no easy way to profile memory on Spark executors.

PySpark UDFs, one of the most popular Python APIs, enable users to run custom code on top of the Apache Spark™ engine. However, it is difficult to optimize UDFs without understanding memory consumption.

This ticket proposes introducing the PySpark memory profiler, which profiles memory on executors. It reports total memory usage and pinpoints which lines of code in a UDF account for the most memory usage. This helps optimize PySpark UDFs and reduces the likelihood of out-of-memory errors.
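To make the idea of line-level memory attribution concrete, here is a minimal sketch using only Python's standard-library `tracemalloc` module. It is not the PySpark memory profiler and does not run on executors; the function `plus_one_udf` and the helper `profile_lines` are hypothetical names invented for illustration. It shows the kind of information described above: a total (peak) figure plus the source lines responsible for the memory still held after the call.

```python
import tracemalloc


def plus_one_udf(values):
    # Hypothetical stand-in for a UDF body (plain Python, no Spark involved).
    as_floats = [float(v) for v in values]   # intermediate list, freed on return
    return [v + 1.0 for v in as_floats]      # result list, still alive afterwards


def profile_lines(func, *args, top=3):
    """Run func under tracemalloc.

    Returns (result, peak_bytes, stats), where stats holds up to `top`
    per-line differences between snapshots taken before and after the call.
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = func(*args)
        after = tracemalloc.take_snapshot()
        # Peak traced memory since start(); this includes intermediates
        # that were already freed by the time `after` was taken.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    # Group allocations by source line, largest absolute change first.
    stats = after.compare_to(before, "lineno")
    return result, peak, stats[:top]


result, peak, stats = profile_lines(plus_one_udf, range(10_000))
print(f"peak traced memory: {peak / 1024:.1f} KiB")
for stat in stats:
    frame = stat.traceback[0]
    print(f"{frame.filename}:{frame.lineno}  {stat.size_diff / 1024:+.1f} KiB")
```

The per-line figures only cover memory still allocated when the second snapshot is taken, so short-lived intermediates show up in the peak figure rather than in the line breakdown. A dedicated line-by-line profiler samples memory as each line executes, which is a different technique from the snapshot comparison sketched here.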

Issue Links

links to

[Github] Pull Request #38584 (xinrong-meng)

Sub-Tasks

1.	Install memory-profiler in the CI	Resolved	Xinrong Meng
2.	Document debugging with PySpark memory profiler	Resolved	Xinrong Meng
3.	Skip MemoryProfilerTests when pandas is not installed	Resolved	Dongjoon Hyun

People

Assignee: Xinrong Meng

Reporter: Xinrong Meng

Votes: 2

Watchers: 10

Dates

Created: 30/Aug/22 23:18

Updated: 10/Jan/24 22:12

Resolved: 11/Nov/22 02:59