This ticket proposes implementing PySpark memory profiling on executors; see the design doc for more details.
Many factors affect a PySpark program’s performance. Memory, one of the key factors, has been missing from PySpark profiling. A PySpark program running on the Spark driver can be profiled with Memory Profiler like any normal Python process, but there has been no easy way to profile memory on Spark executors.
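To illustrate the kind of driver-side, per-line memory attribution described above, here is a minimal sketch using the standard library's `tracemalloc` module (used here as a stdlib stand-in for Memory Profiler; the function name `build_frames` and the allocation sizes are illustrative assumptions, not from the ticket):

```python
import tracemalloc

def build_frames():
    # allocate a large list so the allocation is visible in the snapshot
    data = [b"x" * 100 for _ in range(10_000)]
    return data

tracemalloc.start()
frames = build_frames()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# group allocations by source line, largest first --
# this is the per-line attribution a memory profiler reports
stats = snapshot.statistics("lineno")
for stat in stats[:3]:
    print(stat)
```

This works only for code running in the driver process; allocations inside UDFs happen in Python workers on the executors, which is the gap the ticket addresses.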
PySpark UDFs, one of the most popular Python APIs, enable users to run custom code on top of the Apache Spark™ engine. However, it is difficult to optimize UDFs without understanding memory consumption.
This ticket proposes introducing a PySpark memory profiler that profiles memory on executors. It reports total memory usage and pinpoints which lines of code in a UDF account for the most memory, which will help users optimize PySpark UDFs and reduce the likelihood of out-of-memory errors.
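Usage could look like the following sketch, assuming the configuration key `spark.python.profile.memory` and the existing `SparkContext.show_profiles()` reporting hook (the app name and the trivial `add_one` pandas UDF are illustrative only):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession.builder
    .appName("udf-memory-profiling")  # hypothetical app name
    # enable executor-side UDF memory profiling
    .config("spark.python.profile.memory", "true")
    .getOrCreate()
)

@pandas_udf("long")
def add_one(s):
    return s + 1

# run the UDF so the workers collect memory samples
spark.range(10).select(add_one("id")).collect()

# print per-line memory usage for each profiled UDF
spark.sparkContext.show_profiles()
```

Since the configuration is read at session startup, it can equally be passed via `--conf spark.python.profile.memory=true` on `spark-submit`.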
| Subtask | Status |
| --- | --- |
| Install memory-profiler in the CI | Resolved |
| Document debugging with PySpark memory profiler | Resolved |
| Skip MemoryProfilerTests when pandas is not installed | Resolved |