Spark / SPARK-34689

Spark Thrift Server: Memory leak for SparkSession objects


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.1, 3.1.1
    • Fix Version/s: None
    • Component/s: Spark Core, SQL
    • Labels: None

    Description

      When running the Spark Thrift Server (3.0.1, standalone cluster), we have noticed that each new JDBC connection creates a new SparkSession object. This object (and everything it references), however, remains in memory indefinitely even after the JDBC connection is closed, and full GCs do not reclaim it. After about 18 hours of heavy use, we see more than 46,000 such objects (see heap_sparksession.png).
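
      As a note on methodology: the counts above come from heap inspection. A quick way to watch the leak grow on the live driver is a class histogram taken with the JDK's jps/jmap tools. A minimal sketch (the process lookup and grep pattern are illustrative; -histo:live forces a full GC first, so any surviving instances are genuinely live):

      # Count live SparkSession instances on the Thrift Server driver heap.
      THRIFT_PID=$(jps -lm | grep HiveThriftServer2 | awk '{print $1}')
      jmap -histo:live "$THRIFT_PID" | grep 'org.apache.spark.sql.SparkSession'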

      In a small local test installation, I replicated the behavior by simply opening a JDBC connection, executing SHOW SCHEMAS, and closing the connection (see heapdump_local_attempt_250_closed_connections.png). For each connection, a new SparkSession object is created and never removed. I have observed the same behavior on Spark 3.1.1 as well.
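
      For reference, this repro can be scripted with the beeline client bundled with Spark; the sketch below assumes the Thrift Server listens on the default port 10000 with no authentication, and the host and iteration count are illustrative:

      # Open, query and close 250 JDBC connections; each iteration leaves
      # one SparkSession behind on the driver heap.
      for i in $(seq 1 250); do
        "$SPARK_HOME"/bin/beeline -u jdbc:hive2://localhost:10000 -e 'SHOW SCHEMAS;'
      done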

      Our settings are as follows. Note that this was occurring even before we added the ExplicitGCInvokesConcurrent option (i.e., it happened even when a full GC was performed every 20 minutes).

      spark-defaults.conf:

      spark.master                    spark://...:7077,...:7077
      spark.master.rest.enabled       true
      spark.eventLog.enabled          false
      spark.eventLog.dir              file:///...
      
      spark.driver.cores             1
      spark.driver.maxResultSize     4g
      spark.driver.memory            5g
      spark.executor.memory          1g
      
      spark.executor.logs.rolling.maxRetainedFiles   2
      spark.executor.logs.rolling.strategy           size
      spark.executor.logs.rolling.maxSize            1G
      
      spark.local.dir ...
      
      spark.sql.ui.retainedExecutions=10
      spark.ui.retainedDeadExecutors=10
      spark.worker.ui.retainedExecutors=10
      spark.worker.ui.retainedDrivers=10
      spark.ui.retainedJobs=30
      spark.ui.retainedStages=100
      spark.ui.retainedTasks=500
      spark.appStateStore.asyncTracking.enable=false
      
      spark.sql.shuffle.partitions=200
      spark.default.parallelism=200
      spark.task.reaper.enabled=true
      spark.task.reaper.threadDump=false
      
      spark.memory.offHeap.enabled=true
      spark.memory.offHeap.size=4g
      

      spark-env.sh:

      HADOOP_CONF_DIR="/.../hadoop/etc/hadoop"
      
      SPARK_WORKER_CORES=28
      SPARK_WORKER_MEMORY=54g
      
      SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=172800 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=40 "
      
      SPARK_DAEMON_JAVA_OPTS="-Dlog4j.configuration=file:///.../log4j.properties -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.dir="..." -Dspark.deploy.zookeeper.url=...:2181,...:2181,...:2181 -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=40"
      

      start-thriftserver.sh:

      export SPARK_DAEMON_MEMORY=16g
      
      exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        --conf "spark.ui.retainedJobs=30" \
        --conf "spark.ui.retainedStages=100" \
        --conf "spark.ui.retainedTasks=500" \
        --conf "spark.sql.ui.retainedExecutions=10" \
        --conf "spark.appStateStore.asyncTracking.enable=false" \
        --conf "spark.cleaner.periodicGC.interval=20min" \
        --conf "spark.sql.autoBroadcastJoinThreshold=-1" \
        --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
        --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/.../thrift_driver_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=7 -XX:GCLogFileSize=35M -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=11990 -XX:+ExplicitGCInvokesConcurrent" \
        --conf "spark.metrics.namespace=..." --name "..." --packages io.delta:delta-core_2.12:0.7.0 --hiveconf spark.ui.port=4038 --hiveconf spark.cores.max=22 --hiveconf spark.executor.cores=3 --hiveconf spark.executor.memory=6144M --hiveconf spark.scheduler.mode=FAIR --hiveconf spark.scheduler.allocation.file=.../conf/thrift-scheduler.xml \
        --conf spark.sql.thriftServer.incrementalCollect=true "$@"
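
      For completeness, heap dumps like those in the attached screenshots can be captured from the running driver with the JDK's jmap and analyzed offline; the PID lookup and output path here are illustrative:

      # Dump live objects only (this also triggers a full GC first).
      THRIFT_PID=$(jps -lm | grep HiveThriftServer2 | awk '{print $1}')
      jmap -dump:live,format=b,file=/tmp/thrift_driver_heap.hprof "$THRIFT_PID"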
      

      Attachments

        1. heap_sparksession.png (154 kB, Dimitris Batis)
        2. heapdump_local_attempt_250_closed_connections.png (525 kB, Dimitris Batis)
        3. test_patch.diff (2 kB, Dimitris Batis)

            People

              Assignee: Unassigned
              Reporter: Dimitris Batis (dbatis)
              Votes: 0
              Watchers: 1
