[SPARK-29055] Spark UI storage memory increasing overtime - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.3
Fix Version/s: 2.4.5, 3.0.0
Component/s: Block Manager, Spark Core
Labels: None
Flags: Important

Description

I was using Spark 2.1.1 and upgraded to newer versions. Starting with Spark 2.3.3, I observed in the Spark UI that the driver memory increases continuously.

In more detail, the driver and the executors report the same amount of used storage memory, and that amount grows after each iteration. The behavior can be reproduced with the snippet in the HOW TO REPRODUCE section. The example is very simple and does not persist any dataframe, yet memory consumption is not stable as it was in earlier Spark versions (specifically, up to Spark 2.3.2).

I also tested with the Spark Streaming and Structured Streaming APIs and saw the same behavior. I tested with an existing application that reads from a Kafka source, performs some aggregations, persists dataframes and then unpersists them. Persist and unpersist work correctly: I see the dataframes in the Storage tab of the Spark UI, and after the unpersist all dataframes have been removed. However, after the unpersist the executors' storage memory is not zero but has the same value as the driver's. This behavior also affects application performance, because the executors' memory grows along with the driver's, and after a while the persisted dataframes no longer fit in executor memory and spill to disk.

After a long run I also hit java.lang.OutOfMemoryError: GC overhead limit exceeded, but I don't know whether it is related to the behavior above.

HOW TO REPRODUCE THIS BEHAVIOR:

Create a very simple application (streaming count_file.py) to reproduce this behavior. The application reads CSV files from a directory, counts the rows and then removes the processed files.

import os
import time

from pyspark.sql import SparkSession

# Directory containing the CSV files to count (path elided in the report).
target_dir = "..."

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

# Loop forever, reading and counting every file found in target_dir.
while True:
    for f in os.listdir(target_dir):
        # os.path.join works whether or not target_dir ends with a separator.
        df = spark.read.load(os.path.join(target_dir, f), format="csv")
        print("Number of records: {0}".format(df.count()))
        time.sleep(15)

Submit code:

spark-submit \
--master spark://xxx.xxx.xx.xxx \
--deploy-mode client \
--executor-memory 4g \
--executor-cores 3 \
streaming count_file.py
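While the application is running, the Storage Memory figures shown in the UI can also be sampled programmatically, which makes the growth easier to record over many iterations. The following is a minimal sketch, assuming Python 3 and that the driver's monitoring REST API is reachable; the base URL, application id, and helper names (`fetch_executors`, `storage_memory_by_executor`, `is_growing`) are hypothetical and not part of the original report.

```python
import json
import urllib.request


def fetch_executors(base_url, app_id):
    """Fetch the executor list from the Spark monitoring REST API.

    base_url (e.g. the driver UI address) and app_id are assumptions
    supplied by the caller.
    """
    url = "{0}/api/v1/applications/{1}/executors".format(base_url, app_id)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def storage_memory_by_executor(executors):
    """Map executor id -> (memoryUsed, maxMemory) in bytes.

    The driver is reported under the id "driver".
    """
    return {e["id"]: (e["memoryUsed"], e["maxMemory"]) for e in executors}


def is_growing(samples):
    """Return True if each sample is strictly larger than the previous one."""
    return len(samples) > 1 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )
```

Calling `storage_memory_by_executor(fetch_executors(...))` once per iteration and passing the collected `memoryUsed` values for one executor to `is_growing` gives a rough check of whether the storage memory ever drops between iterations.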

TESTED CASES WITH THE SAME BEHAVIOUR:

Default settings (spark-defaults.conf)
spark.cleaner.periodicGC.interval set to 1min (or less)
spark.cleaner.referenceTracking.blocking set to false
Application run in cluster mode
Executor and driver resources increased/decreased
extraJavaOptions on driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
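The settings listed above could be passed to spark-submit as `--conf` pairs. The sketch below only builds the argument list; the helper name `build_submit_command` and the application path in the usage note are hypothetical, and the report does not state whether these settings were tried together or one at a time, so combining them here is an assumption made for brevity.

```python
# JVM flags quoted from the report.
GC_OPTS = (
    "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12"
)

# Settings named in the report; applying all of them at once is an assumption.
TESTED_CONF = {
    "spark.cleaner.periodicGC.interval": "1min",
    "spark.cleaner.referenceTracking.blocking": "false",
    "spark.driver.extraJavaOptions": GC_OPTS,
    "spark.executor.extraJavaOptions": GC_OPTS,
}


def build_submit_command(app, master, conf):
    """Build a spark-submit argument list with one --conf pair per setting."""
    cmd = ["spark-submit", "--master", master]
    for key in sorted(conf):
        cmd += ["--conf", "{0}={1}".format(key, conf[key])]
    cmd.append(app)
    return cmd
```

For example, `build_submit_command("app.py", "spark://xxx.xxx.xx.xxx", TESTED_CONF)` returns a list that can be handed to `subprocess.call`, which avoids shell quoting problems with the space-separated JVM flags.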

DEPENDENCIES

Operating system: Ubuntu 16.04.3 LTS
Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
Python: Python 2.7.12

NOTE: In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and the memory decreased after ContextCleaner and BlockManager ran.

Attachments

test_csvs.zip, 16/Sep/19 12:58, 1.29 MB, George Papa

Issue Links

is cloned by

SPARK-29321 Possible memory leak in Spark (Resolved)

supersedes

SPARK-27648 In Spark2.4 Structured Streaming：The executor storage memory increasing over time (Resolved)
SPARK-29301 Removing block is not reflected to the driver/executor's storage memory (Resolved)

links to

GitHub Pull Request #25973

People

Assignee: Jungtaek Lim

Reporter: George Papa

Votes: 1

Watchers: 8

Dates

Created: 11/Sep/19 13:13

Updated: 16/Mar/20 07:24

Resolved: 01/Oct/19 16:49