[SPARK-30462] Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.3, 2.4.4, 3.0.0
Fix Version/s: 3.1.0
Component/s: Structured Streaming
Labels: None

Description

Hi,

With the current implementation of Spark Structured Streaming, it does not seem possible to keep a stream running continuously while writing millions of files without increasing the Spark driver's memory to dozens of GB.

In our scenario we use Spark Structured Streaming to consume messages from a Kafka cluster, transform them, and write them as compressed Parquet files to an S3 object store service.
Every 30 seconds a new streaming batch writes hundreds of objects, which over time adds up to millions of objects in S3.
Because every written object is recorded in _spark_metadata, the compact files there grow to gigabytes in size, eventually filling the Spark driver's memory and causing OOM errors.
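
To give a feel for how quickly the log grows, here is a rough back-of-the-envelope sketch in Python. Only the 30-second batch interval comes from the description above; the figures of 300 files per batch and 200 bytes per log entry are illustrative assumptions, and the function names are hypothetical.

```python
def estimate_sink_log_entries(files_per_batch: int, batch_interval_s: int, days: int) -> int:
    """Number of file entries accumulated in _spark_metadata after `days` of running."""
    batches = days * 24 * 3600 // batch_interval_s
    return batches * files_per_batch


def estimate_compact_size_bytes(entries: int, bytes_per_entry: int = 200) -> int:
    """Approximate size of one compact file that lists every entry so far.

    bytes_per_entry is an assumed average size of one serialized log line.
    """
    return entries * bytes_per_entry


if __name__ == "__main__":
    # Assumed workload: 300 files per batch, one batch every 30 s, 30 days of uptime.
    entries = estimate_sink_log_entries(files_per_batch=300, batch_interval_s=30, days=30)
    size = estimate_compact_size_bytes(entries)
    print(entries)                  # 25920000
    print(f"{size / 1e9:.1f} GB")   # 5.2 GB
```

Under these assumptions a single month of uptime already produces a compact file of several gigabytes, which matches the order of magnitude reported here.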

We need a way to configure Spark Structured Streaming so that it runs without loading all of the historically accumulated metadata into memory.
Regularly resetting the _spark_metadata and checkpoint folders is not an option in our use case, because we use the information in _spark_metadata as a register of the written objects for faster querying and searching.
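
For illustration, the kind of register lookup described above can be done without holding the whole log in memory. This is a minimal sketch in plain Python, not the fix discussed in this issue. It assumes the file sink log layout as commonly observed in Spark 2.4: a version line such as `v1`, followed by one JSON object per line with fields like `path`, `size` and `action`. The bucket name, sample entries and function names are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO

# Hypothetical sample of a file sink log file (assumed layout, see lead-in).
SAMPLE_LOG = (
    "v1\n"
    '{"path":"s3a://example-bucket/out/part-0000.parquet","size":1024,'
    '"isDir":false,"modificationTime":1578495420000,"blockReplication":1,'
    '"blockSize":33554432,"action":"add"}\n'
    '{"path":"s3a://example-bucket/out/part-0001.parquet","size":2048,'
    '"isDir":false,"modificationTime":1578495450000,"blockReplication":1,'
    '"blockSize":33554432,"action":"add"}\n'
)


def iter_sink_entries(log: TextIO) -> Iterator[dict]:
    """Yield log entries one at a time instead of materializing the full list."""
    version = log.readline().strip()
    if not version.startswith("v"):
        raise ValueError(f"unexpected log header: {version!r}")
    for line in log:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def total_added_bytes(log: TextIO) -> int:
    """Sum the sizes of all entries marked as added, using constant memory."""
    return sum(e["size"] for e in iter_sink_entries(log) if e.get("action") == "add")


if __name__ == "__main__":
    for entry in iter_sink_entries(io.StringIO(SAMPLE_LOG)):
        print(entry["path"], entry["size"])
    print(total_added_bytes(io.StringIO(SAMPLE_LOG)))  # 3072
```

Because the generator processes one line at a time, memory use stays flat regardless of how many objects the log lists.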

Issue Links

relates to

SPARK-24295 Purge Structured streaming FileStreamSinkLog metadata compact file data. (In Progress)

SPARK-29995 Structured Streaming file-sink log grow indefinitely (Resolved)

SPARK-27188 FileStreamSink: provide a new option to have retention on output files (Resolved)

links to

[Github] Pull Request #28904 (HeartSaVioR)

People

Assignee: Jungtaek Lim

Reporter: Vladimir Yankov

Votes: 4

Watchers: 12

Dates

Created: 08/Jan/20 14:57

Updated: 20/Aug/20 09:31

Resolved: 20/Aug/20 09:31