Details
- Type: Bug
- Status: Open
- Priority: Blocker
- Resolution: Unresolved
- Affects Version/s: 3.4.0, 3.5.0
- Fix Version/s: None
Description:
- We have a Spark cluster deployed on a Kubernetes (k8s) cluster, with one driver and 120 executors.
- The batch duration is configured to 30 seconds.
- The cluster reads from a 120-partition Kafka topic and writes to an hourly Elasticsearch index (a minimal sketch of this pipeline follows this list).
- Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
- The configuration of the driver StatefulSet (STS) is in the Appendix.
- The driver is observed restarting periodically. The restarts do not always occur, but when they do, they repeat every 10 minutes.
- The restart frequency increases as throughput increases.
- When the restarts happen we see OptionalDataException; the attached “logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log” is the log from one such driver restart.
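For orientation, below is a minimal sketch of the pipeline as described above, assuming Structured Streaming and the elasticsearch-spark (es-hadoop) sink; the 30-second batch duration is modeled as a processing-time trigger. The endpoints, topic name, index pattern, and checkpoint path are hypothetical placeholders, not values from the actual job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingCoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cveshv-events-streaming-core")
      .getOrCreate()

    // 120-partition topic: each 30 s micro-batch yields 120 Kafka read
    // tasks, one per partition, matching the 120 executors.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical address
      .option("subscribe", "events")                   // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS doc", "timestamp")

    // 30-second micro-batches written to an hourly Elasticsearch index;
    // the {field|format} pattern asks the es-hadoop connector to resolve
    // the hourly index from each record's timestamp field.
    val query = events.writeStream
      .format("es")
      .option("es.nodes", "elasticsearch:9200")          // hypothetical
      .option("checkpointLocation", "/checkpoints/core") // hypothetical
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start("events-{timestamp|yyyy-MM-dd-HH}")

    query.awaitTermination()
  }
}
```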
Analysis:
- We ran a test at 250K records/second, and processing time per batch was healthy, between 15 and 20 seconds.
- We were able to avoid all restarts simply by disabling the liveness checks.
- This resulted in NO RESTARTS of Streaming Core. We tried the above with two scenarios (their configuration difference is sketched after this list):
- Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. The main error observed during this period was “The executor with id 7 exited with exit code 50 (Uncaught exception).”. Logs at WARN and TRACE level were collected:
- WARN: Logs attached “cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
- TRACE: Logs attached “cveshv-events-streaming-TRACE (2).zip”
- Speculation enabled --> The batch duration increased to minutes (a large lag) only after around 2 hours; the related log is “cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
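The two scenarios above differ only in the speculation flag. A sketch of the corresponding configuration, with the app name and executor count inferred from the description rather than taken from the job's actual config:

```scala
import org.apache.spark.SparkConf

object ScenarioConfSketch {
  // Shared base configuration for both test scenarios (hypothetical;
  // values inferred from the description above).
  def baseConf(): SparkConf = new SparkConf()
    .setAppName("cveshv-events-streaming-core")
    .set("spark.executor.instances", "120") // one executor per Kafka partition

  // Scenario 1: speculation disabled (slowdown after 10 to 20 minutes).
  val speculationDisabled = baseConf().set("spark.speculation", "false")

  // Scenario 2: speculation enabled (slowdown only after ~2 hours).
  val speculationEnabled = baseConf().set("spark.speculation", "true")
}
```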
Conclusion:
- The liveness check is failing and thus causing the restarts.
- The logs indicate that there are unhandled exceptions on the executors (see the sketch after this list for how these surface as exit code 50).
- The issue may also lie elsewhere. The liveness check that was disabled, which was initially causing the restarts every 10 minutes after 3 failed checks, is included below in the Appendix.
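On the “exit code 50 (Uncaught exception)” errors: in Spark, exit code 50 is SparkExitCode.UNCAUGHT_EXCEPTION, and executors install an uncaught-exception handler so that any Throwable escaping a task thread terminates the executor JVM with that code. Below is a minimal standalone illustration of that mechanism, a sketch rather than Spark's actual handler:

```scala
object Exit50Sketch {
  // Mirrors org.apache.spark.util.SparkExitCode.UNCAUGHT_EXCEPTION.
  private val UncaughtException = 50

  def main(args: Array[String]): Unit = {
    // Executors register a default handler much like this, so an unhandled
    // Throwable on any thread ends the whole JVM with exit code 50, which
    // is the code the driver then reports for the lost executor.
    Thread.setDefaultUncaughtExceptionHandler { (t: Thread, e: Throwable) =>
      System.err.println(s"Uncaught exception in thread ${t.getName}: $e")
      sys.exit(UncaughtException)
    }
    throw new RuntimeException("simulated unhandled task failure")
  }
}
```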
Next Action:
- Please help us identify the root cause (RC) of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.
Appendix: