Details
- Type: Bug
- Status: Open
- Priority: Blocker
- Resolution: Unresolved
- Affects Version/s: 3.4.0, 3.5.0
- Fix Version/s: None
Description:
- We have a Spark cluster deployed on a Kubernetes (k8s) cluster, with one driver and 120 executors.
- The batch duration is configured to 30 seconds.
- The cluster reads from a 120-partition Kafka topic and writes to an hourly Elasticsearch index (a minimal sketch of this pipeline follows this list).
- Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
- The configuration of the driver StatefulSet (STS) is in the Appendix.
- The driver is observed restarting periodically. The restarts do not always occur, but when they do, they repeat every 10 minutes.
- The restart frequency increases as throughput increases.
- When the restarts happen we see OptionalDataException; the attached “logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log” is the log from one such driver restart.
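For orientation, below is a minimal sketch of the pipeline as described above, assuming Structured Streaming and the elasticsearch-spark (es-hadoop) sink; the 30-second batch duration is modeled as a processing-time trigger. The endpoints, topic name, index pattern, and checkpoint path are hypothetical placeholders, not values from the actual job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingCoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cveshv-events-streaming-core")
      .getOrCreate()

    // 120-partition topic: each 30 s micro-batch yields 120 Kafka read
    // tasks, one per partition, matching the 120 executors.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical address
      .option("subscribe", "events")                   // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS doc", "timestamp")

    // 30-second micro-batches written to an hourly Elasticsearch index;
    // the {field|format} pattern asks the es-hadoop connector to resolve
    // the hourly index from each record's timestamp field.
    val query = events.writeStream
      .format("es")
      .option("es.nodes", "elasticsearch:9200")          // hypothetical
      .option("checkpointLocation", "/checkpoints/core") // hypothetical
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start("events-{timestamp|yyyy-MM-dd-HH}")

    query.awaitTermination()
  }
}
```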
Analysis:
- We ran a test at 250K records/second, and processing time per batch was healthy, between 15 and 20 seconds.
- We were able to avoid all restarts simply by disabling the liveness checks.
- This resulted in NO RESTARTS of Streaming Core. We tried the above with two scenarios (their configuration difference is sketched after this list):
- Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. The main error observed during this period was “The executor with id 7 exited with exit code 50 (Uncaught exception).”. Logs at WARN and TRACE level were collected:
- WARN: Logs attached “cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
- TRACE: Logs attached “cveshv-events-streaming-TRACE (2).zip”
- Speculation enabled --> The batch duration increased to minutes (a large lag) only after around 2 hours; the related log is “cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
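The two scenarios above differ only in the speculation flag. A sketch of the corresponding configuration, with the app name and executor count inferred from the description rather than taken from the job's actual config:

```scala
import org.apache.spark.SparkConf

object ScenarioConfSketch {
  // Shared base configuration for both test scenarios (hypothetical;
  // values inferred from the description above).
  def baseConf(): SparkConf = new SparkConf()
    .setAppName("cveshv-events-streaming-core")
    .set("spark.executor.instances", "120") // one executor per Kafka partition

  // Scenario 1: speculation disabled (slowdown after 10 to 20 minutes).
  val speculationDisabled = baseConf().set("spark.speculation", "false")

  // Scenario 2: speculation enabled (slowdown only after ~2 hours).
  val speculationEnabled = baseConf().set("spark.speculation", "true")
}
```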
Conclusion:
- The liveness check is failing and thus causing the restarts.
- The logs indicate that there are unhandled exceptions on the executors (see the sketch after this list for how these surface as exit code 50).
- The issue may also lie elsewhere. The liveness check that was disabled, which was initially causing the restarts every 10 minutes after 3 failed checks, is included below in the Appendix.
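On the “exit code 50 (Uncaught exception)” errors: in Spark, exit code 50 is SparkExitCode.UNCAUGHT_EXCEPTION, and executors install an uncaught-exception handler so that any Throwable escaping a task thread terminates the executor JVM with that code. Below is a minimal standalone illustration of that mechanism, a sketch rather than Spark's actual handler:

```scala
object Exit50Sketch {
  // Mirrors org.apache.spark.util.SparkExitCode.UNCAUGHT_EXCEPTION.
  private val UncaughtException = 50

  def main(args: Array[String]): Unit = {
    // Executors register a default handler much like this, so an unhandled
    // Throwable on any thread ends the whole JVM with exit code 50, which
    // is the code the driver then reports for the lost executor.
    Thread.setDefaultUncaughtExceptionHandler { (t: Thread, e: Throwable) =>
      System.err.println(s"Uncaught exception in thread ${t.getName}: $e")
      sys.exit(UncaughtException)
    }
    throw new RuntimeException("simulated unhandled task failure")
  }
}
```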
Next Action:
- Please help us identify the root cause (RC) of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.
Appendix: